Mathematicians devised novel problems to challenge advanced AIs' reasoning skills, and the models failed almost every test

The researchers tested six state-of-the-art AI models against the new benchmark and the best score registered by a single system was 2%. (Image credit: hh5800/Getty Images)

Mathematicians have stumped the most advanced generative artificial intelligence (AI) models with a series of mind-bending new math problems.

These problems typically take doctorate-level mathematicians hours to days to solve, according to the research institute Epoch AI. But in the new tests, the most advanced AI models on the market got correct answers on less than 2% of these problems.

In the past decade, a number of AI tests have been developed to determine whether the answers these models return are actually correct. In many cases, AI models now breeze through these benchmarks.

For example, on the commonly used Massive Multitask Language Understanding (MMLU) benchmark, today's AI models answer 98% of math problems correctly.

Most of these benchmarks are geared toward testing AI's ability to do high-school and college-level math, Elliot Glazer, a mathematician at Epoch AI, and colleagues wrote in a new paper posted on the preprint database arXiv. (The paper has not yet been peer-reviewed or published in a scientific journal.)

The new set of benchmarks, called FrontierMath, aims for a higher level of reasoning. Epoch AI developed the questions with the help of mathematics professors, including some winners of the Fields Medal, perhaps the most prestigious prize in math. The problems cover a wide range of subfields, from number theory to algebraic geometry, and are available on Epoch AI's website.

"These are extremely challenging," 2006 Fields Medal winner Terence Tao, a mathematician at UCLA, wrote in a review of the problems for Epoch AI. "I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages."

The problems were also unique — a step taken to ensure that none of the problems were already in the AI models' training data. When complex reasoning problems are included in the training data, the AI may appear to solve the problems, but in reality, it already has a "cheat sheet," since it has been trained on the answers.

The researchers tested six state-of-the-art AI models: Google's Gemini 1.5 Pro (002), Anthropic's Claude 3.5 Sonnet, OpenAI's o1-preview, o1-mini and GPT-4o, and xAI's Grok-2 Beta. Gemini and Claude managed to solve 2% of the problems, which was just slightly better than the 1% solved by o1-preview, o1-mini and GPT-4o. Grok-2 Beta failed to get any problems right.

However, these rankings are misleading because the low success rate means that a single right answer can have an outsize impact on each model's overall score, the researchers cautioned.
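The arithmetic behind that caution can be sketched in a few lines of Python. The benchmark size used here, 100 problems, is a hypothetical round number chosen only for illustration and is not a figure from the paper; the function name is likewise made up for this example.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Return the share of problems answered correctly, as a percentage."""
    return 100.0 * correct / total


# Hypothetical benchmark size, assumed for illustration only (not from the paper).
TOTAL_PROBLEMS = 100

# Scores for a model that solves 0, 1 or 2 problems out of the assumed total.
scores = {k: accuracy_percent(k, TOTAL_PROBLEMS) for k in range(3)}

for solved, pct in scores.items():
    print(f"{solved} solved -> {pct:.1f}%")
```

Under that assumption, the entire gap between a 1% model and a 2% model is a single problem, so one lucky guess could be enough to reorder the rankings.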


"[E]ven when a model obtained the correct answer, this does not mean that its reasoning was correct," the paper authors wrote. "For instance, on one of these problems running a few simple simulations was sufficient to make accurate guesses without any deeper mathematical understanding. However, models' low overall accuracy shows that such guessing strategies do not work on the overwhelming majority of FrontierMath problems."

The findings show that right now, AI models don't possess research-level math reasoning, Epoch AI's collaborators concluded. However, as AI models advance, these benchmark tests will provide a way to find out if their reasoning abilities are deepening.

"By regularly evaluating state-of-the-art models and collaborating with the AI research community," the team wrote in a statement, "we aim to deepen our understanding of AI’s capabilities and limitations."

Stephanie Pappas is a contributing writer for Live Science, covering topics ranging from geoscience to archaeology to the human brain and behavior. She was previously a senior writer for Live Science but is now a freelancer based in Denver, Colorado, and regularly contributes to Scientific American and The Monitor, the monthly magazine of the American Psychological Association. Stephanie received a bachelor's degree in psychology from the University of South Carolina and a graduate certificate in science communication from the University of California, Santa Cruz.
