AI Math Benchmarks: AI's Growing Capabilities

Mathematics is often regarded as an ideal domain for measuring AI progress. Its step-by-step logic is easy to track, and its definitive, automatically verifiable answers remove subjective human judgment from grading. But AI systems are improving at such a pace that math benchmarks are struggling to keep up.

Back in November 2024, the nonprofit research organization Epoch AI quietly released Frontier Math, a standardized, rigorous benchmark designed to measure the mathematical reasoning capabilities of the latest AI tools.

“It’s a bunch of really hard math problems,” explains Greg Burnham, Epoch AI Senior Researcher. “Originally, it was 300 problems that we now call tiers 1–3, but having seen AI capabilities really speed up, there was a feeling that we had to run to stay ahead, so now there’s a special challenge set of extra carefully constructed problems that we call tier 4.”

To a rough approximation, tiers 1–4 range from advanced undergraduate to early postdoc-level mathematics. When Frontier Math was introduced, state-of-the-art AI models could solve no more than 2% of its problems. Today, the best publicly available AI models, such as ChatGPT 5.2 Pro and Claude Opus 4.6, solve over 40% of the 300 problems in tiers 1–3 and over 30% of the 50 tier 4 problems.

AI takes on PhD level mathematics

And this dizzying pace of advancement is showing no signs of abating. For example, Google DeepMind recently announced that Aletheia, an experimental AI system derived from Gemini Deep Think, achieved publishable, PhD-level research results. Though mathematically obscure (it calculated certain structure constants in arithmetic geometry called eigenweights), the result is significant in terms of AI development.

“They’re claiming it was essentially autonomous, meaning a human wasn’t guiding the work, and it’s publishable,” Burnham says. “It’s definitely at the lower end of the spectrum of work that would get a mathematician excited, but it’s new—it’s something we truly haven’t really seen before.”

To place this achievement in context, every Frontier Math problem has a known answer that a human has derived. Aletheia’s result did not: though a human could probably have achieved it “if they sat down and steeled themselves for a week,” says Burnham, no human had ever done so.

Aletheia’s results and other recent achievements by AI mathematicians suggest that new, tougher benchmarks are needed, and quickly, to understand AI capabilities, because existing ones will soon become irrelevant. “There are easier math benchmarks that are already obsolete, several generations of them,” says Burnham. “Frontier Math will probably saturate [meaning state-of-the-art AI models score 100%] within the next two years; could be faster.”

The First Proof challenge

To begin to address this problem, on February 6 a group of 11 highly distinguished mathematicians proposed the First Proof challenge: a set of 10 extremely difficult math questions that arose naturally in the authors’ own research. Each has a proof of roughly five pages or fewer that had not been shared with anyone. The challenge was a preliminary effort to assess how well AI systems can solve research-level math questions on their own.

The challenge generated serious buzz in the math community, and professional and amateur mathematicians, along with teams including OpenAI, all stepped up. But by the time the authors posted the proofs on February 14, no one had submitted correct solutions to all 10 problems.

In fact, far from it. The authors themselves solved only two of the 10 problems using Gemini 3.0 Deep Think and ChatGPT 5.2 Pro. Most outside submissions fared little better, apart from OpenAI’s. With “limited human supervision,” OpenAI’s most advanced internal AI system solved five of the 10 problems, a result that drew reactions ranging from awe to disappointment across the mathematics community. The team behind First Proof plans an even tougher second round on March 14.

A new frontier for AI

“I think First Proof is terrific: it’s as close as you could realistically get to putting an AI system in the shoes of a mathematician,” says Burnham. Though he admires how First Proof tests AI’s mathematical utility for a wide range of mathematics and mathematicians, Epoch AI has its own new approach to testing, called Frontier Math: Open Problems. The pilot benchmark is unusual in that it consists of 14 open problems (with more to follow) from research mathematics that professional mathematicians have tried and failed to solve. Since Open Problems’ release on January 27, none has been solved by an AI.

“With Open Problems, we’ve tried to make it more challenging,” says Burnham. “The baseline on its own would be publishable, at least in a specialty journal.” What’s more, each question is designed so that it can be automatically graded. “This is a bit counterintuitive,” Burnham adds. “No one knows the answers, but we have a computer program that will be able to judge whether the answer is right or not.”
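How can a program grade an answer that nobody knows? One way to picture it is that checking a proposed solution can be far easier than finding one. The Python sketch below is a toy illustration of that idea only, not Epoch AI’s grader: the choice of problem (Golomb rulers, sets of integer marks whose pairwise differences are all distinct), the function names, and the `best_known_length` baseline are all assumptions made for this example.

```python
from itertools import combinations


def is_golomb_ruler(marks):
    """Return True if every pairwise difference between marks is distinct."""
    diffs = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(diffs) == len(set(diffs))


def grade(marks, n_marks, best_known_length):
    """Hypothetical grader: accept a submission only if it is a valid ruler
    with exactly n_marks distinct marks and is strictly shorter than the
    best known ruler. The grader never needs to know the optimal answer."""
    if len(marks) != n_marks or len(set(marks)) != n_marks:
        return False
    if not is_golomb_ruler(marks):
        return False
    return max(marks) - min(marks) < best_known_length


# Toy numbers, chosen only to exercise the checker.
print(grade([0, 1, 4, 6], n_marks=4, best_known_length=7))  # valid and shorter
print(grade([0, 1, 2, 4], n_marks=4, best_known_length=7))  # repeated difference
```

The checker runs in a fraction of a second on any submission, while finding a ruler that beats a record is the hard part. A grader built along these lines can confirm a correct answer to an open question without anyone having known that answer in advance.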

Burnham sees First Proof and Open Problems as complementary. “I would say understanding AI capabilities is a more-the-merrier situation,” he adds. “AI has gotten to the point where it’s, in some ways, better than most PhD students, so we need to pose problems where the answer would be at least moderately interesting to some human mathematicians, not because AI was doing it, but because it’s mathematics that human mathematicians care about.”
