Benchmark Example Math

AI’s math problem: FrontierMath benchmark shows how far technology still has to go

Artificial intelligence systems may be good ...

Unite.AI

From Math Exams to Machine Reasoning: AI’s Latest Struggles

Recently, Artificial Intelligence (AI) has reached a historic milestone in one of the world's toughest math contests, the International Mathematical Olympiad (IMO). Google DeepMind’s Gemini Deep Think ...

Ars Technica

New secret math benchmark stumps AI models and PhDs alike

On Friday, research organization Epoch AI released FrontierMath, a new mathematics benchmark that has been turning heads in the AI world because it contains hundreds of expert-level problems that ...

Bleeping Computer

Grok 4 benchmark results: Tops math, ranks second in coding

Grok 4 is a huge leap from Grok 3, but how good is it compared to other models in the market, such as Gemini 2.5 Pro? We now have answers, thanks to new independent benchmarks. LMArena.ai, which is an ...

MSN

AI models are already as good as experts at half of tasks, a new OpenAI benchmark suggests

Anthropic's Claude Opus 4.1 excelled at many professional tasks, especially those performed by clerks, software developers, and private investigators ...

TechSpot

Move over math and reasoning, it's time to benchmark AI using Super Mario Bros.

The big picture: Benchmarking AI remains a thorny issue, with companies often accused of cherry-picking flattering results while burying less favorable ones. Instead of fixating on math and logic ...

Ars Technica

New secret math benchmark stumps AI models and PhDs alike

I think of an AI as a script kiddie. A very good script kiddie, but nevertheless a basic script kiddie. If it hasn't seen the script for the answer, then it can't give the answer. In other words, an ...
