Scientists design new 'AGI benchmark' that indicates whether any future AI model could cause 'catastrophic harm'
OpenAI scientists have designed MLE-bench — a compilation of 75 extremely difficult tests that can assess whether a future advanced AI agent is capable of modifying its own code and improving itself.
Scientists have designed a new set of tests that measure whether artificial intelligence (AI) agents can modify their own code and improve their capabilities without human instruction.
The benchmark, dubbed "MLE-bench," is a compilation of 75 Kaggle tests, each one a challenge in machine learning engineering. This work involves training AI models, preparing datasets and running scientific experiments, and the Kaggle tests measure how well the resulting machine learning algorithms perform at specific tasks.
OpenAI scientists designed MLE-bench to measure how well AI models perform at "autonomous machine learning engineering" — which is among the hardest tests an AI can face. They outlined the details of the new benchmark Oct. 9 in a paper uploaded to the arXiv preprint database.
Any future AI that scores well on the 75 tests that comprise MLE-bench may be considered powerful enough to be an artificial general intelligence (AGI) system — a hypothetical AI that is much smarter than humans — the scientists said.
Each of the 75 MLE-bench tests holds real-world practical value. Examples include OpenVaccine — a challenge to find an mRNA vaccine for COVID-19 — and the Vesuvius Challenge for deciphering ancient scrolls.
If AI agents learn to perform machine learning research tasks autonomously, it could accelerate scientific progress in healthcare, climate science and other domains, the scientists wrote in the paper. But, if left unchecked, it could lead to unmitigated disaster.
"The capacity of agents to perform high-quality research could mark a transformative step in the economy. However, agents capable of performing open-ended ML research tasks, at the level of improving their own training code, could improve the capabilities of frontier models significantly faster than human researchers," the scientists wrote. "If innovations are produced faster than our ability to understand their impacts, we risk developing models capable of catastrophic harm or misuse without parallel developments in securing, aligning, and controlling such models."
They added that any model that could solve a "large fraction" of MLE-bench can likely execute many open-ended machine learning tasks by itself.
The scientists tested OpenAI's most powerful AI model to date, known as "o1." This AI model achieved at least the level of a Kaggle bronze medal on 16.9% of the 75 tests in MLE-bench, and that figure improved when o1 was given more attempts at the challenges.
Earning a bronze medal is the equivalent of finishing in the top 40% of human participants on the Kaggle leaderboard. OpenAI's o1 model achieved an average of seven gold medals on MLE-bench, two more than the five a human needs to be considered a "Kaggle Grandmaster." Only two humans have ever achieved medals in 75 different Kaggle competitions, the scientists wrote in the paper.
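The medal arithmetic above can be sketched in a few lines of Python. The figures (75 competitions, a 16.9% bronze-or-better rate, seven golds for o1, five golds for Grandmaster status) come from the article; the function and variable names are illustrative, not part of MLE-bench itself:

```python
# Back-of-the-envelope arithmetic for the reported MLE-bench results.
# All numbers are as reported in the article; names here are hypothetical.

TOTAL_COMPETITIONS = 75
BRONZE_RATE = 0.169          # o1 earned at least bronze on 16.9% of the tests
GOLDS_FOR_GRANDMASTER = 5    # Kaggle Grandmaster status requires five gold medals

def medal_count(rate: float, total: int = TOTAL_COMPETITIONS) -> int:
    """Convert a reported medal rate into an approximate competition count."""
    return round(rate * total)

bronze_or_better = medal_count(BRONZE_RATE)             # roughly 13 of 75 tests
o1_golds = 7
golds_above_threshold = o1_golds - GOLDS_FOR_GRANDMASTER  # 2 golds beyond Grandmaster

print(bronze_or_better, golds_above_threshold)
```

In other words, a 16.9% bronze-or-better rate corresponds to medaling on roughly 13 of the 75 competitions, and o1's average gold-medal count clears the Grandmaster bar with two golds to spare.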
The researchers are now open-sourcing MLE-bench to spur further research into the machine learning engineering capabilities of AI agents — essentially allowing other researchers to test their own AI models against MLE-bench. "Ultimately, we hope our work contributes to a deeper understanding of the capabilities of agents in autonomously executing ML engineering tasks, which is essential for the safe deployment of more powerful models in the future," they concluded.
Keumars is the technology editor at Live Science. He has written for a variety of publications including ITPro, The Week Digital, ComputerActive, The Independent, The Observer, Metro and TechRadar Pro. He has worked as a technology journalist for more than five years, having previously held the role of features editor with ITPro. He is an NCTJ-qualified journalist and has a degree in biomedical sciences from Queen Mary, University of London. He's also registered as a foundational chartered manager with the Chartered Management Institute (CMI), having qualified as a Level 3 Team leader with distinction in 2023.