AI can handle tasks twice as complex every few months. What does this exponential growth mean for how we use it?

AIs can outperform humans easily on short tasks, but longer ones are the true hurdle to overcome before we can deem them to be truly intelligent systems.

By Roland Moore-Colyer

Published 27 April 2025 In News

*A new benchmark for AI performance could give us an idea of when to expect true generalist AI agents.*

(Image credit: MASTER via Getty Images)

Scientists have devised a new way to measure how capable artificial intelligence (AI) systems are — how fast they can beat, or compete with, humans in challenging tasks.

While AIs can generally outperform humans in text prediction and knowledge tasks, when given more substantive projects to carry out, such as remote executive assistance, they are less effective.

"We find that measuring the length of tasks that models can complete is a helpful lens for understanding current AI capabilities. This makes sense: AI agents often seem to struggle with stringing together longer sequences of actions more than they lack skills or knowledge needed to solve single steps," the researchers from AI organization Model Evaluation & Threat Research (METR) explained in a blog post accompanying the study.

To conduct their study, the researchers took a variety of AI models, from Sonnet 3.7 and GPT-4 to Claude 3 Opus and older GPT models, and pitted them against a suite of tasks. These ranged from easy assignments that typically take humans a couple of minutes (like looking up a basic factual question on Wikipedia) to ones that take human experts multiple hours, such as complex programming tasks like writing CUDA kernels or fixing a subtle bug in PyTorch.

"Measuring AI against the length of time it takes a human to accomplish a given task is an interesting proxy metric for intelligence and general capabilities," Kazerounian said. "First, because there is no singular metric that captures what we mean when we say 'intelligence.' Second, because the likelihood of carrying out a prolonged task without drift or error becomes vanishingly small. Third, because it is a direct measure against the types of tasks we hope to make use of AI for; namely solving complex human problems. While it might not capture all the relevant factors or nuances about AI capabilities, it is certainly a useful datapoint," he added.


Roland Moore-Colyer is a freelance writer for Live Science and managing editor at consumer tech publication TechRadar, running the Mobile Computing vertical. At TechRadar, one of the U.K. and U.S.’ largest consumer technology websites, he focuses on smartphones and tablets. But beyond that, he taps into more than a decade of writing experience to bring people stories that cover electric vehicles (EVs), the evolution and practical use of artificial intelligence (AI), mixed reality products and use cases, and the evolution of computing both on a macro level and from a consumer angle.

