Older AI models show signs of cognitive decline, study shows — but not everyone is entirely convinced

Just like people, AI technologies like large language models (LLMs) and chatbots show signs of deteriorated cognitive abilities with age, a new study suggests. (Image credit: 3DSculptor/Getty Images)

People increasingly rely on artificial intelligence (AI) for medical diagnoses because of how quickly and efficiently these tools can spot anomalies and warning signs in medical histories, X-rays and other datasets before they become obvious to the naked eye.

But a new study published Dec. 20, 2024, in the BMJ raises concerns that AI technologies like large language models (LLMs) and chatbots, like people, show signs of deteriorated cognitive abilities with age.

"These findings challenge the assumption that artificial intelligence will soon replace human doctors," the study's authors wrote in the paper, "as the cognitive impairment evident in leading chatbots may affect their reliability in medical diagnostics and undermine patients' confidence."

Scientists tested publicly available LLM-driven chatbots including OpenAI's ChatGPT, Anthropic's Claude Sonnet and Alphabet's Gemini using the Montreal Cognitive Assessment (MoCA) test — a series of tasks neurologists use to test abilities in attention, memory, language, spatial skills and executive mental function.

Related: ChatGPT is truly awful at diagnosing medical conditions

MoCA is most commonly used to assess or test for the onset of cognitive impairment in conditions like Alzheimer's disease or dementia.

Subjects are given tasks like drawing a specific time on a clock face, starting at 100 and repeatedly subtracting seven, remembering as many words as possible from a spoken list, and so on. In humans, 26 out of 30 is considered a passing score (i.e., the subject shows no cognitive impairment).

While some aspects of testing like naming, attention, language and abstraction were seemingly easy for most of the LLMs used, they all performed poorly in visual/spatial skills and executive tasks, with several doing worse than others in areas like delayed recall.

Crucially, while the most recent version of ChatGPT (version 4) scored the highest (26 out of 30), the older Gemini 1.0 LLM scored only 16 — leading to the conclusion older LLMs show signs of cognitive decline.

Examining cognitive function in AI

The study's authors note that their findings are observational only — critical differences between the ways AI and the human mind work mean the experiment cannot constitute a direct comparison.

But they caution it might point to what they call a "significant area of weakness" that could put the brakes on the deployment of AI in clinical medicine. Specifically, they argued against using AI in tasks requiring visual abstraction and executive function.

Other scientists, however, remain unconvinced by the study and its findings, criticizing both its methods and its framing; they accuse the study's authors of anthropomorphizing AI by projecting human conditions onto it. Critics have also questioned the use of the MoCA, a test designed purely for humans that, they suggest, would not yield meaningful results when applied to other forms of intelligence.

"The MoCA was designed to assess human cognition, including visuospatial reasoning and self-orientation — faculties that do not align with the text-based architecture of LLMs," wrote Aya Awwad, research fellow at Mass General Hospital in Boston on Jan. 2, in a letter in response to the study. "One might reasonably ask: Why evaluate LLMs on these metrics at all? Their deficiencies in these areas are irrelevant to the roles they might fulfill in clinical settings — primarily tasks involving text processing, summarizing complex medical literature, and offering decision support."

Another major limitation, critics argue, is that the models were not tested more than once over time, so changes in cognitive function could not actually be measured. Testing models after significant updates would be more instructive and align far better with the article's hypothesis, wrote Aaron Sterling, CEO of EMR Data Cloud, and Roxana Daneshjou, assistant professor of biomedical sciences at Stanford, in a Jan. 13 letter.

Responding to the discussion, the study's lead author Roy Dayan, a doctor of medicine at the Hadassah Medical Center in Jerusalem, commented that many of the responses have taken the framing too literally. Because the study was published in the Christmas edition of the BMJ, the authors used humor to present their findings — including the pun "Age Against the Machine" — but intended the study to be taken seriously.

"We also hoped to cast a critical lens at recent research at the intersection of medicine and AI, some of which posits LLMs as fully-fledged substitutes for human physicians," wrote Dayan Jan. 10 in a letter in response to the study.

"By administering the standard tests used to assess human cognitive impairment, we tried to draw out the ways in which human cognition differs from how LLMs process and respond to information. This is also why we queried them as we would query humans, rather than via 'state-of-the-art prompting techniques', as Dr. Awwad suggests."

Recent updates

This article and its headline have been updated to include details of the skepticism expressed toward the study, as well as the response of the author to that criticism.

Drew Turney

Drew is a freelance science and technology journalist with 20 years of experience. After growing up knowing he wanted to change the world, he realized it was easier to write about other people changing it instead. As an expert in science and technology for decades, he’s written everything from reviews of the latest smartphones to deep dives into data centers, cloud computing, security, AI, mixed reality and everything in between.
