Older AI models show signs of cognitive decline, study shows — but not everyone is entirely convinced

Just like people, AI technologies like large language models (LLMs) and chatbots show signs of deteriorated cognitive abilities with age, a new study suggests. (Image credit: 3DSculptor/Getty Images)

People increasingly rely on artificial intelligence (AI) for medical diagnoses because of how quickly and efficiently these tools can spot anomalies and warning signs in medical histories, X-rays and other datasets before they become obvious to the naked eye.

But a new study published Dec. 20, 2024, in the BMJ raises concerns that AI technologies like large language models (LLMs) and chatbots, like people, show signs of deteriorated cognitive abilities with age.

"These findings challenge the assumption that artificial intelligence will soon replace human doctors," the study's authors wrote in the paper, "as the cognitive impairment evident in leading chatbots may affect their reliability in medical diagnostics and undermine patients' confidence."

Scientists tested publicly available LLM-driven chatbots including OpenAI's ChatGPT, Anthropic's Claude Sonnet and Alphabet's Gemini using the Montreal Cognitive Assessment (MoCA) test — a series of tasks neurologists use to test abilities in attention, memory, language, spatial skills and executive mental function.

Related: ChatGPT is truly awful at diagnosing medical conditions

MoCA is most commonly used to assess or test for the onset of cognitive impairment in conditions like Alzheimer's disease or dementia.

Subjects are given tasks like drawing a specific time on a clock face, starting at 100 and repeatedly subtracting seven, remembering as many words as possible from a spoken list, and so on. In humans, 26 out of 30 is considered a passing score (i.e., the subject shows no cognitive impairment).

While some aspects of testing like naming, attention, language and abstraction were seemingly easy for most of the LLMs used, they all performed poorly in visual/spatial skills and executive tasks, with several doing worse than others in areas like delayed recall.

Crucially, while the most recent version of ChatGPT (version 4) scored the highest (26 out of 30), the older Gemini 1.0 LLM scored only 16 — leading to the conclusion older LLMs show signs of cognitive decline.

Examining cognitive function in AI

The study's authors note that their findings are observational only — critical differences between the ways AI and the human mind work mean the experiment cannot constitute a direct comparison.

But they caution it might point to what they call a "significant area of weakness" that could put the brakes on the deployment of AI in clinical medicine. Specifically, they argued against using AI in tasks requiring visual abstraction and executive function.

Other scientists, however, remain unconvinced by the study and its findings, criticizing both its methods and its framing; they accuse the study's authors of anthropomorphizing AI by projecting human conditions onto it. Critics have also questioned the use of the MoCA, a test designed purely for humans that, they suggest, would not yield meaningful results when applied to other forms of intelligence.

"The MoCA was designed to assess human cognition, including visuospatial reasoning and self-orientation — faculties that do not align with the text-based architecture of LLMs," wrote Aya Awwad, research fellow at Mass General Hospital in Boston on Jan. 2, in a letter in response to the study. "One might reasonably ask: Why evaluate LLMs on these metrics at all? Their deficiencies in these areas are irrelevant to the roles they might fulfill in clinical settings — primarily tasks involving text processing, summarizing complex medical literature, and offering decision support."

Another major limitation, critics argue, is that the models were not tested more than once over time, so changes in cognitive function could not actually be measured. Testing models after significant updates would be more instructive and align far better with the article's hypothesis, wrote Aaron Sterling, CEO of EMR Data Cloud, and Roxana Daneshjou, assistant professor of biomedical sciences at Stanford, in a Jan. 13 letter.

Responding to the discussion, the study's lead author Roy Dayan, a doctor of medicine at the Hadassah Medical Center in Jerusalem, commented that many of the responses have taken the framing too literally. Because the study was published in the Christmas edition of the BMJ, the authors used humor to present their findings — including the pun "Age Against the Machine" — but intended the study to be taken seriously.

"We also hoped to cast a critical lens at recent research at the intersection of medicine and AI, some of which posits LLMs as fully-fledged substitutes for human physicians," wrote Dayan Jan. 10 in a letter in response to the study.

"By administering the standard tests used to assess human cognitive impairment, we tried to draw out the ways in which human cognition differs from how LLMs process and respond to information. This is also why we queried them as we would query humans, rather than via 'state-of-the-art prompting techniques', as Dr. Awwad suggests."

Recent updates

This article and its headline have been updated to include details of the skepticism expressed toward the study, as well as the response of the author to that criticism.

Drew Turney

Drew is a freelance science and technology journalist with 20 years of experience. After growing up knowing he wanted to change the world, he realized it was easier to write about other people changing it instead. As an expert in science and technology for decades, he’s written everything from reviews of the latest smartphones to deep dives into data centers, cloud computing, security, AI, mixed reality and everything in between.
