
Large artificial intelligence language models, increasingly unreliable

According to a study by the Universitat Politècnica de València, ValgrAI and the University of Cambridge, published in the journal Nature

[ 26/09/2024 ]

Recent advances in artificial intelligence (AI) have led to the widespread use of large language models in our society in fields such as education, science, medicine, art and finance, among many others. These models are increasingly present in our daily lives. However, they are less reliable than users expect. This is the conclusion of a study led by a team from the VRAIN Institute of the Universitat Politècnica de València (UPV) and the Valencian Graduate School and Research Network in Artificial Intelligence (ValgrAI), together with the University of Cambridge, published today in the journal Nature.

The study reveals an 'alarming' trend: compared with earlier models, reliability has worsened in certain respects in the most recent ones (GPT-4 compared with GPT-3, for example).

According to José Hernández-Orallo, a researcher at the Valencian Institute for Research in Artificial Intelligence (VRAIN) of the UPV and ValgrAI, one of the main concerns about the reliability of language models is that their performance does not match the human perception of task difficulty. In other words, the tasks on which the models actually fail are not the tasks humans expect them to fail on, given how difficult those tasks appear. 'Models can solve certain complex tasks in line with human abilities, yet at the same time they fail on simple tasks in the same domain. For example, they can solve several PhD-level mathematical problems, but they can get a simple addition wrong,' notes Hernández-Orallo.

In 2022, Ilya Sutskever, the scientist behind some of the most significant advances in artificial intelligence in recent years (from the solution of ImageNet to AlphaGo) and co-founder of OpenAI, predicted that 'maybe over time that discrepancy will diminish'.

However, the study by the UPV, ValgrAI and Cambridge University team shows this has not been the case. To demonstrate this, they investigated three key aspects that affect the reliability of language models from a human perspective.

There is no ‘safe zone’ in which models work perfectly

The study finds a discordance between model performance and human perceptions of difficulty. 'Do models fail where we expect them to fail? Our work finds that models tend to be less accurate on tasks that humans consider difficult, but they are not 100% accurate even on simple tasks. This means there is no "safe zone" in which models can be trusted to work perfectly,' says Yael Moros Daval, a researcher at the VRAIN Institute.
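
Conceptually, the 'safe zone' question can be checked by binning a model's answers according to a human difficulty rating and asking whether any bin, including the easiest, reaches perfect accuracy. The sketch below is illustrative only: the data fields and toy records are hypothetical, not the study's benchmarks or grading pipeline.

from collections import defaultdict

def accuracy_by_difficulty(records, n_bins=5):
    """records: dicts with 'difficulty' in [0, 1] and 'correct' (bool)."""
    bins = defaultdict(lambda: [0, 0])  # bin index -> [number correct, total]
    for r in records:
        idx = min(int(r["difficulty"] * n_bins), n_bins - 1)
        bins[idx][0] += int(r["correct"])
        bins[idx][1] += 1
    return {idx: correct / total for idx, (correct, total) in sorted(bins.items())}

# Toy data: even the easiest bin can fall short of 100% accuracy,
# which is the pattern the study describes as the absence of a 'safe zone'.
toy = [
    {"difficulty": 0.10, "correct": True},
    {"difficulty": 0.15, "correct": False},  # an 'easy' item the model still misses
    {"difficulty": 0.80, "correct": False},
    {"difficulty": 0.85, "correct": True},
]
print(accuracy_by_difficulty(toy))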

In fact, the team from the VRAIN UPV Institute, ValgrAI and the University of Cambridge found that the most recent models mainly improve their performance on high-difficulty tasks but not on low-difficulty ones, 'which aggravates the difficulty mismatch between the performance of the models and human expectations', adds Fernando Martínez-Plumed, also a researcher at VRAIN UPV.

More likely to provide incorrect answers

The study also finds that recent language models are much more likely to provide incorrect answers than to decline to answer tasks they are unsure of. 'This can disappoint users who initially place too much trust in the models. Moreover, unlike in people, the models' tendency to avoid answering does not increase with difficulty. Humans, for example, tend to avoid giving answers to problems beyond their capacity. This puts the onus on users to detect faults during all their interactions with models,' adds Lexin Zhou, a member of the VRAIN team who was also involved in this work.
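
As an illustration of how such behaviour can be measured, the sketch below separates 'avoidant' answers from incorrect ones and compares the avoidance rate on easy versus hard items. The avoidance markers, the exact-match grading rule and the difficulty threshold are assumptions made for this sketch, not the study's methodology.

AVOIDANCE_MARKERS = ("i don't know", "i'm not sure", "cannot answer")

def classify(response, gold):
    """Label a response as 'avoidant', 'correct' or 'incorrect'."""
    text = response.strip().lower()
    if any(marker in text for marker in AVOIDANCE_MARKERS):
        return "avoidant"
    return "correct" if text == gold.strip().lower() else "incorrect"

def avoidance_rate(samples):
    """samples: list of (response, gold, difficulty) tuples, difficulty in [0, 1]."""
    def rate(group):
        return sum(classify(r, g) == "avoidant" for r, g, _ in group) / len(group) if group else 0.0
    easy = [s for s in samples if s[2] < 0.5]
    hard = [s for s in samples if s[2] >= 0.5]
    # For humans, the 'hard' rate would typically exceed the 'easy' rate;
    # the study reports that for recent models it does not.
    return {"easy": rate(easy), "hard": rate(hard)}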

Sensitivity to the problem statement

Is the effectiveness of question formulation affected by the difficulty of the questions? This is another issue addressed by the UPV, ValgrAI and Cambridge study, which concludes that ongoing progress in language models, including their ability to follow a wider variety of prompt formulations, may not free users from having to worry about phrasing their requests effectively. 'We have found that users can be swayed by prompts that work well on complex tasks but, at the same time, yield incorrect answers on simple tasks,' adds Cèsar Ferri, co-author of the study and researcher at VRAIN UPV and ValgrAI.
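
One way to picture prompt sensitivity is to ask the same underlying question under several phrasings and compare the answers. The sketch below is a hypothetical illustration: ask_model stands in for whatever model call is available and is not a real API, and the prompt variants are arbitrary examples.

def prompt_variants(question):
    """A few alternative phrasings of the same question."""
    return [
        question,
        f"Please answer concisely: {question}",
        f"Think step by step, then give only the final answer: {question}",
    ]

def prompt_sensitivity(ask_model, question, gold):
    """ask_model: a callable taking a prompt string and returning an answer string."""
    answers = [ask_model(p).strip().lower() for p in prompt_variants(question)]
    return {
        "answers": answers,
        "all_agree": len(set(answers)) == 1,
        "accuracy_across_prompts": sum(a == gold.strip().lower() for a in answers) / len(answers),
    }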

Human supervision unable to compensate for these problems

In addition to these findings on aspects of the unreliability of language models, the researchers have discovered that human supervision is unable to compensate for these problems. For example, people can recognise tasks of high difficulty but still frequently judge incorrect results to be correct in that range, even when they are allowed to say 'I'm not sure', indicating overconfidence.
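
One simple way to quantify this kind of overconfident supervision is the fraction of model errors that human verifiers nonetheless accept as correct, split by perceived difficulty. The sketch below uses hypothetical field names and a hypothetical difficulty threshold; it is not the study's annotation pipeline.

def false_acceptance_rate(verifications):
    """verifications: dicts with 'model_correct' (bool),
    'human_verdict' in {'correct', 'incorrect', 'not sure'} and 'difficulty' in [0, 1]."""
    result = {}
    for label, group in (
        ("easy", [v for v in verifications if v["difficulty"] < 0.5]),
        ("hard", [v for v in verifications if v["difficulty"] >= 0.5]),
    ):
        errors = [v for v in group if not v["model_correct"]]
        accepted = [v for v in errors if v["human_verdict"] == "correct"]
        result[label] = len(accepted) / len(errors) if errors else 0.0
    return result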

From ChatGPT to LLaMA and BLOOM

The results were similar across multiple families of language models, including OpenAI's GPT family, Meta's open-weight LLaMA, and BLOOM, a fully open initiative from the scientific community.

The researchers further found that the issues of difficulty mismatch, lack of proper abstention and prompt sensitivity remain problematic in new versions of popular families, such as OpenAI's o1 and Anthropic's Claude-3.5-Sonnet models.

'Ultimately, large language models are becoming increasingly unreliable from a human point of view, and user supervision to correct errors is not the solution, as we tend to rely too much on models and cannot recognise incorrect results at different difficulty levels. Therefore, a fundamental change is needed in the design and development of general-purpose AI, especially for high-risk applications, where predicting the performance of language models and detecting their errors is paramount,' concludes Wout Schellaert, a researcher at the VRAIN UPV Institute.

Reference

Zhou, L., Schellaert, W., Martínez-Plumed, F. et al. Larger and more instructable language models become less reliable. Nature (2024). https://doi.org/10.1038/s41586-024-07930-y
