Recent advances in artificial intelligence (AI) have led to the widespread use of large language models in fields such as education, science, medicine, art and finance, among many others. These models are increasingly present in our daily lives, yet they are less reliable than users expect. This is the conclusion of a study led by a team from the VRAIN Institute of the Universitat Politècnica de València (UPV) and the Valencian Graduate School and Research Network in Artificial Intelligence (ValgrAI), together with the University of Cambridge, published today in the journal Nature.
The study reveals an ‘alarming’ trend: in certain respects, reliability has worsened in the most recent models compared with earlier ones (GPT-4 compared to GPT-3, for example).
According to José Hernández-Orallo, a researcher at the Valencian Institute for Research in Artificial Intelligence (VRAIN) of the UPV and at ValgrAI, one of the main concerns about the reliability of language models is that their performance does not match human perceptions of task difficulty. In other words, there is a mismatch between where humans expect the models to fail, based on perceived task difficulty, and where the models actually fail. ‘Models can solve certain complex tasks in line with human abilities, but at the same time they fail on simple tasks in the same domain. For example, they can solve several PhD-level mathematical problems, yet they can get a simple addition wrong,’ notes Hernández-Orallo.
In 2022, Ilya Sutskever, the scientist behind some of the most significant advances in artificial intelligence in recent years (from the ImageNet solution to AlphaGo) and co-founder of OpenAI, predicted that ‘maybe over time that discrepancy will diminish’.
However, the study by the UPV, ValgrAI and University of Cambridge team shows this has not been the case. To demonstrate it, the researchers investigated three key aspects that affect the reliability of language models from a human perspective.
The study finds a discordance between model performance and human perceptions of difficulty. ‘Do models fail where we expect them to fail? Our work finds that models tend to be less accurate on tasks that humans consider difficult, but they are not 100% accurate even on simple tasks. This means there is no "safe zone" in which models can be trusted to work perfectly,’ says Yael Moros Daval, a researcher at the VRAIN Institute.
In fact, the team from the VRAIN UPV Institute, ValgrAI and the University of Cambridge reports that the most recent models mainly improve their performance on high-difficulty tasks but not on low-difficulty ones, ‘which aggravates the difficulty mismatch between the performance of the models and human expectations’, adds Fernando Martínez-Plumed, also a researcher at VRAIN UPV.
The study also finds that recent language models are much more likely to give incorrect answers than to abstain from answering tasks they are unsure of. ‘This can disappoint users who initially place too much trust in the models. Moreover, unlike people, the models’ tendency to avoid answering does not increase with difficulty; humans, by contrast, tend to avoid answering problems beyond their capacity. This puts the onus on users to detect faults throughout all their interactions with the models,’ adds Lexin Zhou, a member of the VRAIN team who was also involved in this work.
Is the effectiveness of question formulation affected by the difficulty of the questions? This is another issue addressed by the UPV, ValgrAI and Cambridge study, which concludes that the current trend of progress in language models, including their better understanding of a wide variety of prompts, may not free users from worrying about how to phrase their requests effectively. ‘We have found that users can be influenced by prompts that work well on complex tasks but that, at the same time, produce incorrect answers on simple tasks,’ adds Cèsar Ferri, co-author of the study and researcher at VRAIN UPV and ValgrAI.
In addition to these findings on the unreliability of language models, the researchers found that human supervision is unable to compensate for these problems. For example, people can recognise high-difficulty tasks, but they still frequently judge incorrect outputs as correct on those tasks, even when allowed to say ‘I'm not sure’, indicating overconfidence.
The results were similar across multiple families of language models, including OpenAI's GPT family, Meta's open-weight LLaMA and BLOOM, a fully open initiative from the scientific community.
The researchers further found that difficulty mismatch, lack of proper abstention and prompt sensitivity remain problematic in new versions of popular families, such as OpenAI's new o1 and Anthropic's Claude-3.5-Sonnet models.
'Ultimately, large language models are becoming increasingly unreliable from a human point of view, and user supervision to correct errors is not the solution, as we tend to rely too much on models and cannot recognise incorrect results at different difficulty levels. Therefore, a fundamental change is needed in the design and development of general-purpose AI, especially for high-risk applications, where predicting the performance of language models and detecting their errors is paramount,' concludes Wout Schellaert, a researcher at the VRAIN UPV Institute.
Zhou, L., Schellaert, W., Martínez-Plumed, F. et al. Larger and more instructable language models become less reliable. Nature (2024). https://doi.org/10.1038/s41586-024-07930-y