"Recent advances in Artificial Intelligence based on systems that require vast amounts of data and computation, such as GPT-4, have highlighted how difficult it is to understand the capabilities and weaknesses of these AI systems. We need to find out where these systems are safe to use and how they could be improved. And this is due to the way AI is assessed today, which needs to change urgently."
Behind these words are 16 of the world's leading experts in Artificial Intelligence, including researchers from the VRAIN Institute of the Universitat Politècnica de València (UPV), José Hernández-Orallo, Fernando Martínez Plumed and Wout Schellaert.
Coordinated by Professor Hernández-Orallo, the 16 researchers have published a letter today in the journal Science calling for a "rethink" of how AI tools are evaluated: a move towards more transparent models whose effectiveness and actual capabilities, what they can and cannot do, are made known.
In their paper, the authors propose a roadmap under which the results of AI models are reported in a more nuanced way and case-by-case evaluation results are made publicly available.
As Hernández-Orallo explains, the performance of an AI model is usually measured with aggregated statistics. This poses a risk: while such statistics can give a picture of good overall performance, they can also hide low reliability or usefulness in specific, less-represented cases, "and yet it is implied that the model is equally valid in all cases when in fact it is not".
In the paper, the authors illustrate this with AI models that assist clinical diagnosis, pointing out that these systems may perform poorly when analysing people of a particular ethnicity or demographic group, because such cases make up only a small proportion of their training data.
"We are asking that whenever an AI result is published, it should be broken down as much as possible, so that its real usefulness is known and the analysis can be reproduced. In the article published in Science, we also discuss an AI facial recognition system that reported a 90% accuracy rate; it was later found that for white men the accuracy was 99.2%, but for black women it was only 65.5%. This is why the results claimed for an AI tool's usefulness are sometimes not entirely transparent and reliable. If they don't give you the detail, you think the models work very well, and that's not the reality. Not having that breakdown, with all the available information about the AI model, means that applying it could entail risks," says José Hernández-Orallo.
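The masking effect described above can be made concrete with a short sketch. The code below uses hypothetical counts chosen only to roughly mirror the breakdown cited in the article (99.2% vs. 65.5%); the `accuracy_by_group` helper is illustrative and is not taken from the paper itself.

```python
# Sketch (hypothetical data): how an aggregate accuracy figure can mask
# large per-group differences in an evaluation set.
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, correct) pairs.
    Returns (overall accuracy, dict of per-group accuracy)."""
    totals = defaultdict(lambda: [0, 0])  # group -> [correct count, total count]
    for group, correct in records:
        totals[group][0] += int(correct)
        totals[group][1] += 1
    overall = sum(c for c, _ in totals.values()) / sum(t for _, t in totals.values())
    per_group = {g: c / t for g, (c, t) in totals.items()}
    return overall, per_group

# Hypothetical test set: 800 samples from a majority group (794 correct),
# 200 from a minority group (131 correct).
records = [("majority", i < 794) for i in range(800)] + \
          [("minority", i < 131) for i in range(200)]

overall, per_group = accuracy_by_group(records)
print(f"overall:  {overall:.1%}")                 # 92.5%
print(f"majority: {per_group['majority']:.1%}")   # 99.2%
print(f"minority: {per_group['minority']:.1%}")   # 65.5%
```

The aggregate 92.5% looks reassuring on its own; only the disaggregated report exposes the 33-point gap between groups, which is precisely the kind of breakdown the authors ask publishers of AI results to provide.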
The VRAIN UPV researcher stresses that the proposed changes can improve the understanding of AI and also reduce the "voracious" competition among AI labs to announce that their model improves on previous systems by a certain percentage.
"Some labs want to go from 93% to 95% no matter what, which works against AI's ultimate applicability and reliability. What we want, in short, is to contribute to a better shared understanding of how AI works and what the limitations of each model are, in order to guarantee the correct use of this technology," concludes Hernández-Orallo.
Along with researchers from the VRAIN Institute of the Universitat Politècnica de València, the article's authors include research staff from the University of Cambridge, Harvard University, the Massachusetts Institute of Technology (MIT), Stanford University, Google, Imperial College London, the University of Leeds, the Alan Turing Institute in London, DeepMind, the US National Institute of Standards and Technology (NIST), the Santa Fe Institute, Tongji University in Shanghai and Shandong University in Jinan.
Reference
Ryan Burnell et al., "Rethink reporting of evaluation results in AI". Science 380, 136-138 (2023). DOI: 10.1126/science.adf6369