Most AI Safety Tests Contain Major Flaws, Study Finds
Researchers from the UK’s AI Security Institute, together with experts from Stanford, Berkeley and Oxford, examined more than 440 benchmarks used to assess the safety and performance of new AI models. Almost all were found to have weaknesses that undermine their reliability, rendering many results misleading or meaningless.
With no comprehensive AI regulation in the UK or US, major tech firms rely on these benchmarks to validate their models, even though faulty metrics can exaggerate capabilities or hide risks. Google’s withdrawal of its Gemma model, after it generated defamatory allegations about a US senator, illustrates how fragile these evaluation systems can be.
Experts warn that without shared standards and trustworthy testing, it is impossible to know whether AI systems are genuinely improving or merely appearing to do so, leaving claims about AI safety on shaky ground.