Date: 2025-07-21
Highlights
AI benchmarks are useful for assessing AI model performance. But when most developers report high scores, benchmarks become less meaningful.
A recent study found that some large AI companies privately test multiple versions of their models on the Chatbot Arena benchmark and selectively release the best results, inflating scores.
Researchers recommend that businesses use tools like YourBench to create customized benchmarks tailored to their own data and tasks, rather than relying solely on general-purpose benchmarks.
When an artificial intelligence (AI) model is introduced, the company that created it, whether Google, Anthropic, Meta, xAI or others, often points out that it has topped several AI benchmarks.