
OpenAI Language Models Hallucinate Due to Evaluation Incentives, Research Shows

OpenAI research reveals that language models hallucinate because standard evaluations reward guessing over acknowledging uncertainty. The study proposes fixing accuracy-based scoreboards to reduce confident errors in AI systems.

LLMBase Editorial · Updated September 5, 2025 · 3 min read
Tags: ai, llm, industry, hallucinations, evaluation, benchmarks

The study examined hallucination patterns across OpenAI's model lineup, including ChatGPT and GPT-5, which the company says produces fewer hallucinations but still exhibits the behavior. The research provides a statistical analysis of how next-word prediction training creates the conditions for factual inaccuracies.

Current Evaluation Methods Reward Guessing Over Honesty

The OpenAI team demonstrated how accuracy-focused benchmarks create perverse incentives for model behavior. When evaluations measure only the percentage of questions answered correctly, models learn to guess rather than abstain from answering uncertain queries.

The researchers illustrated this with SimpleQA evaluation data comparing two models. The newer gpt-5-thinking-mini abstained on 52% of questions and erred on 26%, while the older o4-mini abstained on only 1% but erred on 75%. Despite similar accuracy (22% versus 24%), the older model's guessing strategy produced roughly three times as many hallucinations.
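To make the trade-off concrete, the sketch below (illustrative Python, using only the rates quoted above) shows how an accuracy-only scoreboard ranks the guess-heavy model slightly higher even though it produces far more confident errors.

```python
# Illustrative sketch using the SimpleQA figures quoted above: an accuracy-only
# scoreboard ranks the guess-heavy model slightly higher even though it
# produces roughly three times as many confident errors.

models = {
    # name: (correct, errors, abstentions) as fractions of all questions
    "gpt-5-thinking-mini": (0.22, 0.26, 0.52),
    "o4-mini": (0.24, 0.75, 0.01),
}

for name, (correct, errors, abstain) in models.items():
    # A typical leaderboard reports only the "correct" fraction.
    print(f"{name}: accuracy-only score={correct:.2f}, "
          f"error rate={errors:.2f}, abstention rate={abstain:.2f}")
```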

This pattern reflects what the researchers call "teaching to the test": models optimize for accuracy metrics without accounting for the cost of confident wrong answers. The dynamic mirrors multiple-choice exams in which a lucky guess earns points while a blank answer earns nothing, so guessing looks better on the scoreboard than honest uncertainty.

Statistical Origins of Factual Errors in Pretraining

The research also examined how pretraining on unlabeled text leads language models to produce specific factual inaccuracies. Unlike spelling or syntax errors, which follow consistent patterns and disappear with scale, arbitrary facts such as birthdays or dissertation titles cannot be reliably predicted from contextual patterns alone.

The team compared this to image classification: labeling pet photos with each animal's birthday would inevitably produce errors no matter how sophisticated the algorithm, because the label follows no learnable pattern. In pretraining, models see only positive examples of fluent language with no true/false labels, making it difficult to distinguish valid statements from plausible fabrications.
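As a rough illustration of why such arbitrary facts resist learning (a simulation of ours, not taken from the paper), consider guessing birthdays: a strategy that always answers is locked into an error rate set by chance, while abstaining produces no confident errors at all.

```python
import random

# Illustrative simulation (not from the paper): if a "fact" like a birthday is
# arbitrary, i.e. carries no signal a model could learn from context, then any
# strategy that always answers is forced to guess, and its error rate is fixed
# by the number of possible answers, not by model quality.

random.seed(0)
n_people, n_days = 10_000, 365
true_birthdays = [random.randrange(n_days) for _ in range(n_people)]

# "Always answer" strategy: guess a day for every person.
guesses = [random.randrange(n_days) for _ in range(n_people)]
errors = sum(g != t for g, t in zip(guesses, true_birthdays)) / n_people
print(f"always-guess error rate: {errors:.3f}")   # close to 1 - 1/365

# "Abstain" strategy: say "I don't know" for every person -> zero confident errors.
print("abstain error rate: 0.000")
```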

This analysis explains why models can achieve near-perfect performance on structured tasks while still generating confident falsehoods about low-frequency factual claims. The statistical mechanisms underlying hallucinations are predictable rather than mysterious glitches.

Implications for Enterprise AI Deployment

For European organizations evaluating language models, these findings highlight the importance of examining error rates alongside accuracy metrics. The research suggests that models with higher abstention rates may be more suitable for applications where confident errors carry significant business or compliance risks.

The proposed solution involves updating evaluation frameworks to penalize confident errors more heavily than uncertainty expressions. This approach could influence model selection criteria for enterprises requiring high reliability, particularly in regulated sectors where AI explainability and uncertainty quantification matter for audit purposes.

Multilingual teams should note that, according to the research, smaller models can be better at recognizing the limits of their knowledge in languages they have not mastered. This could affect deployment strategies for organizations operating across multiple European markets with diverse language requirements.

Measuring Reliability Beyond Accuracy Scores

OpenAI's research challenges the AI industry's reliance on accuracy-based leaderboards that dominate model comparisons. The team argues that fixing primary evaluation metrics is more important than developing specialized hallucination tests, since hundreds of traditional benchmarks continue rewarding guessing behavior.

The proposed scoring changes would give partial credit for appropriate uncertainty expressions while penalizing confident wrong answers. This echoes standardized testing approaches that use negative marking to discourage blind guessing, but requires broader adoption across AI evaluation frameworks.
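A minimal sketch of how such a scoring rule could look (the weights here are illustrative, not OpenAI's exact proposal): correct answers score one point, abstentions score zero, and confident wrong answers carry a penalty.

```python
from typing import Optional

# Illustrative negative-marking scheme: confident wrong answers are penalized,
# abstentions score zero, correct answers score one. The penalty weight is an
# assumption for the example, not a value prescribed by the research.

def score(answer: Optional[str], truth: str, wrong_penalty: float = 1.0) -> float:
    """Return the per-question score under a negative-marking scheme."""
    if answer is None:          # model explicitly abstained ("I don't know")
        return 0.0
    return 1.0 if answer == truth else -wrong_penalty

# With a penalty of 1, guessing only pays off in expectation when the model's
# chance of being right exceeds 50%; below that, abstaining scores higher.
print(score("Paris", "Paris"))   # 1.0
print(score("Lyon", "Paris"))    # -1.0
print(score(None, "Paris"))      # 0.0
```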

For technical teams building applications with language models, the research suggests implementing confidence thresholds and uncertainty detection as core reliability features rather than optional add-ons. This approach could improve system trustworthiness even as underlying models continue improving.
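A minimal sketch of what that could look like at the application layer, assuming a hypothetical confidence signal from the model (the names and threshold below are illustrative):

```python
from dataclasses import dataclass

# Hedged sketch of an application-layer confidence gate. The way confidence is
# obtained (here, a self-reported probability attached to the answer) is a
# hypothetical placeholder; the point is that abstention is handled as a
# first-class outcome rather than an optional add-on.

@dataclass
class Answer:
    text: str
    confidence: float  # assumed to be in [0, 1]

def answer_or_abstain(answer: Answer, threshold: float = 0.75) -> str:
    """Return the model's answer only when its confidence clears the threshold."""
    if answer.confidence < threshold:
        return "I don't know"   # surfaced to the user or routed to a human reviewer
    return answer.text

# Example usage with a made-up low-confidence answer:
print(answer_or_abstain(Answer(text="March 3rd", confidence=0.41)))  # "I don't know"
```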

Original source: OpenAI published this research at https://openai.com/index/why-language-models-hallucinate, with an accompanying technical paper available on arXiv.
