OpenAI GDPval: New Benchmark Tests AI Models on Real-World Professional Tasks

OpenAI introduces GDPval, a benchmark evaluating model performance on economically valuable tasks across 44 professional occupations, moving beyond academic tests to real-world work assessment.

LLMBase Editorial · Updated September 25, 2025

GDPval includes 1,320 specialized tasks created by experienced professionals with an average of 14 years in their fields. Unlike traditional benchmarks that rely on academic questions or coding challenges, GDPval tasks mirror real deliverables such as legal briefs, engineering blueprints, customer support conversations, and nursing care plans.

Occupation Selection Based on Economic Impact

The benchmark covers occupations selected from nine industries that each contribute more than 5% of U.S. GDP, according to Federal Reserve Bank of St. Louis data. Within each industry, OpenAI identified the five occupations with the highest wage contributions that qualify as predominantly knowledge work, using Bureau of Labor Statistics employment data and O*NET occupational information.

The 44 occupations span sectors including professional services (software developers, lawyers, accountants), healthcare (registered nurses, nurse practitioners), manufacturing (mechanical engineers, industrial engineers), and finance (financial analysts, customer service representatives). To qualify for inclusion, at least 60% of an occupation's component tasks had to involve knowledge work rather than physical labor.
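The two-stage filter described above can be sketched in code. The snippet below is purely illustrative: the data structures, field names, and thresholds are assumptions based on this article's summary, not OpenAI's actual selection pipeline.

```python
# Illustrative sketch of GDPval's occupation-selection criteria,
# as described in the announcement. All data structures here are
# hypothetical placeholders, not OpenAI's real inputs.

def select_occupations(industries, occupations):
    """industries: {industry_name: gdp_share}.
    occupations: list of dicts with keys 'industry', 'name',
    'wage_contribution', and 'knowledge_task_share'."""
    # Stage 1: keep industries contributing more than 5% of U.S. GDP.
    qualifying = {name for name, share in industries.items() if share > 0.05}

    selected = []
    for industry in sorted(qualifying):
        # Stage 2: occupations must be predominantly knowledge work,
        # i.e. at least 60% of component tasks.
        candidates = [
            o for o in occupations
            if o["industry"] == industry and o["knowledge_task_share"] >= 0.60
        ]
        # Keep the five highest wage contributors within the industry.
        candidates.sort(key=lambda o: o["wage_contribution"], reverse=True)
        selected.extend(candidates[:5])
    return selected
```

Run against nine qualifying industries, a filter like this would yield up to 45 occupations; the published benchmark settles on 44.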

Task Design and Validation Process

Each task underwent multiple rounds of expert review, averaging five iterations including validation by task writers, additional occupational reviewers, and model-based checking. The benchmark includes reference files and context, with expected deliverables spanning documents, presentations, diagrams, spreadsheets, and multimedia outputs.

For example, one manufacturing engineering task requires designing a cable spooling jig for mining equipment testing, complete with 3D modeling requirements and presentation deliverables. The task provides realistic constraints and context that mirror actual workplace assignments.

The full dataset contains 30 tasks per occupation, with five tasks per occupation available in an open-sourced gold set for broader research access.
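The dataset layout lends itself to a simple split. The sketch below shows one plausible way to carve five open "gold" tasks per occupation out of the full set of 30; the announcement does not describe how OpenAI actually chose the gold tasks, so the random sampling here is an assumption for illustration only.

```python
import random

def split_gold_set(tasks_by_occupation, gold_per_occupation=5, seed=0):
    """Illustrative split of each occupation's tasks into an open
    'gold' subset and a held-out remainder. The real selection
    procedure is not described in the announcement."""
    rng = random.Random(seed)
    gold, held_out = {}, {}
    for occupation, tasks in tasks_by_occupation.items():
        sample = rng.sample(tasks, gold_per_occupation)
        gold[occupation] = sample
        held_out[occupation] = [t for t in tasks if t not in sample]
    return gold, held_out
```

With 44 occupations at 30 tasks each, this yields 1,320 tasks total, of which 220 (5 × 44) would form the open gold set.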

Implications for European AI Development

GDPval's focus on real-world professional tasks addresses a gap relevant to European enterprises evaluating AI capabilities for knowledge work automation. The benchmark's emphasis on deliverable-based assessment aligns with how European organizations typically structure professional work and could inform procurement decisions for AI tools.

The benchmark methodology also demonstrates an approach to AI evaluation that European regulators and standards bodies might reference when developing frameworks for assessing AI system capabilities in workplace contexts. The multi-expert validation process provides a model for ensuring evaluation quality and representativeness.

For European AI companies competing with frontier models, GDPval offers a standardized way to demonstrate capabilities on economically relevant tasks rather than academic benchmarks that may not translate to business value.

Current Limitations and Future Development

OpenAI acknowledges that GDPval represents an early step in realistic AI evaluation. The current version uses one-shot evaluations that don't capture iterative workflows or context-building over multiple interactions. Future versions plan to incorporate more interactive workflows and context-rich tasks that better reflect complex knowledge work patterns.

The benchmark also focuses on U.S. occupational data and GDP contributions, potentially limiting its relevance for European labor markets with different professional structures and regulatory environments.

GDPval joins a growing set of practical benchmarks used by OpenAI, including SWE-bench Verified for software engineering, MLE-bench for machine learning engineering tasks, and PaperBench for replicating AI research. This progression toward economically grounded evaluation represents a significant shift from purely academic AI assessment toward measuring business-relevant capabilities.

Original source: This analysis is based on OpenAI's announcement available at https://openai.com/index/gdpval
