OpenAI is trying to make the case that AI can actually be useful at work, as some recent research has shown that companies aren’t getting much out of their AI investments.
On Tuesday, the ChatGPT-maker released a report introducing a new benchmark for testing AI on “economically valuable, real-world tasks” across 44 different jobs. The evaluation is called GDPval, and OpenAI says it’s meant to ground workplace AI debates in evidence rather than hype, and to track how models improve over time.
It comes on the heels of a recent MIT Media Lab study that found fewer than one in ten AI pilot projects delivered measurable revenue gains and warned that “95 percent of organizations are getting zero return” on their AI bets. And just last week, researchers from BetterUp Labs and Stanford’s Social Media Lab, writing in Harvard Business Review, blamed “workslop” for the lackluster results. They define workslop as “AI-generated work content that masquerades as good work, but lacks the substance to meaningfully advance a given task.”
OpenAI argues that GDPval fills a gap left by existing benchmarks, which typically test AI models on abstract academic problems rather than the kinds of day-to-day tasks people actually do at work.
What GDPval measures
“We call this evaluation GDPval because we started with the concept of Gross Domestic Product (GDP) as a key economic indicator and drew tasks from the key occupations in the industries that contribute most to GDP,” OpenAI wrote in a blog post announcing the report.
The first version of the benchmark spans 44 jobs across the nine industries that make up the largest share of U.S. GDP, including real estate, government, manufacturing, and finance. Within each sector, OpenAI zeroed in on the roles that account for the highest wages and compensation, focusing on what it called knowledge work.
To build the test set, OpenAI recruited professionals from these industries, averaging 14 years of experience, to design real-world tasks. Each expert also created a human-written example of how the task should be done. Example assignments include drafting a legal brief, producing an engineering blueprint, handling a customer support exchange, or writing a nursing care plan.
The benchmark includes 30 fully reviewed tasks per occupation, plus a smaller “gold set” of five open-sourced tasks per occupation. To measure performance, OpenAI used expert graders, professionals from the same fields represented in the dataset. These graders blindly compared the AI-generated deliverables with those produced by the task writers and provided critiques and rankings. They then rated each AI deliverable as better than, as good as, or worse than the human one.
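The scoring described above can be sketched in a few lines of Python. This is an illustration only, not OpenAI’s grading code: the verdict labels, the function name, and the sample verdicts are all hypothetical, and the only figures taken from the report are the 44 occupations and five gold-set tasks per occupation.

```python
from collections import Counter

# Hypothetical verdict labels: how a grader rated the AI deliverable
# relative to the human expert's deliverable for the same task.
VERDICTS = ("better", "as_good", "worse")


def win_or_tie_rate(verdicts):
    """Share of comparisons where the AI deliverable was rated
    better than or as good as the human one."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - set(VERDICTS)
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    return (counts["better"] + counts["as_good"]) / len(verdicts)


# Gold-set size implied by the report: 44 occupations x 5 tasks each.
GOLD_SET_SIZE = 44 * 5

# Made-up verdicts for illustration; these are not GDPval data.
sample = ["better", "worse", "as_good", "worse", "worse"]
print(f"gold-set tasks: {GOLD_SET_SIZE}")
print(f"win-or-tie rate: {win_or_tie_rate(sample):.1%}")
```

Counting ties alongside wins means a model does not have to beat the human expert to score; it only has to produce work a grader in that field judges to be at least as good.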
What GDPval found
The report found that today’s top AI models are already closing in on the quality of work produced by human experts.
In tests on 220 tasks from the GDPval gold set, evaluators compared deliverables from seven leading models against those of industry professionals.
Claude Opus 4.1 came out on top, with a 47.6% win and tie rate against human-completed tasks. It was especially strong on aesthetics, such as document formatting and slide layout.
GPT-5 high came in second with a win and tie rate of 38.8%. Its strength was accuracy, such as carefully following instructions and performing correct calculations.
GPT-4o was in last place with a win and tie rate of only 12.4%.
The AI models performed particularly well on tasks from occupations like counter and rental clerks; shipping, receiving, and inventory clerks; sales managers; and software developers.
They struggled more with tasks from occupations such as industrial engineers, medical engineers, pharmacists, financial managers, and video editors.
For example, Claude Opus 4.1 had its highest win and tie rate on tasks done by counter and rental clerks (81%), followed by shipping, receiving, and inventory clerks (76%). Its lowest scores were for tasks performed by industrial engineers and film and video editors (both 17%), and by audio and video technicians (2%).
OpenAI also claims these models can knock out GDPval tasks around 100 times faster and 100 times cheaper than human experts.
Still, OpenAI stressed that even as AI reshapes the job market, it won’t be able to completely replace humans. As the company put it, “most jobs are more than just a collection of tasks that can be written down.”
“GDPval highlights where AI can handle routine tasks so people can spend more time on the creative, judgment-heavy parts of work,” OpenAI wrote.