How well do AI agents represent the real world of work? A large study shows that the development of AI agents is focused almost exclusively on programming tasks, ignoring the majority of the labor market.
A team of researchers from Carnegie Mellon University and Stanford University systematically compared 43 agent benchmarks, comprising a total of 72,342 tasks, against the US labor market. They mapped the benchmark tasks to 1,016 real-world occupations using the US government's O*NET database, which contains multi-level catalogs of occupational activities.
The result is a clear imbalance. Current agent development concentrates almost exclusively on computer science and mathematics occupations, which mainly involve programming tasks. Yet this sector represents only 7.6 percent of total employment in the United States.
Highly digitalized industries are largely untested
The analysis identifies many occupational areas that are highly digitized but barely visible in existing benchmarks. According to the study, management has a digitization rate of 88 percent but accounts for only 1.4 percent of all benchmark tasks analyzed. Legal work (70 percent digital) accounts for 0.3 percent, and architecture and engineering (71 percent digital) for just 0.7 percent.
According to the researchers, it is precisely in these areas that AI agents could deliver near-term productivity gains. At the same time, these areas pose specific technical challenges, such as vague goals and results that can only be verified over long time horizons.
There is also a gap from an economic perspective. Looking at the wage distribution, that is, the total income per occupational field, the economically most valuable areas such as management and law remain underrepresented in the benchmarks. At the same time, poorly paid, labor-intensive areas such as personal services and care work are also barely considered.
Benchmarks target skills covering less than five percent of employment
The imbalance is also evident at the level of individual skills. The researchers sorted occupational activities into four skill categories: information input, mental processes, interacting with others, and work output. In the real world of work, the required skills are distributed fairly evenly across all four categories.
Agent benchmarks, by contrast, concentrate on two specific activities: "getting information" and "working with computers." Together, these account for less than five percent of US employment.
The researchers attribute this bias to methodological convenience: domains with easily formulated task instructions and easily verifiable outcomes are disproportionately favored. This has driven rapid progress in niche areas, but it risks steering agent development away from the areas where the social and economic benefits would be greatest.
The researchers highlight OpenAI's GDPval benchmark as a positive example: although relatively small in scope, it covers a wide range of domains and specialized skills. OpenAI introduced the benchmark in 2025 explicitly to measure the impact of AI models on real-world knowledge work across as many domains as possible.
Autonomy decreases rapidly as task complexity increases
To understand how independently AI agents can operate across this range of tasks, the researchers developed a quantitative measure of autonomy. They define autonomy as the maximum task complexity an agent can handle at a predetermined success rate, and they measure task complexity by the number of work steps required in a hierarchical workflow.
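The definition above can be sketched as a small calculation, assuming success rates have already been estimated per complexity level. The function name, threshold, and numbers below are illustrative, not taken from the study:

```python
def autonomy_level(success_by_complexity, threshold=0.5):
    """Return the maximum task complexity (number of work steps)
    at which the agent's success rate still meets the threshold.

    success_by_complexity: dict mapping complexity (int) -> success rate (0..1)
    """
    qualifying = [c for c, rate in sorted(success_by_complexity.items())
                  if rate >= threshold]
    return max(qualifying) if qualifying else 0

# Illustrative numbers: success drops sharply as tasks grow longer.
rates = {1: 0.92, 2: 0.78, 4: 0.55, 8: 0.21, 16: 0.05}
print(autonomy_level(rates, threshold=0.5))  # -> 4
```

Under this reading, an agent's autonomy is a single number that falls as soon as longer workflows push its success rate below the chosen bar.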
Even in software development, the most heavily benchmarked field, success rates drop sharply as task complexity increases. Agents perform best on self-contained activities such as mental processes and producing work output, but fail at identifying and retrieving information and at coordinating with others, even for relatively simple tasks.
According to the study, the OpenHands system has advantages over SWE-agent and Claude-based agents in several benchmarks that allow controlled comparison, such as SWE-bench, especially on tasks of moderate complexity. However, the researchers caution that these trends do not necessarily hold at other complexity levels, and that more extensive publication of agent trajectories is needed for more systematic comparisons.
Three principles for better benchmarks
Based on their analysis, the researchers formulate three design principles for future benchmarks. First, new benchmarks should specifically cover underrepresented but highly digitalized fields such as management and law, or aim for broad coverage across all fields and skills.
Second, benchmarks must become more realistic and complex. According to the analysis, many benchmark tasks are oversimplified abstractions of actual work. Tasks created by real professionals, such as those in the GDPval or TheAgentCompany benchmarks, involve considerably more steps and expertise. Where simplification is necessary for practical reasons, task design should still be grounded in real workflows.
Third, the researchers push for more granular evaluation. Measuring only whether an agent fully solved a task at the end misses exactly where it fails. Instead, they suggest automatically deriving workflows from human demonstrations, creating intermediate checkpoints that provide a more nuanced picture of agent performance.
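The checkpoint idea amounts to partial-credit scoring over workflow steps instead of a binary pass/fail at the end. A minimal sketch, with a made-up software workflow as the reference (not the study's actual implementation):

```python
def checkpoint_score(completed_steps, workflow):
    """Score an agent run by the longest prefix of the reference
    workflow it completed, rather than all-or-nothing success.

    completed_steps: set of step names the agent finished
    workflow: ordered list of checkpoint names from a human demonstration
    """
    done = 0
    for step in workflow:
        if step in completed_steps:
            done += 1
        else:
            break  # the first missed checkpoint ends the completed prefix
    return done / len(workflow)

# Hypothetical bug-fixing workflow derived from a human demonstration.
workflow = ["open_ticket", "reproduce_bug", "write_fix", "run_tests", "submit_pr"]
print(checkpoint_score({"open_ticket", "reproduce_bug", "write_fix"}, workflow))  # -> 0.6
```

A final-answer-only metric would score this run 0.0; the checkpoint view shows the agent got three of five steps in before failing at testing.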
The study helps benchmark designers identify gaps in occupational coverage. It also provides a framework and supplementary resources to help agent developers and users select the appropriate level of autonomy for their specific tasks and pinpoint areas for improvement.
An Anthropic analysis based on millions of real interactions between humans and agents recently showed that software developers account for almost 50 percent of all agent tool calls via public APIs, while other industries account for only a few percentage points each. Anthropic speaks of the "early days of agent adoption."
A study by UC Berkeley and other partners reached a similar conclusion in late 2025: in practice, companies primarily use AI agents as simple, tightly controlled tools with few autonomous steps. The biggest hurdle remains system reliability.
