Can Your LLM Earn $1M from Software Engineering?
OpenAI released SWE-Lancer, a real-world benchmark for coding models—complete with $1 million in actual freelance payouts up for grabs.
We’re used to seeing benchmarks like HumanEval and competitive programming datasets push coding Large Language Models (LLMs) to the next level. But how do these models actually fare in real-world software development — the kind where customers pay money for bug fixes and feature requests? Enter SWE-Lancer, a new benchmark by OpenAI that puts LLMs through their paces on 1,400+ real freelance tasks from Upwork, worth $1 million in collective payouts.
Why SWE-Lancer?
Traditional coding benchmarks typically rely on self-contained problems (like reversing a linked list or solving a puzzle). These tasks give us some insight into a model’s coding ability, but they aren’t great analogs for day-to-day software engineering—especially the kind that gets you paid on a freelancing platform. SWE-Lancer addresses that gap with:
$1M Real-World Value
Each of the 1,488 tasks is an authentic freelance job posted on Upwork, with a payout that was actually paid to human contributors. Because each task carries a real market rate, the price naturally reflects project complexity: some were quick $50 bug fixes, while others ballooned into $32,000 expansions or multi-week feature builds.
Full-Stack Coverage & E2E Testing
These tasks come from Expensify’s open-source codebase (yes, the expense-reporting platform) and require the model to handle front-end, back-end, and user-facing logic. Each solution is verified by end-to-end (E2E) tests, modeled on how companies actually check software in QA. No simple unit tests or toy examples here.
IC & Manager Mode
SWE-Lancer tests both:
Individual Contributor (IC) tasks, where the model directly writes patches or feature implementations.
SWE Manager tasks, where the model selects the best approach from multiple competing pull-request proposals, mimicking the role of a team lead reviewing suggestions. (A rough sketch of both task types appears below.)
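To make the split concrete, here is a minimal sketch of how the two task types might be represented and graded. The field names, schema, and grading rule are illustrative assumptions, not the benchmark’s actual data format.

```python
# Hypothetical sketch of SWE-Lancer's two task types. All field names and the
# grading rule are illustrative assumptions, not the benchmark's real schema.
from dataclasses import dataclass, field


@dataclass
class ICTask:
    """Individual Contributor task: the model must produce a working code patch."""
    task_id: str
    payout_usd: float
    issue_description: str
    repo_checkout: str       # path to the codebase as it stood when the issue was filed
    e2e_test_command: str    # end-to-end suite that decides pass/fail


@dataclass
class ManagerTask:
    """SWE Manager task: the model picks the best of several competing proposals."""
    task_id: str
    payout_usd: float
    issue_description: str
    proposals: list[str] = field(default_factory=list)
    accepted_index: int = 0  # the proposal a human team lead actually accepted


def grade_manager_choice(task: ManagerTask, model_choice: int) -> float:
    """Credit the payout only if the model picks the human-accepted proposal."""
    return task.payout_usd if model_choice == task.accepted_index else 0.0
```

The contrast the sketch is meant to capture: IC tasks are judged by running tests against the model’s code, while manager tasks reduce to a choice that can be compared against what humans actually approved.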
How It’s Built
Collected from Upwork
Expensify regularly posts its open issues on Upwork for freelancers. Researchers collected 1,488 tasks from those postings, all real and all with final payouts.
Triple-Verified E2E Tests
A professional engineering team wrote comprehensive Playwright tests to confirm each solution truly works; the agent (i.e., the LLM) must pass these tests to “earn” the payout. (A minimal example of what such a test looks like appears at the end of this section.)
Complex, Real Problems
Tasks often involve tricky edge cases, multi-file changes, or UI/UX logic. The average job took 26 days for freelancers to close and had 47 discussion comments before resolution.
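For a sense of what “passing the E2E tests” means in practice, here is a minimal Playwright-style check in the same spirit. The URL, selectors, and flow are invented for illustration; they are not taken from the benchmark’s actual test suites.

```python
# Minimal sketch of an end-to-end check in the spirit of the benchmark's
# Playwright suites. The URL, labels, and expected text are assumptions made
# for illustration only.
from playwright.sync_api import expect, sync_playwright


def test_expense_can_be_submitted() -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Drive the app through the UI, the way a QA engineer would.
        page.goto("http://localhost:8082")  # assumed local dev build of the app
        page.get_by_role("button", name="New expense").click()
        page.get_by_label("Amount").fill("42.00")
        page.get_by_role("button", name="Submit").click()
        # A patch only "earns" its payout if the user-visible behavior is correct.
        expect(page.get_by_text("Expense submitted")).to_be_visible()
        browser.close()
```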
The Results (So Far)
Multiple leading LLMs, including GPT-4o and o1 from OpenAI and Claude 3.5 Sonnet from Anthropic, have tried the SWE-Lancer tasks. The short version:
Claude 3.5 Sonnet led the pack, but it still solved only 26% of IC tasks and less than half of the manager tasks in the benchmark’s public set (“SWE-Lancer Diamond”).
The best model racked up $400K on the full $1M set—impressive in theory, but it still leaves $600K on the table.
In other words: frontier LLMs are making headway but still fail well over half these real-world tasks. The field is far from solved.
Why It Matters
Code That Pays
Benchmarks are fun for scoreboard bragging. But because SWE-Lancer tasks are literally paid tasks from the past, success translates directly to real-world, freelancer-level competence. Can an AI developer underbid humans on Upwork tomorrow? If so, we’ll likely see it first here.
Agentic Safety & Economic Impact
As LLMs get more capable, questions arise about how they’ll reshape the coding job market. Will they augment or replace certain roles? By mapping capabilities to real $$ payouts, this benchmark tracks these disruptions more accurately.
Manager vs. Developer
Real software development is more than just shipping code. Often, it’s about reviewing proposals and integrating them with existing designs and architecture. SWE-Lancer includes manager tasks precisely because they’re integral to the real dev cycle.
What’s Next?
Open-Sourcing & Competition
A public subset called SWE-Lancer Diamond ($500K worth of tasks) is free for you to try. The rest is a private holdout to detect overfitting and ensure stable leaderboards.
Tooling
Each IC task ships with a Docker environment, a local test suite, and a “user tool” that LLMs can call to run the code. This fosters iterative debugging, much like a real dev environment. (A rough sketch of such a harness loop appears at the end of this section.)
AI Ecosystem
The community can build better training strategies, refine developer UIs, or create new modules to handle front-end edge cases. If your model wants to prove it can code for real, this is the new bar.
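As a rough illustration of what that tooling loop could look like, the sketch below applies a model-generated patch inside a task container and runs the E2E suite. The image name, the in-container repo path, and the patch-application step are assumptions; the benchmark’s actual harness may differ.

```python
# Rough sketch of evaluating one IC task inside its Docker environment.
# The image name, repo path (/app), and patch-application step are assumptions
# for illustration; they are not the benchmark's actual tooling.
import subprocess


def run_ic_task(image: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply the model's patch in the task container and run the E2E suite."""
    container = subprocess.run(
        ["docker", "run", "-d", image, "sleep", "infinity"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    try:
        # Copy the model-generated patch into the container and apply it to the repo.
        subprocess.run(["docker", "cp", patch_file, f"{container}:/tmp/fix.patch"], check=True)
        subprocess.run(
            ["docker", "exec", "-w", "/app", container, "git", "apply", "/tmp/fix.patch"],
            check=True,
        )
        # Run the end-to-end tests; a zero exit code means the payout is "earned".
        result = subprocess.run(["docker", "exec", "-w", "/app", container, *test_cmd], check=False)
        return result.returncode == 0
    finally:
        # Always clean up the task container.
        subprocess.run(["docker", "rm", "-f", container], check=False)
```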
So far, even the best LLMs only manage a fraction of the total “earnings,” underlining how tricky actual freelance dev work remains. But these partial successes—and the money they correspond to—suggest that model-generated code is creeping ever closer to genuine economic viability.
If your LLM can pass E2E QA checks on tasks that real freelancers once solved, maybe your next big cost-saver is letting your agent take a crack at them, too.
Further Reading:
Paper: SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (Miserendino et al., 2025)