Question 1

What is SWE-bench?

Accepted Answer

SWE-bench evaluates language models on their ability to resolve real GitHub issues from popular Python repositories. The model is given an issue description and the repository state, and must produce a patch that resolves the issue and passes the project's existing test suite. SWE-bench is the benchmark that most closely tracks "useful for autonomous coding agents" because the tasks are not toy problems, the success criterion is the project's actual tests, and the input footprint forces the model to reason over real-world code at scale.

Question 2

Which AI model leads the SWE-bench leaderboard?

Accepted Answer

As of 2026-07-24, Claude Fable 5 from Anthropic leads the SWE-bench leaderboard with a score of 95%. The full ranked list of 19 models is on this page, updated as we ingest new scores.

Question 3

How is SWE-bench scored?

Accepted Answer

Scores are reported as resolution rate (the percentage of issues correctly patched). The headline number on TensorFeed is for the SWE-bench Verified subset: the human-validated tasks where the test suite has been confirmed to be a fair signal. Anything above 60% as of 2026 represents a genuinely useful coding agent; the top of the leaderboard on this page is at 95%.
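The resolution rate is a plain ratio, which a short sketch makes concrete. This assumes each evaluated task reduces to a single pass/fail flag; the function name `resolution_rate` and the sample data are illustrative and are not part of SWE-bench or TensorFeed tooling.

```python
def resolution_rate(resolved_flags):
    """Return the percentage of tasks whose patch resolved the issue.

    resolved_flags: iterable of booleans, one per evaluated task
    (a hypothetical input format chosen for illustration).
    """
    flags = list(resolved_flags)
    if not flags:
        raise ValueError("no tasks were evaluated")
    return 100.0 * sum(flags) / len(flags)


# Illustrative data only: 3 of 4 tasks resolved.
print(resolution_rate([True, True, False, True]))  # 75.0
```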

Question 4

Why does SWE-bench matter for AI agents?

Accepted Answer

If you are building a coding agent, this is the benchmark that matters most. Models with high SWE-bench scores produce patches that compile, pass tests, and respect existing patterns in the codebase. Models with low SWE-bench scores produce code that looks plausible but breaks the build.

Question 5

What does a SWE-bench score of 70%+ mean?

Accepted Answer

Frontier-class. Genuinely useful coding agent territory.

Question 6

What does a SWE-bench score of 50-70% mean?

Accepted Answer

Production-ready for assisted coding workflows.

Question 7

What does a SWE-bench score of 30-50% mean?

Accepted Answer

Useful for narrow tasks but not autonomous agents.

Question 8

What does a SWE-bench score of < 30% mean?

Accepted Answer

Plausible-looking code that often does not work.
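The four score bands can be written as a small lookup. This is a minimal sketch, not an official mapping: the function name `interpret_score` is hypothetical, and it assumes each band includes its lower bound (so exactly 70 counts as 70%+), which the bands as stated leave open.

```python
def interpret_score(score_percent):
    """Map a SWE-bench score (0-100) to its band description.

    Assumption: each band includes its lower bound, e.g. 70.0 is "70%+".
    """
    if not 0 <= score_percent <= 100:
        raise ValueError("score must be between 0 and 100")
    if score_percent >= 70:
        return "Frontier-class. Genuinely useful coding agent territory."
    if score_percent >= 50:
        return "Production-ready for assisted coding workflows."
    if score_percent >= 30:
        return "Useful for narrow tasks but not autonomous agents."
    return "Plausible-looking code that often does not work."


print(interpret_score(63.8))
# Production-ready for assisted coding workflows.
```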

#	Model	Provider	Score	Released
1	Claude Fable 5	Anthropic	95%	2026-06
2	Claude Opus 4.8	Anthropic	88.6%	2026-05
3	Claude Opus 4.7	Anthropic	87.6%	2026-04
4	Claude Sonnet 5	Anthropic	85.2%	2026-06
5	GPT-5.5	OpenAI	82.6%	2026-04
6	Claude Opus 4.6	Anthropic	80.8%	2026-03
7	DeepSeek V4 Pro	DeepSeek	80.6%	2026-04
8	Claude Sonnet 4.6	Anthropic	79.6%	2026-02
9	DeepSeek V4 Flash	DeepSeek	79%	2026-04
10	Mistral Medium 3.5	Mistral	77.6%	2026-05
11	Claude Haiku 4.5	Anthropic	73.3%	2026-01
12	Gemini 2.5 Pro	Google	63.8%	2026-01
13	o3-mini	OpenAI	49.3%	2025-11
14	o1	OpenAI	48.9%	2025-09
15	Mistral Large	Mistral	47.2%	2025-11
16	DeepSeek V3	DeepSeek	42%	2025-12
17	GPT-4.5	OpenAI	38%	2025-12
18	GPT-4o	OpenAI	33.2%	2025-05
19	Llama 4 Maverick	Meta	24%	2026-03
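
For readers who want to work with the table programmatically, here is a rough sketch that parses tab-separated rows into records and filters them by score. It uses four rows copied from the table; the function name `parse_leaderboard` and the record keys are assumptions made for this example, not a TensorFeed API.

```python
import csv
import io

# Four rows copied from the table, tab-separated as on the page.
LEADERBOARD_TSV = (
    "#\tModel\tProvider\tScore\tReleased\n"
    "1\tClaude Fable 5\tAnthropic\t95%\t2026-06\n"
    "5\tGPT-5.5\tOpenAI\t82.6%\t2026-04\n"
    "12\tGemini 2.5 Pro\tGoogle\t63.8%\t2026-01\n"
    "19\tLlama 4 Maverick\tMeta\t24%\t2026-03\n"
)


def parse_leaderboard(tsv_text):
    """Parse tab-separated leaderboard text into a list of dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {
            "rank": int(row["#"]),
            "model": row["Model"],
            "provider": row["Provider"],
            # Strip the trailing percent sign before converting.
            "score": float(row["Score"].rstrip("%")),
            "released": row["Released"],
        }
        for row in reader
    ]


rows = parse_leaderboard(LEADERBOARD_TSV)
# Models scoring 70% or higher among the sample rows.
print([r["model"] for r in rows if r["score"] >= 70])
# ['Claude Fable 5', 'GPT-5.5']
```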
