LIVE
OPUS 4.7: $15 / $75 per Mtok
SONNET 4.6: $3 / $15 per Mtok
GPT-5.5: $10 / $30 per Mtok
GEMINI 3.1: $3.50 / $10.50 per Mtok
SWE-BENCH leader: Claude Opus 4.7, 72.1%
MMLU-PRO leader: Opus 4.7, 88.4
VALS FINANCE leader: Opus 4.7, 64.4%
AFTA v1.0 whitepaper live at /whitepaper

SWE-bench leaderboard

SWE-bench evaluates language models on their ability to resolve real GitHub issues from popular Python repositories. The model is given an issue description and the repository state, and must produce a patch that resolves the issue and passes the project's existing test suite. Of the common benchmarks, SWE-bench most closely tracks usefulness for autonomous coding agents: the tasks are not toy problems, the success criterion is the project's own tests, and the size of the input forces the model to reason over real-world code at scale.
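The pass/fail rule described above can be sketched in a few lines. This is a minimal illustration, not the official harness: the function name `is_resolved` and its two inputs (whether the patch applied cleanly, and a mapping of test name to pass/fail) are assumptions made for the example.

```python
from typing import Mapping


def is_resolved(patch_applied: bool, test_results: Mapping[str, bool]) -> bool:
    """Return True only if the patch applied and every recorded test passed.

    Hypothetical helper: the real harness decides which tests count and
    how they are run. An empty result set is treated as a failure so that
    a run with no test evidence is never counted as a success.
    """
    if not patch_applied or not test_results:
        return False
    return all(test_results.values())


# Illustrative inputs only; the test names are made up.
print(is_resolved(True, {"test_parse": True, "test_render": True}))
print(is_resolved(True, {"test_parse": True, "test_render": False}))
```

Treating a single failing test as a failed task is what makes the metric strict: a patch that fixes the issue but breaks unrelated behavior does not count.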

Current leader
GPT-5.5 (OpenAI): 68.7%

Last refreshed 2026-05-24. 18 models scored on this benchmark.

Full leaderboard

#    Model               Provider    Score    Released
1    GPT-5.5             OpenAI      68.7%    2026-04
2    Claude Opus 4.7     Anthropic   65.4%    2026-04
3    DeepSeek V4 Pro     DeepSeek    63.8%    2026-04
4    Claude Opus 4.6     Anthropic   62.3%    2026-03
5    Gemini 2.5 Pro      Google      59.4%    2026-01
6    o1                  OpenAI      58.9%    2025-09
7    GPT-4.5             OpenAI      56.1%    2025-12
8    Claude Sonnet 4.6   Anthropic   55.7%    2026-02
9    Llama 4 Maverick    Meta        52.8%    2026-03
10   DeepSeek V3         DeepSeek    51.4%    2025-12
11   o3-mini             OpenAI      49.3%    2025-11
12   DeepSeek V4 Flash   DeepSeek    48.9%    2026-04
13   GPT-4o              OpenAI      48.5%    2025-05
14   Mistral Large       Mistral     46.2%    2025-11
15   Llama 4 Scout       Meta        44.6%    2026-02
16   Gemini 2.0 Flash    Google      43.1%    2025-10
17   Claude Haiku 4.5    Anthropic   41.2%    2026-01
18   Mistral Small       Mistral     34.7%    2025-09
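For readers who want to work with these rows programmatically, here is a small sketch that loads the leaderboard into memory and picks the top-scoring model per provider. The scores are copied from the table; the `Entry` type and the `best_per_provider` helper are names invented for this example and are not part of any TensorFeed API.

```python
from typing import Dict, List, NamedTuple


class Entry(NamedTuple):
    model: str
    provider: str
    score: float  # resolution rate in percent


# Rows copied from the leaderboard table, in rank order.
LEADERBOARD: List[Entry] = [
    Entry("GPT-5.5", "OpenAI", 68.7),
    Entry("Claude Opus 4.7", "Anthropic", 65.4),
    Entry("DeepSeek V4 Pro", "DeepSeek", 63.8),
    Entry("Claude Opus 4.6", "Anthropic", 62.3),
    Entry("Gemini 2.5 Pro", "Google", 59.4),
    Entry("o1", "OpenAI", 58.9),
    Entry("GPT-4.5", "OpenAI", 56.1),
    Entry("Claude Sonnet 4.6", "Anthropic", 55.7),
    Entry("Llama 4 Maverick", "Meta", 52.8),
    Entry("DeepSeek V3", "DeepSeek", 51.4),
    Entry("o3-mini", "OpenAI", 49.3),
    Entry("DeepSeek V4 Flash", "DeepSeek", 48.9),
    Entry("GPT-4o", "OpenAI", 48.5),
    Entry("Mistral Large", "Mistral", 46.2),
    Entry("Llama 4 Scout", "Meta", 44.6),
    Entry("Gemini 2.0 Flash", "Google", 43.1),
    Entry("Claude Haiku 4.5", "Anthropic", 41.2),
    Entry("Mistral Small", "Mistral", 34.7),
]


def best_per_provider(entries: List[Entry]) -> Dict[str, Entry]:
    """Map each provider to its highest-scoring entry."""
    best: Dict[str, Entry] = {}
    for entry in entries:
        current = best.get(entry.provider)
        if current is None or entry.score > current.score:
            best[entry.provider] = entry
    return best


for provider, entry in best_per_provider(LEADERBOARD).items():
    print(f"{provider}: {entry.model} ({entry.score}%)")
```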

Score interpretation

Scores are reported as resolution rate (the percentage of issues correctly patched). The headline number on TensorFeed is for the SWE-bench Verified subset: the human-validated tasks whose test suites have been confirmed to be a fair signal. Anything above 60% as of 2026 represents a genuinely useful coding agent; the top of this leaderboard currently stands at 68.7%.
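As a quick illustration of how a resolution rate is computed, the sketch below turns a list of per-task outcomes into a percentage. The function name `resolution_rate` and the sample outcomes are hypothetical and do not come from any real evaluation run.

```python
from typing import Sequence


def resolution_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of tasks resolved, where each outcome is True or False."""
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up outcomes for four tasks: three resolved, one not.
sample = [True, True, False, True]
print(f"{resolution_rate(sample):.1f}%")  # prints "75.0%"
```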

70%+: Frontier-class. Genuinely useful coding agent territory.
50-70%: Production-ready for assisted coding workflows.
30-50%: Useful for narrow tasks but not autonomous agents.
< 30%: Plausible-looking code that often does not work.
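The bands above translate directly into a lookup function. This is a sketch under one assumption: the listed ranges share their endpoints (50 and 70 each appear in two bands), so the code assigns a boundary score to the higher band. The function name `tier` is invented for the example.

```python
def tier(score: float) -> str:
    """Map a SWE-bench resolution rate (0-100) to its interpretation band.

    Assumption: boundary values (30, 50, 70) fall in the higher band,
    since the source ranges overlap at their endpoints.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 70:
        return "Frontier-class"
    if score >= 50:
        return "Production-ready for assisted coding workflows"
    if score >= 30:
        return "Useful for narrow tasks but not autonomous agents"
    return "Plausible-looking code that often does not work"


# Two scores taken from the leaderboard on this page.
print(tier(68.7))  # GPT-5.5
print(tier(34.7))  # Mistral Small
```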

Why this matters for AI agents

If you are building a coding agent, this is the benchmark that matters most. Models with high SWE-bench scores produce patches that compile, pass tests, and respect existing patterns in the codebase. Models with low SWE-bench scores produce code that looks plausible but breaks the build.


Premium API: time-series for SWE-bench

The leaderboard above is a snapshot. Want to see how a model's SWE-bench score has moved over the last 30-90 days, or set a webhook that fires when a score crosses a threshold? The premium API has both:

SWE-bench source · Last refreshed 2026-05-24 · Max score 100