
OpenAI Just Disproved an 80-Year Erdős Conjecture. The Model Was Not Trained for Math.

Kira Nolan · 7 min read

On Wednesday, May 20, OpenAI announced that an internal general-purpose reasoning model disproved a conjecture about the planar unit distance problem that Paul Erdős first posed in 1946. The proof runs roughly 125 pages, uses Golod-Shafarevich theory and infinite class field towers, and arrives at a polynomial improvement over the square-grid construction mathematicians had treated as effectively optimal for eighty years. Fields medalist Tim Gowers and Princeton mathematician Will Sawin verified it. Sawin tightened the bound to n raised to the power of one plus delta, where delta equals 0.014.

That is the headline. The structural story is the sentence OpenAI buried two paragraphs into the post: the model was not trained on the unit distance problem, it was not given problem-specific search tooling, and it was not a specialized theorem prover. It was a general-purpose reasoning model, the same kind of system OpenAI sells through the API.

I have been watching the AI-for-math beat closely since AlphaProof reached silver-medal standard at IMO 2024. Until this week, every public frontier result on a hard open problem leaned on a math-shaped harness: Lean integration, search-over-tactics, retrieval against a problem-specific corpus, RL on closely related families. This one did not. That is the line worth circling.

What the Problem Actually Asks

The planar unit distance problem is one of the oldest open questions in combinatorial geometry. Place n points in a plane. Count the pairs that are exactly distance 1 apart. What is the maximum count you can achieve as n grows large?
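To make the counting question concrete, here is a minimal Python sketch that brute-forces the count for a small point set. The function names are illustrative and appear nowhere in the announcement, and the integer grid is just a convenient test case. Working with squared distances on integer coordinates keeps the comparison exact, and choosing a different squared distance is the same as rescaling the grid so that distance becomes the unit.

```python
from itertools import combinations


def count_pairs_at_squared_distance(points, d2):
    """Count unordered pairs of points whose squared Euclidean distance equals d2."""
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 == d2
    )


def square_grid(k):
    """Return the k*k integer lattice points (x, y) with 0 <= x, y < k."""
    return [(x, y) for x in range(k) for y in range(k)]


grid = square_grid(10)  # n = 100 points
# Neighbouring grid points: distance exactly 1.
print(count_pairs_at_squared_distance(grid, 1))  # 180
# Treat sqrt(5) as the "unit" instead (equivalent to shrinking the grid).
print(count_pairs_at_squared_distance(grid, 5))  # 288
```

On the same 100 points, picking a distance that can be written as a sum of two squares in more ways yields more pairs. The brute-force loop is quadratic in n, so it only suits small experiments.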

The trivial upper bound is on the order of n squared, because n points determine n(n-1)/2 pairs to start with. The classical lower bound comes from a suitably scaled square grid, which Erdős showed gives n raised to the power of one plus c over log log n unit distances for some positive constant c. That is faster than linear but slower than n raised to any fixed power above one. The Erdős conjecture, in the version mathematicians were trying to defend through the 1970s, was that you cannot beat the square-grid construction by anything polynomial in n. Bumping the bound up by a fixed polynomial factor was treated as the kind of thing that, if it were possible, would have shown up already.
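In symbols, the standard statements look roughly as follows. The notation u(n) for the maximum number of unit-distance pairs among n points is introduced here for convenience, and the grid bound is given in the form usually quoted in the literature rather than taken from the announcement.

```latex
% u(n): maximum number of unit-distance pairs among n points in the plane
\[
  u(n) \;\le\; \binom{n}{2} = \frac{n(n-1)}{2}
  \qquad \text{(trivial: every pair counted)}
\]
\[
  u(n) \;\ge\; n^{\,1 + c/\log\log n}
  \qquad \text{(scaled square grid, Erd\H{o}s 1946, some constant } c > 0\text{)}
\]
\[
  \text{Conjecture:}\quad u(n) \;\le\; n^{\,1 + o(1)}
  \qquad \text{(no gain by a fixed power of } n\text{)}
\]
```

Because c / log log n tends to zero as n grows, the grid bound is consistent with the conjecture. Any construction reaching a fixed exponent above one would contradict it.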

The OpenAI proof constructs an infinite family of configurations that breaks that assumption. It exhibits arrangements with n raised to one plus delta unit-distance pairs, for a fixed positive delta. Sawin pinned delta down to 0.014. That is small in absolute terms and structural in mathematical terms: an existence proof that the gap is at least polynomial closes off an entire class of upper-bound arguments mathematicians had been building since the 1980s.
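A quick back-of-the-envelope sketch shows why a delta of 0.014 is "small in absolute terms". It computes the factor n raised to delta by which n raised to one plus delta exceeds plain n. Constants are ignored, the function name is illustrative, and the comparison is against linear growth rather than against the grid bound itself.

```python
import math

DELTA = 0.014  # exponent gain reported in the article


def polynomial_gain(n, delta=DELTA):
    """Factor by which n**(1 + delta) exceeds n, i.e. n**delta."""
    return n ** delta


for n in (10**3, 10**6, 10**12, 10**100):
    exponent = round(math.log10(n))
    print(f"n = 10^{exponent}: n^delta = {polynomial_gain(n):.3f}")
```

At a million points the factor is only about 1.2, and even at 10^100 points it is about 25. The structural point is that n raised to a fixed delta keeps growing without bound, while the grid's extra exponent c / log log n shrinks toward zero, so the new construction eventually wins for any value of c.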

The Surprise Was the Toolbox

The construction does not come from combinatorial geometry. It comes from algebraic number theory, specifically Golod-Shafarevich theory on infinite class field towers, a piece of machinery developed in the 1960s to settle whether class field towers of algebraic number fields can go on forever. The bridge between number-theoretic towers and a count of unit-distance pairs in the plane is what reviewers flagged as the surprise. Two subfields of mathematics that rarely talk to each other were glued together for a result in neither one's home territory.

That is also where the "general purpose" framing earns its keep. A system trained narrowly on combinatorial geometry would have had no reason to reach into class field towers. A system trained narrowly on number theory would have had no reason to produce a unit-distance construction. A model whose pretraining mix covers both, plus enough connective tissue to suggest the bridge, can. That is the capability that becomes interesting at scale.

How It Compares to Prior Frontier Math Runs

| System | Result | Math-Specific Scaffolding | Date |
| --- | --- | --- | --- |
| OpenAI (this model) | Disproof, planar unit distance | None | May 20, 2026 |
| DeepMind AlphaProof / AlphaGeometry 2 | Silver medal, IMO 2024 | Lean integration, geometry DSL | Jul 2024 |
| DeepMind FunSearch | Cap-set lower bound (n=8) | Program-search loop | Dec 2023 |
| Numina (open-source) | AIMO Progress Prize | Math-tuned base, tool use | 2024 to 2025 |
| Anthropic Claude Mythos | Frontier cyber, math reasoning | Cyber range scaffolding | Apr to May 2026 |

Read this table the right way. The earlier systems are not embarrassed by the comparison; each was designed to push a specific frontier. The line OpenAI claims to have crossed is that the same model you would deploy to draft code, summarize a deposition, or run an autonomous agent produced this result. Gowers and Sawin verified the mathematics. The systems claim sits with OpenAI and will need a paper to fully stand up.

Why 125 Pages Matters

The page count is doing real work in the announcement. Coherent mathematical proofs that run more than twenty pages are difficult for current models. Most produce arguments that drift, repeat lemmas without naming them, or accept circular reasoning by the eighth or ninth step. 125 pages of proof that two reviewers can follow is, on its own, a frontier result on long-horizon coherence.

For comparison, a typical SWE-Bench Pro trajectory from the same class of model runs 5 to 15 tool calls before the agent loses the plot. The unit-distance proof is essentially one continuous tool call against an internal scratchpad. Whatever inference-time technique is doing the heavy lifting here (likely a tree-search variant over partial proof states, with self-verification at each node) is something OpenAI has not described in detail and probably will not until a paper drops.

What It Does to the Research-Discovery Rail

We have been writing about the research-discovery rail (the layer where models propose, verify, and ship novel scientific results) as a category that was approaching but not yet present. Claude Mythos surfacing 271 Firefox zero-days in one cycle was the security version of that rail. The OpenBSD vulnerability that survived 27 years of human review, flagged by Mythos last month, was another. This is the mathematics version.

The pattern across all three: a model with no domain-specific scaffolding finds something experts missed for years or decades, in a domain where exhaustive expert attention is the baseline. The interesting variable is no longer whether the models can do this. It is how often, in which domains, and under what verification regime.

For mathematics specifically, the next thirty days are the test. The unit-distance proof will be picked apart at the arXiv level. Reviewers will look for gaps Sawin and Gowers did not catch, for hidden assumptions, for places where the model leaned on a near-duplicate result that already existed in the literature. If it holds, OpenAI will almost certainly run the same model against other open problems on the Erdős list, and the next announcement will tell us whether this was a singular finding or a repeatable process.

The Capability the Market Is Not Yet Pricing

The current API pricing tier for general-purpose reasoning models tops out at $30 per million output tokens (GPT-5.5) and $75 per million output tokens (Claude Opus 4.7). Those prices are set against agentic-coding and long-context workloads. Neither is priced against "produce a 125-page novel proof on a 1946 open problem." That use case does not have a market yet, because until this week it was not on the menu.

The interesting question is what happens to research budgets at universities, pharma R&D arms, and national labs once a vendor can credibly say "our base model produced a novel polynomial improvement on an Erdős conjecture without any math-specific training." The answer is probably not "we replace mathematicians." It is more likely "every senior researcher gets a multi-million-dollar inference budget and a list of their twenty favorite open problems to throw at it."

That is a different shape of demand than the current API mix. It is closer to compute procurement than to per-token inference, and it favors the vendor that can sustain high-cost, long-horizon runs at low marginal failure rates. OpenAI is signaling it wants the seat. Anthropic, with Mythos' record on autonomous discovery in security and the Karpathy pretraining team that landed on Tuesday, is signaling the same thing from the other direction.

What to Watch

Three signposts over the next month. First, whether the proof survives independent formalization in Lean or Coq. Gowers and Sawin reviewed it on paper. A formal verification run, even on a few load-bearing lemmas, is the next gate. Second, whether OpenAI publishes the inference-time recipe. The 125-page coherence is the systems-level claim, and it needs a methods section to stand up. Third, whether a second open problem falls to the same model. If it does, the research-discovery rail is no longer an emerging category. It is the category that defines the next pricing tier.

For now, the right framing is that the line between "general-purpose chat model" and "research-grade scientific instrument" got blurrier this week. The model is the same one OpenAI was already shipping. The capability is the one nobody had named yet.