Solving Decades-Old Math While Failing at Basic Code
The artificial intelligence sector is living through a paradox. At one end of the spectrum, specialized AI systems are achieving historic breakthroughs in pure logic and mathematics. At the other, autonomous AI coding agents applied to everyday software engineering are leaving developers with bugs that are hard to find and harder to fix.
Recently, Google DeepMind’s AlphaProof Nexus made headlines by autonomously solving nine open Erdős problems. Two of these problems had stumped human mathematicians for 56 years. Impressively, the inference cost was just a few hundred dollars per problem. DeepMind achieved this not by relying solely on natural-language guessing, but by integrating the Lean compiler to automatically verify every proof step. Even so, the system’s overall success rate was only 2.5%.
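To make "verify every proof step" concrete, here is a toy illustration in Lean 4. It is not taken from DeepMind's system, and the theorem name is made up for this example; the point is only that Lean accepts the file if, and only if, the supplied proof actually establishes the stated claim.

```lean
-- Illustrative only: a toy statement, not one of the Erdős problems.
-- The name `add_comm_toy` is hypothetical.
theorem add_comm_toy (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b  -- Lean checks that this term really proves the statement
```

Swap in a proof that does not establish the statement and the compiler rejects it, no matter how plausible the accompanying prose sounds. That is the kind of guardrail a model's confident output cannot talk its way past.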
Fast Prototypes, Hidden Bugs
Contrast this mathematical triumph with the reality of day-to-day software development. Renowned programmer George Hotz recently stated that heavy reliance on autonomous AI coding agents will go down as “one of the most costly mistakes” in software development. After rigorous testing, Hotz noted that Large Language Models (LLMs) are fantastic at generating fast prototypes but fall apart on the details. They consistently introduce subtle bugs that grow harder for human developers to identify and fix as the codebase expands.
Compounding the issue is the problem of “attribution hallucination.” Researchers at Peking University recently developed the CiteVQA benchmark, which showed that leading models like GPT and Gemini routinely cite text passages that don’t actually support their answers. Even when the AI gives the correct answer, the evidence it cites is frequently fabricated, which creates serious risks for regulated industries where every claim must be traceable to a source.
We are treating AI like a senior engineer when, in reality, it acts more like a brilliant but reckless intern who works at light speed but refuses to double-check their work.
Why It Matters
The contrast between DeepMind’s math success and the failure of general coding agents highlights a fundamental limitation of current generative AI: without rigorous, programmatic guardrails (like the Lean compiler) checking its output, it cannot be trusted to get the details right. For the software industry, this is a wake-up call. Companies firing junior developers in favor of AI coding agents may be taking on massive technical debt. The industry must pivot from treating LLMs as autonomous software engineers to using them as high-powered typing assistants, and build strict, automated verification into the AI workflow before the bugs become unmanageable.
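What might such a guardrail look like in an ordinary codebase? The following is a minimal Python sketch, not a description of any real product: the function names, the hard-coded candidate snippets, and the test cases are all assumptions made for illustration. Model-generated code is treated as an untrusted candidate and accepted only if it passes an automated check, in the same spirit as Lean accepting only proofs that verify.

```python
from typing import Optional

# Hypothetical stand-ins for model output. A real pipeline would receive
# these from an LLM; here they are hard-coded so the sketch is runnable.
CANDIDATES = [
    # Plausible-looking but subtly wrong: ignores even-length inputs.
    "def median(xs):\n    return sorted(xs)[len(xs) // 2]\n",
    # Handles both odd and even lengths.
    "def median(xs):\n"
    "    s = sorted(xs)\n"
    "    m = len(s) // 2\n"
    "    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2\n",
]

# Each case is (positional arguments, expected result).
CASES = [(([1, 3, 2],), 2), (([4, 1, 3, 2],), 2.5)]


def passes_checks(source: str, func_name: str, cases: list) -> bool:
    """Return True only if the candidate compiles and passes every case."""
    namespace: dict = {}
    try:
        # NOTE: exec on untrusted code is unsafe outside a sandbox;
        # it is used here only to keep the illustration short.
        exec(compile(source, "<candidate>", "exec"), namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing names, and crashes all count as failures.
        return False


def first_verified(candidates: list, func_name: str, cases: list) -> Optional[str]:
    """Return the first candidate that passes verification, or None."""
    for source in candidates:
        if passes_checks(source, func_name, cases):
            return source
    return None


accepted = first_verified(CANDIDATES, "median", CASES)
```

The first candidate is exactly the kind of fast prototype that looks right and passes a casual glance; only the even-length test case exposes it. A test suite is a far weaker guarantee than a formal proof, but the design principle is the same: nothing the model writes is merged on its own say-so.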