Armin Ronacher dropped a post over the weekend that hit me right in the cognitive dissonance. The creator of Flask and Jinja2 and a co-founder of Sentry, someone whose opinion on software tooling I genuinely respect, published something called “Better Models: Worse Tools.” And the title says everything you need to know.

Here’s the short version: newer, smarter language models are actually getting worse at using tools that don’t look exactly like Claude Code’s internal interface. They hallucinate extra fields in tool calls. They invent parameters that don’t exist in your schema. And the root cause isn’t stupidity — it’s over-training.
Let me unpack what Armin found, why it matters, and what it says about the direction AI tooling is headed.
What’s Actually Breaking
The bug Armin documented is deceptively simple. Pi — his AI coding tool — uses an edit tool that expects an edits[] array with oldText and newText fields. Straightforward stuff. Older Claude models (Opus 4.5, Haiku) handled this perfectly. But Opus 4.8 and Sonnet 5 started appending random extra keys to the tool call arguments.
We’re not talking about the model getting the edit wrong. The actual text replacement was byte-correct every time. The model produced the right answer, then appended something like "requireUnique": true or "matchCase": true or, my personal favorite, "oldText_2": "...". It’s like solving a Rubik’s cube and then scribbling a grocery list on the side.
The full list of invented keys Armin documented is genuinely impressive: type, id, kind, unique, requireUnique, matchCase, in_file, forceMatchCount, children, notes, cost, oldText2, newText2, oldText_2, newText_2, and even event.0.additionalProperties — which looks suspiciously like a JSON Schema error message leaking into the model’s output space.
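To make the failure concrete, here is a minimal sketch of how a harness could tally invented keys across recorded tool calls. The oldText and newText names come from the post; the assumption that the extra keys land inside each edits[] entry, and the sample calls themselves, are mine and are hand-written rather than real model output.

```python
from collections import Counter

# Field names from the post; where the extra keys appear is an assumption.
ALLOWED_EDIT_KEYS = {"oldText", "newText"}


def invented_keys(tool_args: dict) -> list[str]:
    """Return every key inside an edits[] entry that the schema does not define."""
    extras: list[str] = []
    for edit in tool_args.get("edits", []):
        extras.extend(sorted(set(edit) - ALLOWED_EDIT_KEYS))
    return extras


def tally(calls: list[dict]) -> Counter:
    """Count how often each invented key shows up across many calls."""
    counts: Counter = Counter()
    for call in calls:
        counts.update(invented_keys(call))
    return counts


# Hand-written sample calls for illustration only.
sample_calls = [
    {"edits": [{"oldText": "a", "newText": "b"}]},
    {"edits": [{"oldText": "a", "newText": "b", "requireUnique": True}]},
    {"edits": [{"oldText": "a", "newText": "b", "matchCase": True, "oldText_2": "..."}]},
]

print(tally(sample_calls))
```

Running something like this over a log of real calls per model version is a cheap way to see whether an upgrade changed how the model treats your schema.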
The Slop Harness Hypothesis
Armin’s explanation for why this happens is what makes the article worth reading. He argues it’s a training artifact — not model decay in the traditional sense, but a side effect of how Anthropic post-trains its models.
Here’s the chain of reasoning:
1. Claude Code is the reference harness. Anthropic’s post-training and reinforcement learning pipeline is heavily optimized for Claude Code’s own tools. That toolset uses a flat schema — file_path, old_string, new_string, and an optional replace_all flag. Simple, forgiving, permissive.
2. The harness absorbs errors. Claude Code’s client-side logic silently filters unknown keys from model outputs. It accepts parameter aliases. It retries malformed calls. If the model outputs a tool call with "replaceAll": true alongside "file_path": "foo.py", Claude Code just strips the extra field and proceeds. The RL reward system never penalizes the hallucination because the task still completes.
3. Stronger priors fight harder. The newer, more capable model has a stronger learned expectation that an edit operation includes an extra optional field — because in Claude Code, it does. When Pi’s schema doesn’t provide that field, the model fills it in with whatever token seems most probable at that high-entropy point in the generation.
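The chain above is easier to see with a toy model. This is a sketch of the mechanism as the post describes it, not Anthropic's actual pipeline: the two harness functions and the reward function are hypothetical, and only the field names and the "replaceAll" example come from the text.

```python
# Field names the post attributes to Claude Code's edit tool.
KNOWN_KEYS = {"file_path", "old_string", "new_string", "replace_all"}


def lenient_harness(args: dict) -> dict:
    """Silently drop unknown keys, as the post describes the reference harness doing."""
    return {k: v for k, v in args.items() if k in KNOWN_KEYS}


def strict_harness(args: dict) -> dict:
    """Reject any call that carries a key the schema does not define."""
    unknown = sorted(set(args) - KNOWN_KEYS)
    if unknown:
        raise ValueError(f"unknown keys: {unknown}")
    return dict(args)


def reward(harness, args: dict) -> float:
    """Toy reward signal: 1.0 if the harness accepts the call, else 0.0."""
    try:
        harness(args)
    except ValueError:
        return 0.0
    return 1.0


clean = {"file_path": "foo.py", "old_string": "a", "new_string": "b"}
noisy = {**clean, "replaceAll": True}  # the extra key from the example above

print(reward(lenient_harness, clean), reward(lenient_harness, noisy))  # 1.0 1.0
print(reward(strict_harness, clean), reward(strict_harness, noisy))    # 1.0 0.0
```

Under the lenient harness the clean and noisy calls are indistinguishable to the reward, so nothing pushes the model away from the extra key.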
Armin puts it bluntly: “The better-trained model might actually fight you harder because its prior is stronger.” That sentence is going to stick with me for a while.
Why This Matters Beyond a Single Bug
I’ve been using AI coding tools daily for about a year and a half now, and this post articulated something I’ve felt but couldn’t name. The tools we build on top of these models are not neutral. The interface you design — the shape of your tool schemas — determines how well the model can use them, and that relationship is getting less predictable, not more.
This isn’t just a Pi problem. Anyone building agentic workflows, custom tool harnesses, or AI-powered automation is going to run into this. The model you pick today might work perfectly with your interface. The model that’s two generations newer, supposedly smarter, might break everything because its training distribution shifted closer to some internal harness you can’t see.
There’s a parallel here to something I wrote about a while back with DuneSlide and AI agent security. When a dominant player controls both the model and the reference implementation, everyone else ends up inheriting the quirks of that vertical integration — whether it’s security blind spots or tool call hallucinations.
What Harness Builders Can Actually Do
Armin suggests three practical responses, and they’re worth internalizing:
Use strict mode. Anthropic’s strict tool invocation mode completely eliminates the hallucination problem. The trade-off is that it has hard limits on tool definition complexity, but if you’re building a serious tool, this should be your default.
Design flat schemas. The closer your tool schema is to Claude Code’s internal format, the better it will perform. That’s not ideal from an API design perspective, but it’s reality. A flat file_path + old_string + new_string pattern is more robust than a nested edits[] array that requires the model to generate JSON inside a string parameter.
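For reference, here is roughly what the two shapes look like as JSON Schema, plus a small adapter. Treat it as a sketch: the flat field names and oldText/newText come from the post, while the "path" key, the "additionalProperties": False settings, and the to_flat_calls helper are my own assumptions for illustration.

```python
# Flat shape, using the field names the post attributes to Claude Code.
flat_edit_schema = {
    "type": "object",
    "properties": {
        "file_path": {"type": "string"},
        "old_string": {"type": "string"},
        "new_string": {"type": "string"},
        "replace_all": {"type": "boolean"},
    },
    "required": ["file_path", "old_string", "new_string"],
    "additionalProperties": False,
}

# Nested shape with an edits[] array. "path" is a hypothetical name.
nested_edit_schema = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "edits": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "oldText": {"type": "string"},
                    "newText": {"type": "string"},
                },
                "required": ["oldText", "newText"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["path", "edits"],
    "additionalProperties": False,
}


def to_flat_calls(nested_args: dict) -> list[dict]:
    """Expand one nested call into one flat call per edit entry."""
    return [
        {
            "file_path": nested_args["path"],
            "old_string": edit["oldText"],
            "new_string": edit["newText"],
        }
        for edit in nested_args["edits"]
    ]


nested_call = {
    "path": "foo.py",
    "edits": [
        {"oldText": "a", "newText": "b"},
        {"oldText": "c", "newText": "d"},
    ],
}
print(to_flat_calls(nested_call))
```

The adapter runs in both directions in principle: a harness could expose the flat shape to the model and batch the edits internally, keeping its own nested representation out of the model's view.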
Consider OpenAI’s approach. Armin notes that OpenAI’s Harmony format uses in-band grammar markers (<|constrain|>json) to switch to constrained decoding mid-generation. This lets the model dynamically enter a JSON-constrained sampling mode for complex tool arguments. It’s more robust against these kinds of hallucinations because the sampling is constrained at the generation level, not just the parsing level.
I’d add a fourth: build your error handling before you ship. If you’re building a tool that relies on LLM tool calls, assume the model will occasionally send you garbage. Validate schemas strictly on the receiving end. Log unexpected fields. Build retry logic that feeds the error back to the model. Don’t silently filter: that just reinforces the hallucination pattern, much like the phantom squatting problem with AI-hallucinated domains that I covered recently.
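Here is a minimal sketch of that validate, log, and retry loop. Everything in it is illustrative: ask_model stands in for whatever function calls your provider, the error message wording is made up, and the schema is the edits[] shape described earlier in this post.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("tool_calls")

ALLOWED_EDIT_KEYS = {"oldText", "newText"}


def validate_edit_call(args: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the call is valid."""
    edits = args.get("edits")
    if not isinstance(edits, list) or not edits:
        return ["'edits' must be a non-empty array"]
    errors: list[str] = []
    for i, edit in enumerate(edits):
        if not isinstance(edit, dict):
            errors.append(f"edits[{i}] must be an object")
            continue
        for key in sorted(set(edit) - ALLOWED_EDIT_KEYS):
            errors.append(f"edits[{i}]: unexpected key '{key}'")
        for key in sorted(ALLOWED_EDIT_KEYS - set(edit)):
            errors.append(f"edits[{i}]: missing key '{key}'")
    return errors


def call_with_retries(
    ask_model: Callable[[Optional[str]], dict], max_attempts: int = 3
) -> dict:
    """Ask for a tool call, rejecting invalid ones and feeding the errors back."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        args = ask_model(feedback)
        errors = validate_edit_call(args)
        if not errors:
            return args
        # Log instead of silently filtering, so the drift is visible.
        logger.warning("attempt %d rejected: %s", attempt, errors)
        feedback = "Tool call rejected: " + "; ".join(errors)
    raise RuntimeError(f"no valid tool call after {max_attempts} attempts")


def fake_model(feedback: Optional[str]) -> dict:
    """Stand-in for a real model: adds an extra key until it is told off."""
    if feedback is None:
        return {"edits": [{"oldText": "a", "newText": "b", "requireUnique": True}]}
    return {"edits": [{"oldText": "a", "newText": "b"}]}


print(call_with_retries(fake_model))
```

The point of rejecting rather than stripping is that the model sees its own mistake in context, and your logs show how often it happens per model version.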
The Bigger Picture
What makes “Better Models: Worse Tools” important isn’t the bug report. It’s what it reveals about the current state of AI platform development. We’re in a phase where the most capable models are also the most opinionated about how they should be used. Every RLHF cycle, every post-training optimization, every round of reinforcement learning hardens the model’s interface expectations.
If you’re building on Anthropic, you’re not just using their model; you’re inheriting the shape of Claude Code’s tool ecosystem, whether you realize it or not. The same dynamic is playing out across the industry. OpenAI pushes agents through Codex. Google routes everything through Gemini’s function calling. Claude Sonnet 5 was the latest sign of that shift, and each platform is creating a gravity well around its own interface conventions.
For developers like me who build tooling on top of these models, the lesson is uncomfortable but necessary. Portability across model providers is going to get harder, not easier. The abstraction layers we build today (MCP servers, agent frameworks, custom harnesses) will need to account for the fact that model A and model B have different implicit assumptions about tool shape, even when they both technically support the same function calling API.
I don’t think this is permanent. The market tends to punish excessive lock-in over time. But for the next year or two, this is the reality. If you’re picking a model for an agentic workflow, test it against your specific tool schemas, not just general benchmarks. And if you’re building a tool that needs to survive model upgrades, validate tool calls strictly, build error handling that recovers from bad ones, use strict mode when available, and assume that the next “better” model might actually be worse at your specific task.
Armin’s full post is worth your time if you build anything on top of LLM tool calling. You can read it at lucumr.pocoo.org. It’s a short read but it’ll change how you think about model-tool interfaces.