Sakana Fugu Just Matched Anthropic's Best Models Without Accessing Them — A Deep Dive Into the Multi-Agent Orchestration Approach

Tokyo Tower and city skyline at night representing Sakana AI's Tokyo headquarters — *Image: David Kernan via Wikimedia Commons (CC BY 4.0)*

Something interesting happened today. A Tokyo-based AI lab called Sakana AI released a product that doesn’t just compete with frontier models — it rethinks what a “model” even means. And the timing couldn’t be more deliberate.

Hot on the heels of the US government restricting access to Anthropic’s most advanced systems, Sakana AI’s Sakana Fugu arrives as both a technological achievement and a strategic hedge. It’s not a bigger model. It’s a smarter coordinator — a 7-billion-parameter model trained to orchestrate the world’s best AI systems through a single API, matching the performance of models it can’t even access directly.

Let me break down what Fugu actually is, how it performs, what it costs, and — most importantly — how it stacks up against the current frontier.

Table of Contents

What Is Sakana Fugu, Exactly?

This is where it gets interesting. Fugu isn’t a monolithic model like GPT-5.5 or Claude Opus 4.8. It’s a multi-agent orchestration system delivered as a foundation model API. You send a request to one endpoint, and Fugu decides which specialist models should handle which subtasks, how they should communicate, and how to combine their outputs into a single coherent answer.

Under the hood, Fugu is a fine-tuned Qwen2.5-7B — a relatively small 7-billion-parameter model — trained via reinforcement learning using techniques Sakana published in two ICLR 2026 papers: TRINITY (an evolved LLM coordinator that assigns Thinker/Worker/Verifier roles) and Conductor (which learns to discover natural-language coordination strategies through trial and error).

What makes this approach radical is that the orchestration is learned, not hardcoded. It’s not a LangChain pipeline or a hand-crafted MoA pattern. Fugu organically discovered strategies like difficulty gating (simple questions get 1–2 agents, complex coding tasks get up to 4), specialization (assigning Gemini 2.5 Pro as the planner, GPT-5 as the code optimizer), and even recursive self-orchestration — calling itself again when it senses its first attempt fell short.

As Sakana’s Yujin Tang put it: “A small model, by reading itself, can iterate toward answers that neither it nor any of its workers could reach in a single pass.”

Two Flavors: Fugu and Fugu Ultra

Sakana launched two variants today:

Fugu — Balanced for everyday tasks: coding, code review, chatbots, lightweight research. Lower latency, configurable agent pool.
Fugu Ultra — The full orchestration system optimized for maximum quality on hard problems: research, security assessments, patent analysis, deep multi-step reasoning.

Both use the same OpenAI-compatible API, so teams already wired into GPT or Claude can swap endpoints in minutes.

Benchmark Performance: Matching the Restricted Frontier

This is the headline that matters. Fugu Ultra matches or beats Anthropic’s Fable 5 and Mythos Preview — models that are explicitly not in its agent pool due to US export controls. Those models aren’t even available to route to. Fugu matches them using only publicly accessible models.

Here’s how Fugu and Fugu Ultra compare against the top publicly available models:

Benchmark	Fugu	Fugu Ultra	Opus 4.8	Gemini 3.1 Pro	GPT-5.5
SWE-Bench Pro	59.0	73.7	69.2	54.2	58.6
LiveCodeBench Pro	87.8	90.8	84.8	82.9	88.4
Humanity’s Last Exam	47.2	50.0	49.8	44.4	41.4
GPQA-D	95.5	95.5	92.0	94.3	93.6
SciCode	60.1	58.7	53.5	58.9	56.1
MRCRv2	86.6	93.6	87.9	84.9	94.8

Fugu Ultra leads or ties on 8 out of 10 benchmarks. The standard Fugu variant even edges ahead on SciCode and Long Context Reasoning, suggesting the lighter calibration is sometimes better for document-heavy tasks. On real-world application testing — Rubik’s Cube solving (300/300 solved), automated research, mechanical CAD design, and financial time-series prediction — Fugu Ultra consistently outperformed GPT-5.5, Opus 4.8, and Gemini 3.1 Pro individually.

One stat that stopped me: in a blind trading simulation, Fugu Ultra returned +19.43% across five runs. Every other frontier model returned under +15%. That’s not investment advice, but the capability gap is real.

Pricing Comparison: Is Fugu Actually Cheaper?

This is where things get nuanced. Let me put Fugu Ultra’s pricing next to the current frontier:

Model	Input (per 1M tokens)	Output (per 1M tokens)	Cached Input
Fugu Ultra	$5	$30	$0.50
GPT-5.5	$5	$30	$0.50
Claude Opus 4.8	$5	$25	$0.50
Gemini 3.1 Pro	$2	$12	$0.20
Claude Fable 5	$10	$50	$1.00
Claude Mythos Preview*	$25	$125	N/A

* Mythos Preview is restricted — not publicly accessible.

At first glance, Fugu Ultra costs the same as GPT-5.5 on input and output — $5/$30 per million tokens. But here’s the catch: Fugu Ultra uses orchestration tokens behind the scenes. Every time Fugu routes a subtask to a worker model, those tokens count toward your bill. Sakana is transparent about this — they return separate orchestration_input_tokens and orchestration_output_tokens fields in their API response, and they’re billed at the same rate as your visible tokens.

The early user reports suggest the orchestration overhead averages about 1,820 tokens per request — compared to Mixture-of-Agents frameworks that burn through 11,200+ tokens for similar coordination. The RL-trained conductor is remarkably efficient, averaging just 3 steps per workflow.

Sakana also offers a no-fee-stacking guarantee: even when Fugu coordinates multiple models, you pay a single rate based on the top-tier model involved, not the sum of all agents.

Subscription Plans

For lighter workloads, there’s a subscription tier:

Standard ($20/month) — Lightweight daily usage
Pro ($100/month) — Regular coding and analysis (10× Standard allowance)
Max ($200/month) — Heavy long-running workloads (20× Standard allowance)

Subscribe before the end of July 2026 and your second month is free. Every tier includes both Fugu and Fugu Ultra.

The subscription model makes Fugu attractive for individual developers. At $200/month for the Max plan, you get 20× the baseline allowance with both variants included. Compare that to building your own multi-agent setup where you’re paying individual API bills for GPT-5.5 AND Claude AND Gemini — plus the engineering time to wire them together. The broader AI pricing landscape has been shifting dramatically, with providers rethinking how they charge for increasingly capable models.

Wait — How Can a 7B Model Beat Frontier Models?

This is the question I kept asking while reading Sakana’s benchmarks. The answer is surprisingly elegant: orchestration is a different kind of intelligence than raw generation.

The 7B conductor model doesn’t need to be the smartest model in the room. It needs to be the best manager — knowing which specialist to call, when to delegate, and how to synthesize outputs. This is analogous to how a tech lead doesn’t need to be the best coder on every layer of the stack; they need to know who to ask and when.

In VentureBeat’s deep-dive on the Conductor paper, Sakana’s team highlighted the 84% reduction in token usage compared to hard-coded Mixture-of-Agents frameworks. The conductor uses reinforcement learning to discover efficient coordination patterns, rather than burning tokens on redundant parallel processing.

The recursive self-orchestration feature is particularly clever. When Fugu senses its first attempt was weak, it can call itself again, read its prior output as context, and revise its coordination strategy. The depth of this recursion becomes a tunable compute axis at inference time — without retraining.

The Strategic Angle: AI Sovereignty

Here’s the part that resonates beyond benchmarks. The US government recently restricted access to Anthropic’s most advanced models (Fable 5, Mythos Preview) over export control concerns. Developers and enterprises outside the US — that includes us in the Philippines — effectively lost access to frontier capability overnight.

Sakana Fugu is explicitly positioned as a hedge against this. The agent pool is entirely swappable — if one provider becomes unavailable, Fugu routes around it. The system improves automatically as better models enter the pool. You don’t rebuild integrations or re-architect workflows.

Sakana’s tagline — “Collective intelligence serves as the practical hedge against this concentration of power” — isn’t just marketing. It’s a genuine architectural response to the geopolitical reality of AI in 2026.

Who Should Use Fugu (and Who Shouldn’t)

Based on the beta feedback and benchmarks, I’d say Fugu shines in specific scenarios:

Great for:

Multi-step research and paper reproduction (beta users reported 3-4 day tasks compressed to hours)
Security assessments — one tester ran a full end-to-end assessment with recon, XSS/SQLi checks, and a clean evidence-backed report
Code review — beta feedback consistently mentions Fugu finding 20+ issues where GPT-5.5 flags 3
Complex agentic workflows that cross multiple domains
Teams outside the US affected by model export restrictions

Probably overkill for:

Simple Q&A or factual lookup (as Sakana’s team admits: “It’s hard to beat the economic proposition of a local model running on the user’s machine for simple queries”)
Short-context chat applications where latency matters more than depth
Teams already happy with a single model’s performance on their specific workload

Availability and Caveats

Fugu is generally available today via console.sakana.ai with an OpenAI-compatible endpoint. One notable gap: it’s currently unavailable in the EU and EEA while Sakana works toward GDPR compliance. If you’re in Europe, you’ll need to wait.

On the tech side, there are real unknowns. How does the orchestration perform under production load across thousands of concurrent users? How predictable is the latency for latency-sensitive applications? And while the benchmark results are impressive against Fable 5 and Mythos Preview, those comparisons use reported scores — neither model was in Fugu’s pool for testing.

Also worth noting: the orchestration tokens add to your bill in a way that isn’t immediately visible from just the headline pricing. A $5/$30 rate looks identical to GPT-5.5, but the effective per-task cost depends on how much orchestration overhead your specific workload triggers.

Bottom Line

Sakana Fugu represents a genuine architectural shift in how we think about frontier AI performance. Instead of scaling models vertically, it scales coordination — and the early evidence suggests this approach works. The fact that a 7B conductor can orchestrate access to models it can’t even directly match, and still come out ahead on most benchmarks, tells me the orchestration paradigm has legs.

For developers and enterprises — especially those of us outside the US — Fugu offers something the current frontier models can’t: a hedge against geopolitical disruption, with performance that’s competitive today and can only improve as more models enter the pool. It’s yet another sign that the AI landscape is diversifying in ways that benefit everyone.

Is it a GPT-5.5 killer? No. It’s something more interesting — a different category of product altogether. And in a year where AI access is increasingly shaped by regulation and geopolitics, that might matter more than any single benchmark score.

Featured image: Tokyo Tower at night in Minato City, Tokyo. Sakana AI is headquartered in Tokyo, Japan. Image by David Kernan via Wikimedia Commons (CC BY 4.0).