The Chip That Only Speaks Transformer — Etched and the Battle to Make AI Inference Cheap

Two Harvard dropouts built a chip that can only do one thing: run transformer models. That bet just got them a $5 billion valuation and a billion dollars in pre-orders. Here’s why it matters for everyone building with AI.

—

Inference is the quiet bottleneck nobody talks about enough.

We obsess over training — the arms race to build bigger models, more parameters, more compute crammed into a single training run. But once those models are built, someone has to actually *serve* them. Every time you ask ChatGPT a question, every time an AI agent summarizes a document, every time a coding assistant completes a line of code — that’s inference. And it’s eating the AI industry alive.

Meta spent over $60 billion on AI infrastructure last year, and the majority of that went toward inference: serving models to billions of users. (I explored this when I covered Micron’s position in the AI chip race.) OpenAI’s API costs are fundamentally an inference problem. Every startup building on top of foundation models is fighting the same math: the cost per token determines whether their business model works or bleeds money.

This is the problem Etched is trying to solve. And their approach is audacious: build a chip that *only* does transformers, and do it better than Nvidia’s general-purpose GPUs.

The Inference Problem Is an Economics Problem

Let me put this in concrete terms. Running inference on an Nvidia H100 for a large language model costs roughly $4.20 per million tokens. Nvidia’s Blackwell B200 brought that down to about $0.12 per million tokens — a 35x improvement. But even at $0.12, the math gets brutal when you’re serving millions of requests per day.

For context: a single query to ChatGPT might use 500-2,000 tokens. Multiply that across millions of users, and you’re talking tens of thousands of dollars per day in pure compute costs. For AI startups running on thin margins, this is existential.
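To make that arithmetic concrete, here is a minimal sketch of the daily cost calculation. The two per-million-token prices are the figures quoted above; the traffic numbers (100 million queries per day at 1,000 tokens each) are hypothetical values chosen for illustration, not data from any provider.

```python
def daily_inference_cost(queries_per_day, tokens_per_query, usd_per_million_tokens):
    """Return the daily compute cost in USD for a given traffic level."""
    total_tokens = queries_per_day * tokens_per_query
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Per-million-token prices quoted in this article.
H100_PRICE = 4.20
B200_PRICE = 0.12

# Hypothetical traffic, assumed for illustration only.
QUERIES_PER_DAY = 100_000_000
TOKENS_PER_QUERY = 1_000

for name, price in [("H100", H100_PRICE), ("B200", B200_PRICE)]:
    cost = daily_inference_cost(QUERIES_PER_DAY, TOKENS_PER_QUERY, price)
    print(f"{name}: ${cost:,.0f} per day")
```

Under these assumed volumes, the bill works out to about $420,000 per day at the H100 price and about $12,000 per day at the B200 price. The cost scales linearly with both traffic and tokens per query, which is why per-token price dominates the economics.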

The problem with GPUs is that they’re *general-purpose* processors. They can do matrix multiplication, convolutions, attention mechanisms, and a hundred other operations. That flexibility is great for training, where you need to run different operations at different stages. But for inference — where you’re running the same transformer architecture over and over — that flexibility comes with overhead. You’re paying for compute units you don’t need, memory bandwidth you can’t fully use, and power you’re wasting on operations that aren’t relevant.

This is where application-specific integrated circuits, or ASICs, enter the picture. An ASIC is a chip designed for exactly one task. No wasted transistors, no unused compute units. Every square millimeter of silicon is optimized for the specific operation you need.

What Etched Built: The Sohu Chip

Etched’s chip is called Sohu. It’s manufactured on TSMC’s 4nm process, packs 144GB of HBM3E memory, and achieves 1.8x the memory bandwidth of an H100. The key innovation: transformer attention is hardwired directly into the silicon.

Instead of running attention as software on a programmable compute unit, Sohu implements it as fixed-function logic. The chip can’t run convolutions, diffusion models, or anything that isn’t a transformer. But for pure transformer inference — which covers the vast majority of today’s AI workloads — it claims to be an order of magnitude faster than Nvidia’s next-generation Blackwell GPUs.
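For readers who have not seen it written out, the operation in question is small. The sketch below is the textbook scaled dot-product attention for a single head, in plain Python. It illustrates the computation a GPU runs as software; it is not Etched’s implementation, and the function names and toy inputs are my own.

```python
import math


def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(queries[0])
    output = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return output


# Toy inputs: one query that matches the first key more closely.
result = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print(result)
```

The whole thing is a fixed pattern of multiply, sum, exponentiate, and normalize. Because that pattern never changes from one transformer to the next, it is a plausible candidate for fixed-function logic.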

By Etched’s account, one Sohu server replaces 160 H100 GPUs. That’s not a typo.

The numbers are striking, but they come with important caveats. The performance claims are based on specific workloads — autoregressive language model inference. If you need to run a mixture-of-experts model with expert routing, or a diffusion model for image generation, Sohu can’t help you. It’s a single-purpose tool, and its value depends entirely on how dominant transformer architecture remains.

Right now, that dominance is near-total. GPT, Claude, Gemini, Llama, Mistral — they’re all transformers. The question is whether this holds for the next five years, or whether new architectures like state space models or hybrid approaches emerge to challenge the transformer’s throne.

The Funding Frenzy Tells You Everything

Here’s what makes Etched’s story fascinating beyond the technical specs: the fundraising arc.

In 2023, Gavin Uberti and Robert Wachen — two Harvard dropouts who became Thiel Fellows — pitched every major investor with a 30-page memo arguing that AI would eventually need specialized inference chips. They got rejected everywhere. The company was operating month-to-month, close to running out of cash.

By late 2025, they closed a $500 million round at a $5 billion post-money valuation. Total funding reached $800 million. Their angel investors read like a who’s who of AI: Andrej Karpathy, Geoffrey Hinton, Fei-Fei Li, Arthur Mensch (Mistral’s CEO), and Scott Wu (Cognition AI). Stanley Druckenmiller and Peter Thiel wrote personal checks.

What changed between 2023 and 2025? The inference bottleneck became real. As AI models went from research projects to consumer products serving millions of daily users, the cost of inference became the single biggest constraint on AI adoption. Companies weren’t just building bigger models — they were struggling to serve the ones they had.

This mirrors a pattern I’ve seen before in tech: the infrastructure layer gets crowded with general-purpose solutions, then specialized players emerge to solve the specific bottleneck that’s holding everyone back. It happened with networking (generic switches → smart NICs), storage (general disks → NVMe SSDs), and now it’s happening with AI compute.

The Competitive Landscape Is Brutal

Etched isn’t alone. The inference chip market has become one of the most contested spaces in tech.

Nvidia, naturally, is the 800-pound gorilla. Their Blackwell platform delivers $0.12 per million tokens — the lowest in the industry. They have the software ecosystem (CUDA, TensorRT-LLM, Dynamo), the manufacturing scale, and the customer lock-in. Challenging Nvidia on inference is like challenging Google on search: possible in theory, terrifying in practice.

But the challengers keep coming. Cerebras went public in 2026 with its wafer-scale chip. Groq raised $650 million before Nvidia acquired it for $20 billion in December 2025, an acquisition that was essentially Nvidia admitting that inference-specialized hardware is the future. Taalas, a Canadian startup, launched a hard-coded inference chip in February 2026 claiming 16,960 tokens per second per user on Llama 3.1 8B, roughly 48x faster than a B200.

And then there are the hyperscalers. Amazon has Trainium and Inferentia, part of the trend of companies building their own silicon. Google has TPUs. Microsoft and Meta are building custom chips in-house. OpenAI just unveiled its first custom silicon, built by Broadcom. Everyone is trying to reduce their dependence on Nvidia, and inference is where the real edge sits.

What makes Etched’s bet interesting is the purity of their approach. Cerebras builds massive chips that try to do everything faster. Groq uses a dataflow architecture with on-chip SRAM. Etched says: *we only do transformers, and we do them better than anyone else*. It’s a narrower bet, but the payoff is proportionally larger if transformers remain dominant.

What This Means for Developers and AI Companies

If Etched’s claims hold up (and that’s still a big “if”), the implications are significant.

For AI startups, cheaper inference means the difference between a viable business and a money pit. If you’re building an AI coding assistant, a customer service bot, or a content generation tool, your biggest expense is probably API calls. A chip that delivers 10x better performance per dollar on inference changes the economics of your entire business model.
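A rough sketch of what that shift looks like for a subscription product follows. Every number here is hypothetical: a $20 monthly plan, 1.5 million tokens per user per month, and a $10 per million token API price, compared against the same price cut by 10x.

```python
def gross_margin(price_per_user, tokens_per_user, usd_per_million_tokens):
    """Fraction of revenue left after paying for inference."""
    inference_cost = tokens_per_user / 1_000_000 * usd_per_million_tokens
    return (price_per_user - inference_cost) / price_per_user


# All values below are assumptions for illustration, not real pricing.
SUBSCRIPTION_USD = 20.0
TOKENS_PER_USER = 1_500_000
API_PRICE_TODAY = 10.0                    # USD per million tokens
API_PRICE_10X_CHEAPER = API_PRICE_TODAY / 10

for label, price in [("today", API_PRICE_TODAY), ("10x cheaper", API_PRICE_10X_CHEAPER)]:
    margin = gross_margin(SUBSCRIPTION_USD, TOKENS_PER_USER, price)
    print(f"{label}: {margin:.1%} gross margin")
```

With these assumed inputs, the margin moves from 25% to 92.5%. The same product goes from barely viable to comfortably profitable without changing anything but the inference price.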

For cloud providers, specialized inference chips create competitive pressure to offer lower prices. When Etched’s frontier inference clusters hit the market, AWS, Azure, and GCP will need to respond, either by adopting Etched’s hardware or by accelerating their own custom chip programs.

For developers building with AI — whether you are running AI models locally on your own hardware or calling APIs — cheaper inference means you can run more complex models, process more tokens, and build more ambitious applications without the cost ceiling stopping you. It’s the kind of infrastructure improvement that enables the next wave of AI applications — the ones that are too expensive to build today.

The chess analogy feels apt here. In chess, you don’t win by having the most powerful pieces; you win by positioning them strategically. Etched’s bet isn’t that they can build a better GPU. It’s that they can build a better *tool* for the specific problem that’s holding AI back right now. That’s a strategic move, not a brute-force one.

The Risk: Betting on a Single Architecture

Every ASIC carries the same risk: obsolescence if the workload changes. If a new architecture emerges that dethrones transformers (and there are candidates, like state space models and hybrid architectures), Etched’s chip becomes a very expensive paperweight. The geopolitical angle matters too, as my piece on how Asia is responding to AI chip restrictions showed.

This is the trade-off inherent in specialization. GPUs survived for decades because they were flexible enough to adapt to new workloads. ASICs are faster for today’s problem but brittle in the face of change.

Etched would argue that transformers are here to stay, at least for the foreseeable future. And they’re probably right: the entire AI ecosystem is built on transformers, and switching costs are enormous. But “probably right” isn’t the same as “certainly right,” and in chip design, being wrong means billions in sunk costs.

The fact that Nvidia acquired Groq for $20 billion suggests even Nvidia believes the future is specialized inference hardware. That’s the strongest endorsement Etched could ask for: if the king of GPUs is buying inference startups, the thesis is probably right.

Where This Goes

Etched is coming out of stealth with $1 billion in pre-orders and a $5 billion valuation. Their chip is manufactured, in customer testing, and on track to ship. The story is real.

But the story is also just beginning. Manufacturing a chip is one thing. Scaling production to meet a billion dollars in demand is another. Building the software stack — compilers, libraries, debugging tools — that makes a chip actually usable by developers is yet another. And maintaining performance advantages as Nvidia, AMD, and everyone else iterate on their own inference hardware is a race that never ends.

For now, Etched’s success is a signal. The inference bottleneck is real, it’s expensive, and the market is willing to bet billions on solutions. Whether those bets pay off will determine the economics of AI for the next decade.

And that’s a game worth watching closely.

—

*Sources: TechCrunch (June 30, 2026), Jon Peddie Research, Spheron Network, Nvidia inference benchmarks, TrendForce*