Same Claude Code, two backends, every task graded by a real test suite. We looked hard for a quality difference on medium-to-hard coding work -- and report what we actually found. The full writeup lives on the typed blog.
We ran the same Claude Code against two backends -- Anthropic's flagship Opus 4.8 (direct) and typed -- on a battery of medium-to-hard coding tasks: real SWE-bench Verified bugs, plus algorithm, performance, and edge-case tasks in both Python and TypeScript. Every task is graded by running its real test suite (pytest / vitest) -- no model judges another model.
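To make "graded by running its real test suite" concrete, here is a minimal sketch of what such a grader could look like. It is not the harness used for this benchmark: the names `grade_task` and `score`, the timeout value, and the working-directory layout are all assumptions for illustration. The one fact it relies on is that pytest and vitest both exit with status 0 only when every test passes.

```python
import subprocess
import sys
from pathlib import Path


def grade_task(test_cmd: list[str], workdir: Path, timeout_s: int = 600) -> bool:
    """Return True only if the task's test command exits with status 0.

    Hypothetical helper: the pass/fail signal is the test runner's own
    exit code, so no model is involved in judging the result.
    """
    try:
        result = subprocess.run(
            test_cmd,
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A suite that never finishes is treated as a failure (assumed policy).
        return False
    return result.returncode == 0


def score(results: dict[str, bool]) -> str:
    """Format a set of per-task outcomes as 'passed/total'."""
    return f"{sum(results.values())}/{len(results)}"
```

For a Python task the command might be `[sys.executable, "-m", "pytest", "-q"]`, and for a TypeScript task `["npx", "vitest", "run"]`. Running the same grader against the output of each backend keeps the comparison symmetric.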
They tied, each passing 14/14. Then we built a harder round specifically to separate them -- performance gates, backtracking, fiddly text-justification, insight-dependent algorithms -- and they tied on those too. They also tied at 12/12 on an insight-task ceiling probe, run three times on each side. On objective, test-graded coding tasks, you can't tell the two backends apart by their results. Both are excellent; the difference is price and how you get billed.
The full writeup -- the per-round results table, the methodology, the speed comparison, and the honest caveats -- is on the typed blog:
Read the full benchmark on typed.cloud →
Or try it yourself: point your existing Claude Code at typed and run a real session against your own codebase. Swap the environment variables back any time.
Published by Yaw Labs.