We ran the same Claude Code against two backends -- Anthropic's flagship Opus 4.8 (direct) and typed -- on a battery of medium-to-hard coding tasks: real SWE-bench Verified bugs, plus algorithm, performance, and edge-case tasks in both Python and TypeScript. Every task is graded by running its real test suite (pytest / vitest) -- no model judges another model.
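
To make "graded by running its real test suite" concrete, here is a minimal sketch of an exit-code-based grader. It is not the harness used for this benchmark; the names `grade_task`, `GradeResult`, and `score`, and the timeout value, are assumptions made for illustration. It relies only on the fact that pytest and `vitest run` exit with status 0 when every test passes.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class GradeResult:
    """Outcome of one task (hypothetical type, for illustration only)."""
    passed: bool
    exit_code: int


def grade_task(
    test_command: Sequence[str],
    workdir: str = ".",
    timeout_s: float = 600.0,  # assumed limit, not taken from the benchmark
) -> GradeResult:
    """Run a task's test suite and grade it on the exit code alone.

    No model output is inspected: the task passes only if the test
    runner itself exits with status 0.
    """
    try:
        proc = subprocess.run(
            list(test_command),
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A hung or too-slow solution counts as a failure.
        return GradeResult(passed=False, exit_code=-1)
    return GradeResult(passed=proc.returncode == 0, exit_code=proc.returncode)


def score(results: Iterable[GradeResult]) -> str:
    """Summarize results in the 'passed/total' form used in the text."""
    results = list(results)
    passed = sum(1 for r in results if r.passed)
    return f"{passed}/{len(results)}"
```

In use, a Python task might be graded with `grade_task([sys.executable, "-m", "pytest", "-q"], workdir=task_dir)` and a TypeScript task with `grade_task(["npx", "vitest", "run"], workdir=task_dir)`, where `task_dir` is a hypothetical checkout of the task's repository.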

They tied, each scoring 14/14. Then we built a harder round specifically to separate them -- performance gates, backtracking, fiddly text justification, insight-dependent algorithms -- and they tied there too, as they did on a 12/12 insight-task ceiling probe run three times on each side. On objective, test-graded coding tasks, the results alone can't tell them apart. Both are excellent; the difference is price and how you get billed.

The full writeup -- the per-round results table, the methodology, the speed comparison, and the honest caveats -- is on the typed blog:

Read the full benchmark on typed.cloud →

Or try it yourself: point your existing Claude Code at typed and run a real session against your own codebase. Swap the variables back any time.

Published by Yaw Labs.
