We Tested MCP Servers Against the 2025-11-25 Spec - Here Is What We Found

There are now hundreds of MCP servers on npm and PyPI, and almost none of them have been tested against the actual specification. So we built @yawlabs/mcp-compliance - an 88-test suite that grades any MCP server (HTTP or stdio) against the 2025-11-25 spec in under 30 seconds.

npx -y @yawlabs/mcp-compliance test <url-or-stdio-command>

It is open source and free. Here is what we found running it across the reference implementations and a chunk of the public ecosystem.

The test suite

88 tests across 8 categories:

| Category | Tests | What it covers |
| --- | --- | --- |
| Transport | 16 | HTTP POST, content types (JSON/SSE), `202 Accepted` on notifications, streaming |
| Lifecycle | 21 | `initialize`, protocol version, capabilities, `MCP-Session-Id`, ping, cancellation, progress |
| Tools | 4 | `tools/list`, schema shape, invocation, unknown-tool handling |
| Resources | 5 | `resources/list`, reading, templates, URI validation, subscribe |
| Prompts | 3 | `prompts/list`, `prompts/get`, pagination |
| Errors | 10 | JSON-RPC error codes, malformed input, missing params, unknown methods |
| Schema | 6 | Tool names, `inputSchema` shape, resource URIs, prompt argument names |
| Security | 23 | Auth & transport, input validation, tool-description injection, information disclosure, SSRF |

Required tests are 70% of the score, optional 30%. Letter grades: A (90+), B (75+), C (60+), D (40+), F (<40). Capability-gated tests (tools, resources, prompts, subscribe, logging, completions) only run if the server declares the capability - no false failures for features the server never claimed.
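To make the weighting concrete, here is a minimal sketch of how a 70/30 score and the letter thresholds above could be computed. The function names, the use of pass fractions, and the treatment of an empty group as fully passed are assumptions for illustration; the suite's actual scoring code may differ.

```python
def score(required_passed, required_total, optional_passed, optional_total):
    """Weighted percentage: required tests count 70%, optional tests 30%."""
    # Assumption: a group with no applicable tests (e.g. every optional test
    # was capability-gated out) counts as fully passed.
    req = required_passed / required_total if required_total else 1.0
    opt = optional_passed / optional_total if optional_total else 1.0
    return round(70 * req + 30 * opt, 1)


def letter_grade(pct):
    """Map a percentage to the letter thresholds listed above."""
    for threshold, letter in ((90, "A"), (75, "B"), (60, "C"), (40, "D")):
        if pct >= threshold:
            return letter
    return "F"
```

Under these assumptions, a server that passes every required test but only half of the optional ones lands at roughly 85, a B.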

The reference servers are clean

We pointed the suite at every @modelcontextprotocol/server-* package on npm. All five grade A (98–100%), zero required-test failures. The TypeScript SDK does its job: transport, lifecycle, and tool-schema basics work out of the box.

The interesting question is what happens once you leave the reference set.

What most servers get wrong

1. Error handling is the biggest gap

The spec says servers return JSON-RPC errors with proper codes for unknown methods and malformed requests. In practice, this is where most non-reference servers fall apart:

HTTP 200 with no body for unknown methods, instead of -32601 Method not found
HTTP 500 with a stack trace, instead of a structured JSON-RPC error
Process crashes on malformed JSON-RPC input
tools/call with a nonexistent tool name returns a success envelope, so the client cannot distinguish “tool not found” from “call succeeded with empty result”

The SDKs could close most of this by returning -32601 for any unrecognized method by default, instead of leaving it to each server author.
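As a rough illustration of the default behavior described above, here is a minimal dispatcher sketch that never returns an empty body or a stack trace. The `METHODS` and `TOOLS` registries, the `UnknownTool` exception, and the `echo` tool are hypothetical; the error codes are the standard JSON-RPC 2.0 ones. Using -32602 for an unknown tool name is an assumption here, and notifications (messages without an `id`) are deliberately left out to keep the sketch short.

```python
import json

# Standard JSON-RPC 2.0 error codes.
PARSE_ERROR, INVALID_REQUEST = -32700, -32600
METHOD_NOT_FOUND, INVALID_PARAMS = -32601, -32602


class UnknownTool(Exception):
    """Raised when tools/call names a tool that is not registered."""


def error(req_id, code, message):
    return {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}


def call_tool(params):
    name = params.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        raise UnknownTool(f"Unknown tool: {name}")
    return TOOLS[name](params.get("arguments") or {})


# Hypothetical registries, for illustration only.
TOOLS = {"echo": lambda args: {"content": [{"type": "text", "text": str(args.get("text", ""))}]}}
METHODS = {"ping": lambda params: {}, "tools/call": call_tool}


def handle(raw):
    """Turn one raw request into a JSON-RPC response dict (requests only)."""
    try:
        msg = json.loads(raw)
    except (ValueError, TypeError):
        return error(None, PARSE_ERROR, "Parse error")
    if (not isinstance(msg, dict) or msg.get("jsonrpc") != "2.0"
            or not isinstance(msg.get("method"), str)):
        return error(None, INVALID_REQUEST, "Invalid Request")
    req_id = msg.get("id")
    handler = METHODS.get(msg["method"])
    if handler is None:
        # The default case: any unrecognized method gets -32601.
        return error(req_id, METHOD_NOT_FOUND, "Method not found")
    params = msg.get("params", {})
    if not isinstance(params, dict):
        return error(req_id, INVALID_PARAMS, "Invalid params")
    try:
        result = handler(params)
    except UnknownTool as exc:
        # An error, not a success envelope, so clients can tell the two apart.
        return error(req_id, INVALID_PARAMS, str(exc))
    return {"jsonrpc": "2.0", "id": req_id, "result": result}
```

The design point is that the fallthrough branches live in the dispatcher, not in each server author's handlers, so a new server gets correct errors without writing any error code.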

2. Session affinity breaks behind load balancers

The spec defines MCP-Session-Id for session tracking but does not say what happens when a proxy or load balancer sits between client and server. Every stateful MCP deployment needs either sticky routing or session state that every instance can reach, and every operator is figuring it out independently.

A shared key-value store keyed by MCP-Session-Id with a short TTL works fine. The point is that it should be documented once, not reinvented by every team running MCP servers in production.
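For reference, the store can be very small. The sketch below is an in-memory stand-in that only shows the shape of the interface; in production the dictionary would be replaced by a shared store such as Redis so that any instance behind the load balancer can look a session up. The class name, the 300-second default TTL, and the sliding expiry on read are assumptions.

```python
import time


class SessionStore:
    """In-memory stand-in for a shared key-value store keyed by MCP-Session-Id."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._data = {}              # session_id -> (expires_at, state)

    def put(self, session_id, state):
        self._data[session_id] = (self._clock() + self._ttl, state)

    def get(self, session_id):
        """Return the session state, or None if unknown or expired."""
        entry = self._data.get(session_id)
        if entry is None:
            return None
        expires_at, state = entry
        if self._clock() >= expires_at:
            del self._data[session_id]
            return None
        # Sliding expiry (an assumption): each read renews the TTL.
        self._data[session_id] = (self._clock() + self._ttl, state)
        return state
```

A request handler would read the `MCP-Session-Id` header, call `get`, and reject the request when it returns `None` so the client knows to re-initialize.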

3. SSE streaming through proxies is fragile

The Streamable HTTP transport works in direct connections. Put a reverse proxy in the middle and:

nginx buffers SSE events by default - you need `proxy_buffering off;` or events arrive in batches.
Caddy requires `flush_interval -1` to stream properly.
Cloudflare buffers by default and needs explicit configuration to pass SSE through.
No proxy handles backpressure correctly without explicit configuration. Slow client + fast server = unbounded memory growth in the proxy.

A heartbeat every ~15 seconds keeps connections alive through intermediate infrastructure. The spec mentions heartbeats but does not recommend a frequency.
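On the server side, a heartbeat can be as simple as an SSE comment line sent whenever the stream has been idle for the interval. The sketch below assumes outgoing messages arrive on an `asyncio.Queue` and that `None` is used as a close sentinel; both are illustrative choices, not anything the spec or an SDK prescribes.

```python
import asyncio
import json

# An SSE comment line: starts with ":" and is ignored by EventSource clients.
HEARTBEAT = ": keepalive\n\n"


def format_event(payload):
    """Serialize one JSON-RPC message as an SSE frame."""
    return f"event: message\ndata: {json.dumps(payload)}\n\n"


async def sse_stream(queue, interval=15.0):
    """Yield SSE frames from `queue`, with a heartbeat whenever it stays idle."""
    while True:
        try:
            item = await asyncio.wait_for(queue.get(), timeout=interval)
        except asyncio.TimeoutError:
            yield HEARTBEAT      # nothing to send for `interval` seconds
            continue
        if item is None:         # assumed sentinel: close the stream
            return
        yield format_event(item)
```

Because the heartbeat is a comment rather than an event, it keeps bytes flowing through intermediaries without the client ever seeing a message it has to handle.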

4. Security is the long tail

The 23 security tests cover four areas: auth & transport, input validation, tool integrity, and information disclosure. Recurring failure modes:

Token echo - auth tokens returned in error messages or logged at info level by default.
Unbounded input - tools that accept arbitrarily large string parameters. Trivial DoS vector; multi-megabyte payloads accepted without rejection.
Tool-description injection - tool metadata is not sanitized before it is serialized into the LLM's context. Malicious descriptions can hijack prompts. Subtle and under-discussed.
Stack-trace leaks - Node and Python error responses that ship file paths, framework versions, and sometimes environment details straight to the client.
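Three of these (token echo, unbounded input, stack-trace leaks) have cheap mitigations. The sketch below is one way to approach them; the 64 KiB limit, the secret-matching regex, the logger name, and the helper names are all assumptions. A real deployment would tune the pattern to its own token formats.

```python
import logging
import re
import uuid

MAX_PARAM_BYTES = 64 * 1024  # assumed limit; pick one that fits your tools

# Illustrative pattern only: bearer tokens and token=/api_key= style values.
_SECRET = re.compile(r"(bearer\s+|token=|api[_-]?key=)[^\s\"'&]+", re.IGNORECASE)


def redact(text):
    """Replace anything that looks like a credential before logging it."""
    return _SECRET.sub(lambda m: m.group(1) + "[REDACTED]", text)


def check_size(name, value, limit=MAX_PARAM_BYTES):
    """Reject oversized string parameters instead of processing them."""
    size = len(value.encode("utf-8"))
    if size > limit:
        raise ValueError(f"parameter {name!r} is {size} bytes, limit is {limit}")
    return value


def client_error(exc, log=logging.getLogger("mcp")):
    """Log the redacted detail server-side; send the client only a reference."""
    ref = uuid.uuid4().hex[:8]
    log.error("ref=%s %s", ref, redact(repr(exc)))
    return {"code": -32603, "message": f"Internal error (ref {ref})"}
```

The client gets a stable JSON-RPC error code (-32603, Internal error) and a reference it can quote in a bug report, while paths, versions, and tokens stay in the server log.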

5. SSRF is a real risk anywhere a URL is user-supplied

Any tool or platform that accepts a user-controlled URL (the fetch pattern, server routers, gateways, mcp.hosting itself) needs:

Private IP blocking (RFC 1918, loopback, link-local, cloud metadata endpoints like 169.254.169.254)
DNS resolution checks on every request, not just at configuration time
Fail-closed behavior when DNS resolution fails

Checking only at configuration time leaves a DNS-rebinding window: an attacker points their domain at a public IP during validation, then swaps to a metadata IP before the first real request. The spec does not cover this yet - it probably should.
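The three requirements above fit in a short per-request check. This is a sketch using only the Python standard library; `check_url` and the injectable `resolver` parameter are hypothetical names, and relying on `is_global` to cover RFC 1918, loopback, link-local, and metadata addresses is an assumption you should verify against your own blocklist.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def resolve(host):
    """Resolve a hostname to every address it currently maps to."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def is_public(ip_text):
    ip = ipaddress.ip_address(ip_text)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it is judged as IPv4.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped:
        ip = ip.ipv4_mapped
    return ip.is_global


def check_url(url, resolver=resolve):
    """Return the vetted addresses or raise ValueError. Run on every request."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("blocked: unsupported URL")
    try:
        addresses = resolver(parts.hostname)
    except OSError as exc:
        # Fail closed: no resolution means no request.
        raise ValueError("blocked: DNS resolution failed") from exc
    # Every resolved address must be public, not just the first one.
    if not addresses or not all(is_public(a) for a in addresses):
        raise ValueError("blocked: non-public address")
    return addresses
```

To actually close the rebinding window, the outbound connection should go to one of the returned addresses. If the HTTP client resolves the hostname again on its own, the attacker gets a second chance to swap the record.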

What we would suggest for the spec

Document a recommended SSE heartbeat interval (15 seconds works in practice).
Add an informational section on proxy/load balancer configuration - header forwarding, buffering, timeouts, backpressure.
Recommend that SDKs return proper JSON-RPC errors for unknown methods by default.
Standardize an optional discovery endpoint (/.well-known/mcp) so clients can check server metadata without a full initialize handshake.
Add security guidance - token handling in errors, input size limits, tool-description sanitization, SSRF protection.

We have draft spec PRs for several of these in docs/spec-prs/.

Try it on your server

One command, no signup, ~30 seconds:

npx -y @yawlabs/mcp-compliance test https://your-server-url
# or stdio
npx -y @yawlabs/mcp-compliance test npx -y @your-org/your-mcp-server

To publish a result and get a README badge:

npx -y @yawlabs/mcp-compliance badge https://your-server-url

That prints a markdown badge snippet you can drop into your README.

If you also want to stop hand-editing MCP JSON configs across Claude Code, Cursor, and Claude Desktop - that is what mcp.hosting does with @yawlabs/mcph. Different post.

Jeff Yaw, Yaw Labs. Follow along at tokenlimit.news for weekly notes on AI infrastructure.