There are now hundreds of MCP servers on npm and PyPI, and almost none of them have been tested against the actual specification. So we built @yawlabs/mcp-compliance - an 88-test suite that grades any MCP server (HTTP or stdio) against the 2025-11-25 spec in under 30 seconds.
npx -y @yawlabs/mcp-compliance test <url-or-stdio-command>
It is open source and free. Here is what we found running it across the reference implementations and a chunk of the public ecosystem.
The test suite
88 tests across 8 categories:
| Category | Tests | What it covers |
|---|---|---|
| Transport | 16 | HTTP POST, content types (JSON/SSE), 202 Accepted on notifications, streaming |
| Lifecycle | 21 | initialize, protocol version, capabilities, MCP-Session-Id, ping, cancellation, progress |
| Tools | 4 | tools/list, schema shape, invocation, unknown-tool handling |
| Resources | 5 | resources/list, reading, templates, URI validation, subscribe |
| Prompts | 3 | prompts/list, prompts/get, pagination |
| Errors | 10 | JSON-RPC error codes, malformed input, missing params, unknown methods |
| Schema | 6 | Tool names, inputSchema shape, resource URIs, prompt argument names |
| Security | 23 | Auth & transport, input validation, tool-description injection, information disclosure, SSRF |
Required tests are 70% of the score, optional 30%. Letter grades: A (90+), B (75+), C (60+), D (40+), F (<40). Capability-gated tests (tools, resources, prompts, subscribe, logging, completions) only run if the server declares the capability - no false failures for features the server never claimed.
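To make the weighting concrete, here is a minimal sketch of how a 70/30 score and the letter cutoffs above could be computed. The exact formula is an assumption (pass ratios combined linearly), and `Tally`, `score`, and `letter_grade` are illustrative names, not the suite's internals.

```python
from dataclasses import dataclass

# Grade cutoffs as stated above: A (90+), B (75+), C (60+), D (40+), F (<40).
GRADE_CUTOFFS = [(90, "A"), (75, "B"), (60, "C"), (40, "D")]


@dataclass
class Tally:
    passed: int
    total: int

    def ratio(self) -> float:
        # Assumption: an empty bucket (for example, every optional test was
        # capability-gated out) counts as fully passed.
        return 1.0 if self.total == 0 else self.passed / self.total


def score(required: Tally, optional: Tally) -> float:
    """Weighted score on a 0-100 scale: 70% required, 30% optional."""
    return 70.0 * required.ratio() + 30.0 * optional.ratio()


def letter_grade(value: float) -> str:
    """Map a 0-100 score to a letter using the cutoffs above."""
    for cutoff, letter in GRADE_CUTOFFS:
        if value >= cutoff:
            return letter
    return "F"
```

Under this reading, a server that passes every required test but only half of its optional tests scores 85 and lands on a B.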
The reference servers are clean
We pointed the suite at every @modelcontextprotocol/server-* package on npm. All five grade A (98–100%), zero required-test failures. The TypeScript SDK does its job: transport, lifecycle, and tool-schema basics work out of the box.
The interesting question is what happens once you leave the reference set.
What most servers get wrong
1. Error handling is the biggest gap
The spec says servers return JSON-RPC errors with proper codes for unknown methods and malformed requests. In practice, this is where most non-reference servers fall apart:
- HTTP 200 with no body for unknown methods, instead of -32601 Method not found
- HTTP 500 with a stack trace, instead of a structured JSON-RPC error
- Process crashes on malformed JSON-RPC input
- tools/call with a nonexistent tool name returns a success envelope, so the client cannot distinguish “tool not found” from “call succeeded with empty result”
The SDKs could close most of this by returning -32601 for any unrecognized method by default, instead of leaving it to each server author.
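As a rough illustration of that default, here is a transport-agnostic dispatcher sketch. The error codes are the standard JSON-RPC 2.0 ones; `dispatch`, `error_response`, and the `handlers` mapping are hypothetical names and do not reflect how any particular MCP SDK is structured.

```python
import json
from typing import Any, Callable, Dict, Optional

# Standard JSON-RPC 2.0 error codes.
PARSE_ERROR = -32700
INVALID_REQUEST = -32600
METHOD_NOT_FOUND = -32601
INTERNAL_ERROR = -32603

Handler = Callable[[Any], Any]


def error_response(req_id: Any, code: int, message: str) -> Dict[str, Any]:
    return {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}


def dispatch(raw: str, handlers: Dict[str, Handler]) -> Optional[Dict[str, Any]]:
    """Handle one JSON-RPC message; return a response, or None for notifications."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed JSON must not crash the process.
        return error_response(None, PARSE_ERROR, "Parse error")

    if (
        not isinstance(msg, dict)
        or msg.get("jsonrpc") != "2.0"
        or not isinstance(msg.get("method"), str)
    ):
        req_id = msg.get("id") if isinstance(msg, dict) else None
        return error_response(req_id, INVALID_REQUEST, "Invalid Request")

    is_notification = "id" not in msg
    handler = handlers.get(msg["method"])
    if handler is None:
        # Default for anything unrecognized: -32601, never an empty 200.
        if is_notification:
            return None
        return error_response(msg["id"], METHOD_NOT_FOUND, "Method not found")

    try:
        result = handler(msg.get("params") or {})
    except Exception:
        # Generic message only: no stack trace or file paths reach the client.
        if is_notification:
            return None
        return error_response(msg["id"], INTERNAL_ERROR, "Internal error")
    return None if is_notification else {"jsonrpc": "2.0", "id": msg["id"], "result": result}
```

The useful property is that the unknown-method and crash paths are handled once, in the dispatcher, so individual server authors cannot forget them.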
2. Session affinity breaks behind load balancers
The spec defines MCP-Session-Id for session tracking but does not say what happens when a proxy or load balancer sits between client and server. Every stateful MCP deployment behind one needs either sticky routing or shared session state, and every operator is figuring it out independently.
A shared key-value store keyed by MCP-Session-Id with a short TTL works fine. The point is that it should be documented once, not reinvented by every team running MCP servers in production.
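A minimal sketch of that pattern follows. The in-memory dictionary stands in for whatever shared store a deployment actually uses (every replica would need to talk to the same external store for this to help behind a load balancer), and `SessionStore`, `state_for_request`, and the sliding-expiry behavior are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class SessionStore:
    """In-memory stand-in for a shared key-value store with per-key TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[float, Dict[str, Any]]] = {}

    def save(self, session_id: str, state: Dict[str, Any]) -> None:
        self._data[session_id] = (self._clock() + self._ttl, state)

    def load(self, session_id: str) -> Optional[Dict[str, Any]]:
        entry = self._data.get(session_id)
        if entry is None:
            return None
        expires_at, state = entry
        if self._clock() >= expires_at:
            del self._data[session_id]
            return None
        # Sliding expiry: each access refreshes the TTL.
        self._data[session_id] = (self._clock() + self._ttl, state)
        return state


def state_for_request(headers: Dict[str, str], store: SessionStore) -> Optional[Dict[str, Any]]:
    """Look up session state by the MCP-Session-Id header (case-insensitive)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    session_id = lowered.get("mcp-session-id")
    return store.load(session_id) if session_id else None
```

With state looked up per request like this, any replica can serve any request, and the load balancer no longer needs to pin a client to one backend.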
3. SSE streaming through proxies is fragile
The Streamable HTTP transport works in direct connections. Put a reverse proxy in the middle and:
- nginx buffers SSE events by default - you need proxy_buffering off; or events arrive in batches.
- Caddy requires flush_interval -1 to stream properly.
- Cloudflare buffers by default and needs explicit configuration to pass SSE through.
- No proxy handles backpressure correctly without explicit implementation. Slow client + fast server = unbounded memory growth in the proxy.
A heartbeat every ~15 seconds keeps connections alive through intermediate infrastructure. The spec mentions heartbeats but does not recommend a frequency.
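One way to implement such a heartbeat is to emit an SSE comment line whenever the stream has been idle for the interval. The sketch below assumes outgoing messages arrive on an `asyncio.Queue` and that `None` marks the end of the stream; `sse_stream` and `format_event` are hypothetical helpers, not part of any MCP SDK.

```python
import asyncio
from typing import AsyncIterator, Optional

# An SSE line starting with ":" is a comment; clients ignore it, but it
# still counts as traffic for proxies and idle timeouts.
HEARTBEAT = ": heartbeat\n\n"


def format_event(data: str) -> str:
    """Encode one payload as an SSE event (one data: line per payload line)."""
    return "".join(f"data: {line}\n" for line in data.split("\n")) + "\n"


async def sse_stream(
    queue: "asyncio.Queue[Optional[str]]", interval: float = 15.0
) -> AsyncIterator[str]:
    """Yield SSE frames from the queue, with a heartbeat after each idle interval.

    A None item ends the stream.
    """
    while True:
        try:
            item = await asyncio.wait_for(queue.get(), timeout=interval)
        except asyncio.TimeoutError:
            yield HEARTBEAT
            continue
        if item is None:
            return
        yield format_event(item)
```

A web framework's streaming response can consume this generator directly. The 15-second default mirrors the interval mentioned above and would need tuning against the shortest idle timeout on the path.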
4. Security is the long tail
The 23 security tests cover four areas: auth & transport, input validation, tool integrity, and information disclosure. Recurring failure modes:
- Token echo - auth tokens returned in error messages or logged at info level by default.
- Unbounded input - tools that accept arbitrarily large string parameters. Trivial DoS vector; multi-megabyte payloads accepted without rejection.
- Tool-description injection - tool metadata is not sanitized before it is serialized into the LLM's context. Malicious descriptions can hijack prompts. Subtle and under-discussed.
- Stack-trace leaks - Node and Python error responses that ship file paths, framework versions, and sometimes environment details straight to the client.
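Two of these are cheap to guard against in a few lines. The sketch below redacts credential-looking substrings before a message is logged or returned, and rejects oversized tool arguments. The regex patterns and the 64 KiB limit are illustrative assumptions rather than recommended values, and all names are hypothetical.

```python
import json
import re
from typing import Any, Dict

MAX_ARGUMENT_BYTES = 64 * 1024  # hypothetical limit; tune per tool

# Hypothetical patterns: bearer tokens and key=value style secrets.
_SECRET_PATTERNS = [
    re.compile(r"(?i)(bearer\s+)[A-Za-z0-9._~+/=-]+"),
    re.compile(r"(?i)((?:token|api[_-]?key|secret|password)\s*[=:]\s*)[^\s,;&]+"),
]


def redact(text: str) -> str:
    """Replace anything that looks like a credential before logging or returning it."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub(r"\1[REDACTED]", text)
    return text


class ArgumentsTooLarge(ValueError):
    """Raised when tool arguments exceed the configured size limit."""


def check_argument_size(arguments: Dict[str, Any], limit: int = MAX_ARGUMENT_BYTES) -> None:
    """Reject tool arguments whose JSON encoding exceeds `limit` bytes."""
    size = len(json.dumps(arguments).encode("utf-8"))
    if size > limit:
        raise ArgumentsTooLarge(f"arguments are {size} bytes; limit is {limit}")
```

In a real server the size check belongs as early as possible, ideally on the raw request body before it is parsed, so that an oversized payload is rejected before it is fully decoded.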
5. SSRF is a real risk anywhere a URL is user-supplied
Any tool or platform that accepts a user-controlled URL (the fetch pattern, server routers, gateways, mcp.hosting itself) needs:
- Private IP blocking (RFC 1918, loopback, link-local, cloud metadata endpoints like 169.254.169.254)
- DNS resolution checks on every request, not just at configuration time
- Fail-closed behavior when DNS resolution fails
Checking only at configuration time leaves a DNS-rebinding window: an attacker points their domain at a public IP during validation, then swaps to a metadata IP before the first real request. The spec does not cover this yet - it probably should.
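The three requirements above can be sketched with the standard library alone. This is an illustration under assumptions, not a complete SSRF defense: `assert_url_allowed` and the injectable `resolve` parameter are hypothetical names, and the check leans on `ipaddress`'s `is_global` property to cover private, loopback, link-local, and other reserved ranges.

```python
import ipaddress
import socket
from typing import Callable, List
from urllib.parse import urlsplit

Resolver = Callable[[str], List[str]]


def system_resolver(host: str) -> List[str]:
    """Resolve host to all of its IP addresses using the OS resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def is_public(ip: str) -> bool:
    # is_global is False for RFC 1918, loopback, link-local (including
    # 169.254.169.254), and other reserved ranges.
    return ipaddress.ip_address(ip).is_global


def assert_url_allowed(url: str, resolve: Resolver = system_resolver) -> List[str]:
    """Raise ValueError unless every address the host resolves to is public.

    Call this immediately before each outbound request, not once at
    configuration time.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("only http(s) URLs with a host are allowed")
    try:
        addresses = resolve(parts.hostname)
    except OSError as exc:
        # Fail closed: a resolution failure is a rejection, not a pass.
        raise ValueError("DNS resolution failed") from exc
    if not addresses:
        raise ValueError("host resolved to no addresses")
    for ip in addresses:
        if not is_public(ip):
            raise ValueError(f"{ip} is not a public address")
    return addresses
```

The function returns the validated addresses on purpose: to actually close the rebinding window, the outbound connection should be made to one of those addresses rather than resolving the hostname a second time. Redirects need the same check on every hop.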
What we would suggest for the spec
- Document a recommended SSE heartbeat interval (15 seconds works in practice).
- Add an informational section on proxy/load balancer configuration - header forwarding, buffering, timeouts, backpressure.
- Recommend that SDKs return proper JSON-RPC errors for unknown methods by default.
- Standardize an optional discovery endpoint (/.well-known/mcp) so clients can check server metadata without a full initialize handshake.
- Add security guidance - token handling in errors, input size limits, tool-description sanitization, SSRF protection.
We have draft spec PRs for several of these in docs/spec-prs/.
Try it on your server
One command, no signup, ~30 seconds:
npx -y @yawlabs/mcp-compliance test https://your-server-url
# or stdio
npx -y @yawlabs/mcp-compliance test npx -y @your-org/your-mcp-server
To publish a result and get a README badge:
npx -y @yawlabs/mcp-compliance badge https://your-server-url
That prints a markdown badge snippet you can drop into your README.
If you also want to stop hand-editing MCP JSON configs across Claude Code, Cursor, and Claude Desktop - that is what mcp.hosting does with @yawlabs/mcph. Different post.
Jeff Yaw, Yaw Labs. Follow along at tokenlimit.news for weekly notes on AI infrastructure.