Grading MCP Servers A to F: 88 Tests Against the Spec

Earlier this month, the Model Context Protocol was donated to the Linux Foundation. That moves MCP out of "interesting Anthropic spec" and into the same category as Kubernetes, OpenTelemetry, and Cloud Native Computing Foundation projects - industry-governed, vendor-neutral, here for the long haul.

The protocol is ready for serious infrastructure. The ecosystem isn't quite there yet.

Every week someone publishes a new MCP server on npm. It probably works. It might follow the spec. It might handle lifecycle transitions correctly. It might prevent prompt injection through a malicious resource URI. It might not. You find out by installing it and hoping.

That's not a trust layer. That's a guess.

88 Tests, 8 Categories, 30 Seconds

We built @yawlabs/mcp-compliance to answer the question mechanically. Paste a URL or command, get a letter grade in under a minute, drop a badge in your README.

The suite runs 88 tests grouped into 8 categories, each against the current MCP specification (2025-11-25). Required tests count for more than optional ones. Transport-gated tests only run on the transport your server actually declares. Capability-driven execution skips tests for features your server says it doesn't implement.

Category	Tests	What it checks
Transport	16	stdio framing, HTTP headers, SSE event format, session ID handling, Origin header enforcement
Lifecycle	21	`initialize` / `initialized` sequence, capability negotiation, shutdown behavior, reconnect semantics
Tools	4	`tools/list`, `tools/call`, input validation, list-changed notifications
Resources	5	Resource discovery, URI templates, subscribe/unsubscribe, update notifications
Prompts	3	Prompt listing, argument completion, prompt retrieval
Errors	10	JSON-RPC error codes, malformed request handling, unknown method behavior
Schema	6	JSON Schema validation for tool inputs, resource descriptors, prompt arguments
Security	23	Prompt injection in resources, tool description safety, URI scheme restrictions, authentication header handling

Security is the largest category deliberately. Half the reason a trust layer is necessary is that MCP servers ship tool descriptions and resource content directly into your LLM context - exactly the injection surface you can't afford to get wrong.

If you're shipping MCP servers yourself, the testing patterns behind this suite are the subject of MCP Server Testing -- the harness pattern that makes a probabilistic consumer testable, drawn from MCP in Production.

Try It Now

One npx command, no install:

npx -y @yawlabs/mcp-compliance test https://your-mcp-server.example.com/mcp

The tester opens a connection, runs through the 88 applicable tests (transport-gated, capability-gated), and returns a report:

Grade: B+ (86/100)

Transport      16/16  ✓ PASS
Lifecycle      19/21  ⚠ 2 optional failures
Tools           4/4   ✓ PASS
Resources       5/5   ✓ PASS
Prompts         3/3   ✓ PASS
Errors          9/10  ⚠ 1 required failure
Schema          6/6   ✓ PASS
Security       22/23  ⚠ 1 optional failure

Required failures:
  ✗ errors-04  JSON-RPC error code for unknown method must be -32601
               Received: -32600 (invalid request)

Optional failures:
  ⚠ lifecycle-15  Server did not echo client's protocolVersion
  ⚠ lifecycle-17  initialized notification not acknowledged
  ⚠ security-22   Tool description contains suspicious instruction phrasing

Or run it locally against an npm package or a stdio binary:

npx @yawlabs/mcp-compliance test -- npx -y @yawlabs/tailscale-mcp

Report Your Grade in Your README

If your server grades A or B, surface it. It's a small trust signal that compounds - when an agent-builder is choosing between three MCP servers for the same job, the one with a published compliance grade wins. Drop the grade into your README, link to @yawlabs/mcp-compliance so readers can re-run the suite themselves, and re-run it in CI to catch regressions.

The Methodology Is Open

The testing rubric is published under CC BY 4.0 at github.com/YawLabs/mcp-compliance. Every test has a stable rule ID, documented severity (required vs. optional), scoring weight, and spec reference. The grade thresholds are published. The machine-readable rule catalog (mcp-compliance-rules.json) is part of the repo.

This is deliberate. A trust layer that's opaque is a second black box on top of the first one. If you don't like how we weight security tests vs. lifecycle tests, fork the rubric. If you think a rule is wrong, open an issue. The methodology is part of the public commons, not a competitive moat.

Capability-Driven, Transport-Gated

One thing worth explaining because it trips people up: not every MCP server implements every feature. Some don't do resources. Some are stdio-only and never touch HTTP. Grading a stdio-only server on HTTP-session-ID rules would be unfair.

So the tester asks the server what it supports (via the initialize response's capabilities object) and only runs tests for features the server declares. Transport-specific tests only run on the transport in use. If your server declares it doesn't implement resources, the 5 resource tests are skipped - not failed.

That means two servers can both get an A without implementing the same feature set. The grade reflects "does this server correctly implement what it claims to implement," not "does this server implement everything in the spec."

What This Is For

Three things:

For MCP server authors: it's a CI check. Run @yawlabs/mcp-compliance in your GitHub Actions pipeline, set a minimum grade threshold, fail the build if your server regresses. Same protection your code tests give you, applied to your spec compliance.

For MCP server consumers: it's a signal. An A grade from an open methodology, backed by a real test run, is a stronger trust primitive than a 5-star rating or a "verified" checkmark. You can inspect the report, see what passed and what didn't, and make your own call.

For the ecosystem: it's a coordination mechanism. When every MCP server has a comparable grade from the same rubric, server authors compete on measurable quality instead of marketing claims. That's how TLS interop got good (SSL Labs grades), how accessibility got better (axe/Lighthouse scores), how performance budgets became a thing (Web Vitals).

Where This Connects to Yaw MCP

The team that built the compliance suite also built Yaw MCP - a local-first CLI for people who actually use MCP servers day-to-day. Every server in the public catalog is automatically graded, and the grade is visible inline on every yaw-mcp discover output before you install. Set YAW_MCP_MIN_COMPLIANCE=B and the CLI refuses to activate anything below the bar.

If you use Claude Code, Claude Desktop, Cursor, VS Code, or any MCP client with multiple servers, it's worth a look. One install - npm install -g @yawlabs/mcp && yaw-mcp install <client> - wires the orchestrator into whichever client you use. One config file you control, no account required for personal use.

But you don't need Yaw MCP to use the compliance tester. It's a standalone npm package, free and open source, and the methodology is documented enough that you could build your own tester against the same rubric if you wanted to. The point is the ecosystem needs a grade, not that any one tool needs to be the grader.

GitHub · npm

Published by Yaw Labs.