python-sdk icon indicating copy to clipboard operation
python-sdk copied to clipboard

Mitigate prompt-injection risks from tool/resource content (sanitization & trust metadata)

Open dgenio opened this issue 1 month ago • 0 comments

Description

Summary

Tools and resources can currently return arbitrary content (text, JSON, etc.) that is forwarded directly to clients and often to LLMs. There is no built-in way to:

  • sanitize content, or
  • indicate its trust level.

This leaves room for prompt-injection attacks and other malicious payloads, especially when MCP servers interact with untrusted files or external APIs.

Problem

Common scenarios:

  • A resource reads from file:// or another untrusted source; the file can contain prompt-injection content such as "Ignore all previous instructions and …".
  • A tool calls an external API; the HTTP response body is forwarded directly to the LLM.
  • Content is serialized in a way that might encode protocol-like structures or control sequences the client is not expecting.

The MCP SDK does not currently:

  • offer a built-in ContentSanitizer or similar abstraction, or
  • attach metadata to content that marks it as trusted vs external vs user_provided.

This makes it harder for client runtimes and LLM orchestrators to apply different safety policies depending on the source.

Proposal

  1. Introduce a content sanitization hook

    • Add an optional ContentSanitizer (or similar) interface that can be configured on the server:
      • sanitize_text(text: str, meta: ContentMeta) -> str
      • sanitize_json(obj: Any, meta: ContentMeta) -> Any
    • Provide a default implementation that is conservative but non-breaking (e.g., escaping obvious control sequences while leaving plain text mostly untouched).
    • Allow servers to plug in more aggressive sanitizers depending on their threat model.
  2. Content trust metadata

    • Extend content structures with a trust_level field, something like:
      • trusted – server-generated system content,
      • external – from APIs, files, databases,
      • user_provided – direct user input / uploads.
    • This can be optional at first, defaulting to a sensible value, but enables clients/agents to treat content differently.
  3. Documentation & examples

    • Add a “Security / Prompt Injection” section that:
      • explains common patterns where prompt injection can appear,
      • shows how to configure a sanitizer,
      • illustrates how trust levels can be used by clients.

Why this matters

  • MCP is often used as a bridge between LLMs and external data sources.
  • Prompt injection is one of the primary risks for GenAI systems today.
  • Having first-class support for sanitization and trust metadata at the SDK level makes it easier for server authors and client runtimes to implement safe defaults.

Acceptance criteria

  • [ ] A pluggable sanitization hook is exposed at the server layer.
  • [ ] Content objects include optional trust metadata (or there is a clear extension point for it).
  • [ ] Examples and docs illustrate how to configure sanitization and how clients can use trust metadata.
  • [ ] The default behavior remains non-breaking but can be hardened by users.

References

No response

dgenio avatar Nov 28 '25 12:11 dgenio