Mitigate prompt-injection risks from tool/resource content (sanitization & trust metadata)

Open dgenio opened this issue 3 months ago • 0 comments

Description

Summary

Tools and resources can currently return arbitrary content (text, JSON, etc.) that is forwarded directly to clients and often to LLMs. There is no built-in way to:

sanitize content, or
indicate its trust level.

This leaves room for prompt-injection attacks and other malicious payloads, especially when MCP servers interact with untrusted files or external APIs.

Problem

Common scenarios:

A resource reads from file:// or another untrusted source; the file can contain prompt-injection content such as "Ignore all previous instructions and …".
A tool calls an external API; the HTTP response body is forwarded directly to the LLM.
Content is serialized in a way that might encode protocol-like structures or control sequences the client is not expecting.

The MCP SDK does not currently:

offer a built-in ContentSanitizer or similar abstraction, or
attach metadata to content that marks it as trusted vs external vs user_provided.

This makes it harder for client runtimes and LLM orchestrators to apply different safety policies depending on the source.

Proposal

Introduce a content sanitization hook
- Add an optional ContentSanitizer (or similar) interface that can be configured on the server:
  - sanitize_text(text: str, meta: ContentMeta) -> str
  - sanitize_json(obj: Any, meta: ContentMeta) -> Any
- Provide a default implementation that is conservative but non-breaking (e.g., escaping obvious control sequences while leaving plain text mostly untouched).
- Allow servers to plug in more aggressive sanitizers depending on their threat model.
Content trust metadata
- Extend content structures with a trust_level field, something like:
  - trusted – server-generated system content,
  - external – from APIs, files, databases,
  - user_provided – direct user input / uploads.
- This can be optional at first, defaulting to a sensible value, but enables clients/agents to treat content differently.
Documentation & examples
- Add a “Security / Prompt Injection” section that:
  - explains common patterns where prompt injection can appear,
  - shows how to configure a sanitizer,
  - illustrates how trust levels can be used by clients.

Why this matters

MCP is often used as a bridge between LLMs and external data sources.
Prompt injection is one of the primary risks for GenAI systems today.
Having first-class support for sanitization and trust metadata at the SDK level makes it easier for server authors and client runtimes to implement safe defaults.

Acceptance criteria

[ ] A pluggable sanitization hook is exposed at the server layer.
[ ] Content objects include optional trust metadata (or there is a clear extension point for it).
[ ] Examples and docs illustrate how to configure sanitization and how clients can use trust metadata.
[ ] The default behavior remains non-breaking but can be hardened by users.

References

No response

Nov 28 '25 12:11 dgenio