python-sdk
Add configurable concurrency limits and backpressure for MCP servers
Description
Summary
MCP servers built with the Python SDK currently process incoming requests as fast as they arrive, limited mostly by system resources. There are no built-in mechanisms to:
- limit concurrent tool executions,
- apply backpressure when the server is overloaded,
- surface overload conditions in a structured way.
This is particularly important when LLMs or agents can generate many tool calls in parallel.
Problems
- DoS / overload risk: A buggy or malicious client can issue hundreds or thousands of tool calls in parallel.
- Resource exhaustion: Long-running tools can accumulate in flight, consuming memory and CPU.
- No explicit overload signal: Clients have no way to know the server is overloaded beyond timeouts or generic errors.
Proposal
- Configurable concurrency limits
  - Add a setting (e.g. `max_concurrent_tools`) enforced by a semaphore in the server:
    - Only N tool executions are active at once.
    - Additional requests wait in a queue, up to a limit.
- Request queue and overload handling
  - Maintain a small queue of pending requests.
  - When the queue is full, reject new requests with a clear, structured error (e.g. "server overloaded" / 429-like semantics).
- Backpressure integration
  - Where transports support it (e.g. HTTP status codes), reflect overload in transport responses.
  - For other transports, return a well-defined MCP error code indicating overload.
- Configuration & docs
  - Expose configuration parameters with sensible defaults.
  - Document how to tune these settings for different deployment scenarios.
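The semaphore-plus-bounded-queue behaviour described above could be sketched roughly as follows. All names here (`ToolExecutionLimiter`, `OverloadedError`) are illustrative stand-ins, not the SDK's actual API:

```python
import asyncio


class OverloadedError(Exception):
    """Stand-in for a structured 'server overloaded' (429-like) error."""


class ToolExecutionLimiter:
    """Allow N concurrent tool executions plus a bounded wait queue."""

    def __init__(self, max_concurrent_tools: int, max_queued: int) -> None:
        # Gates how many tools are actually executing at once.
        self._active = asyncio.Semaphore(max_concurrent_tools)
        # Gates total admitted requests (executing + waiting in queue).
        self._slots = asyncio.Semaphore(max_concurrent_tools + max_queued)

    async def run(self, tool, *args):
        # All slots taken: active tools and queue are both full, so
        # reject immediately instead of buffering without bound.
        if self._slots.locked():
            raise OverloadedError("server overloaded; retry later")
        await self._slots.acquire()  # value > 0 here, so this won't block
        try:
            async with self._active:
                return await tool(*args)
        finally:
            self._slots.release()
```

Any call beyond `max_concurrent_tools + max_queued` simultaneous requests fails fast with the overload error rather than piling up in memory.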
Why this matters
- Robustness: Servers degrade gracefully under load instead of crashing or hanging.
- Predictability: Clients and LLM agents can interpret overload errors and adjust behavior (e.g., backoff and retry).
- Security: Basic protection against accidental or deliberate flooding.
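On the client side, a structured overload error makes the "backoff and retry" behaviour mentioned above straightforward. A minimal sketch, assuming a hypothetical `OverloadedError` raised for the overload condition (not an actual SDK exception):

```python
import asyncio
import random


class OverloadedError(Exception):
    """Stand-in for a structured 'server overloaded' MCP error."""


async def call_with_backoff(call_tool, max_retries=4, base_delay=0.1):
    """Retry an overloaded call with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await call_tool()
        except OverloadedError:
            if attempt == max_retries:
                raise
            # Double the delay each attempt; jitter spreads out retries
            # from many clients so they don't re-flood the server at once.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            await asyncio.sleep(delay)
```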
Acceptance criteria
- [ ] Server supports a configurable `max_concurrent_tools` limit.
- [ ] Server supports a bounded queue for pending requests and rejects new ones when full.
- [ ] Overload conditions are surfaced via a clear, documented error code.
- [ ] Documentation describes how to configure and interpret these limits.
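To make the "sensible defaults" criterion concrete, the configuration surface might look something like the sketch below. The field names follow this proposal, but the default values and the error code are purely illustrative (the code sits in the JSON-RPC implementation-defined server-error range):

```python
from dataclasses import dataclass


@dataclass
class ConcurrencySettings:
    """Hypothetical settings object; names and defaults are illustrative."""

    max_concurrent_tools: int = 10    # tool executions active at once
    max_queued_requests: int = 50     # pending requests before rejection
    overload_error_code: int = -32001  # example JSON-RPC-style error code

    def __post_init__(self) -> None:
        # Validate eagerly so misconfiguration fails at startup.
        if self.max_concurrent_tools < 1:
            raise ValueError("max_concurrent_tools must be >= 1")
        if self.max_queued_requests < 0:
            raise ValueError("max_queued_requests must be >= 0")
```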