python-sdk
Add configurable concurrency limits and backpressure for MCP servers
Description
Summary
MCP servers built with the Python SDK currently process incoming requests as fast as they arrive, limited mostly by system resources. There are no built-in mechanisms to:
- limit concurrent tool executions,
- apply backpressure when the server is overloaded,
- surface overload conditions in a structured way.
This is particularly important when LLMs or agents can generate many tool calls in parallel.
Problems
- DoS / overload risk: A buggy or malicious client can issue hundreds or thousands of tool calls in parallel.
- Resource exhaustion: Long-running tools can accumulate in flight, consuming memory and CPU.
- No explicit overload signal: Clients have no way to know the server is overloaded beyond timeouts or generic errors.
Proposal
- Configurable concurrency limits
  - Add a setting (e.g. `max_concurrent_tools`) enforced by a semaphore in the server:
    - Only N tool executions are active at once.
    - Additional requests wait in a queue, up to a limit.
- Request queue and overload handling
  - Maintain a small queue of pending requests.
  - When the queue is full, reject new requests with a clear, structured error (e.g. "server overloaded" / 429-like semantics).
- Backpressure integration
  - Where transports support it (e.g. HTTP status codes), reflect overload in transport responses.
  - For other transports, return a well-defined MCP error code indicating overload.
- Configuration & docs
  - Expose configuration parameters with sensible defaults.
  - Document how to tune these settings for different deployment scenarios.
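The semaphore-plus-bounded-queue behaviour described above could be sketched roughly as follows. All names here (`ToolExecutionLimiter`, `OverloadedError`) are illustrative stand-ins, not the SDK's actual API:

```python
import asyncio


class OverloadedError(Exception):
    """Stand-in for a structured 'server overloaded' (429-like) error."""


class ToolExecutionLimiter:
    """Allow N concurrent tool executions plus a bounded wait queue."""

    def __init__(self, max_concurrent_tools: int, max_queued: int) -> None:
        # Gates how many tools are actually executing at once.
        self._active = asyncio.Semaphore(max_concurrent_tools)
        # Gates total admitted requests (executing + waiting in queue).
        self._slots = asyncio.Semaphore(max_concurrent_tools + max_queued)

    async def run(self, tool, *args):
        # All slots taken: active tools and queue are both full, so
        # reject immediately instead of buffering without bound.
        if self._slots.locked():
            raise OverloadedError("server overloaded; retry later")
        await self._slots.acquire()  # value > 0 here, so this won't block
        try:
            async with self._active:
                return await tool(*args)
        finally:
            self._slots.release()
```

Any call beyond `max_concurrent_tools + max_queued` simultaneous requests fails fast with the overload error rather than piling up in memory.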
Why this matters
- Robustness: Servers degrade gracefully under load instead of crashing or hanging.
- Predictability: Clients and LLM agents can interpret overload errors and adjust behavior (e.g., backoff and retry).
- Security: Basic protection against accidental or deliberate flooding.
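On the client side, a structured overload error makes the "backoff and retry" behaviour mentioned above straightforward. A minimal sketch, assuming a hypothetical `OverloadedError` raised for the overload condition (not an actual SDK exception):

```python
import asyncio
import random


class OverloadedError(Exception):
    """Stand-in for a structured 'server overloaded' MCP error."""


async def call_with_backoff(call_tool, max_retries=4, base_delay=0.1):
    """Retry an overloaded call with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await call_tool()
        except OverloadedError:
            if attempt == max_retries:
                raise
            # Double the delay each attempt; jitter spreads out retries
            # from many clients so they don't re-flood the server at once.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            await asyncio.sleep(delay)
```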
Acceptance criteria
- [ ] Server supports a configurable `max_concurrent_tools` limit.
- [ ] Server supports a bounded queue for pending requests and rejects new ones when full.
- [ ] Overload conditions are surfaced via a clear, documented error code.
- [ ] Documentation describes how to configure and interpret these limits.
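To make the "sensible defaults" criterion concrete, the configuration surface might look something like the sketch below. The field names follow this proposal, but the default values and the error code are purely illustrative (the code sits in the JSON-RPC implementation-defined server-error range):

```python
from dataclasses import dataclass


@dataclass
class ConcurrencySettings:
    """Hypothetical settings object; names and defaults are illustrative."""

    max_concurrent_tools: int = 10    # tool executions active at once
    max_queued_requests: int = 50     # pending requests before rejection
    overload_error_code: int = -32001  # example JSON-RPC-style error code

    def __post_init__(self) -> None:
        # Validate eagerly so misconfiguration fails at startup.
        if self.max_concurrent_tools < 1:
            raise ValueError("max_concurrent_tools must be >= 1")
        if self.max_queued_requests < 0:
            raise ValueError("max_queued_requests must be >= 0")
```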