
Add response limiting middleware

Open strawgate opened this issue 3 months ago • 2 comments

Enhancement

Let's add middleware that limits the response size of tool calls, either trimming the response if it isn't a structured response or otherwise raising a tool error.

strawgate, Oct 04 '25 16:10

Claude finished @strawgate's task


Investigation Complete ✓

Recommendation

Implement a new ResponseLimitingMiddleware in src/fastmcp/server/middleware/response_limiting.py that intercepts tool call responses and enforces size limits. The middleware should:

  1. Check response size after tool execution
  2. For structured responses (with output_schema): Raise a ToolError when exceeding limit
  3. For unstructured responses (text-only): Truncate to the specified limit with an indicator
  4. Configuration options: Max size (bytes), truncation suffix, specific tool filtering

This follows established patterns in FastMCP's existing middleware (LoggingMiddleware, ErrorHandlingMiddleware) and leverages the on_call_tool hook to target tool responses specifically.

Findings

Context

Issue #2004 requests middleware to limit response sizes from tool calls. This is particularly important for:

  • Preventing context window overflow in LLM applications
  • Managing memory usage for large responses
  • Enforcing data transfer limits in production environments

Architecture Analysis

Existing Middleware Patterns:

FastMCP has a mature middleware system with several implementations:

  • LoggingMiddleware (logging.py): Intercepts all messages, estimates payload size/tokens
  • ErrorHandlingMiddleware (error_handling.py): Transforms exceptions to MCP errors
  • TimingMiddleware (timing.py): Tracks request duration
  • RateLimitingMiddleware (rate_limiting.py): Enforces request frequency limits

All middleware classes inherit from the Middleware base class, which provides hooks such as:

  • on_message: All MCP messages
  • on_call_tool: Specifically for tool execution (perfect for this use case)
  • on_request: All requests expecting responses

Tool Response Structure:

Tool calls return a ToolResult object (src/fastmcp/tools/tool.py:66-103), which contains:

  • content: List of ContentBlock objects (can include TextContent, ImageContent, etc.)
  • structured_content: Optional dict for tools with output_schema defined

Error Handling:

FastMCP defines ToolError (src/fastmcp/exceptions.py:18) for tool operation errors. The ErrorHandlingMiddleware already demonstrates how to transform errors in middleware.

Size Estimation:

LoggingMiddleware already implements payload size estimation:

payload_length = len(payload)  # Character count
payload_tokens = payload_length // 4  # Rough token estimate
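
A character count and a byte count diverge as soon as the payload contains non-ASCII text, which matters if the limit is expressed in bytes. The following is a minimal sketch of that difference using only the standard library. `estimate_size` is a hypothetical helper name, and `json.dumps` is swapped in for whatever serializer the middleware ends up using; the `// 4` token heuristic is the one quoted above.

```python
import json


def estimate_size(payload: object) -> dict[str, int]:
    """Return character, byte, and rough token counts for a JSON-serializable payload."""
    # ensure_ascii=False keeps non-ASCII characters unescaped, so the
    # character count and the UTF-8 byte count can differ.
    text = json.dumps(payload, ensure_ascii=False)
    return {
        "chars": len(text),
        "bytes": len(text.encode("utf-8")),
        "tokens": len(text) // 4,  # same rough heuristic as the snippet above
    }
```

For ASCII-only payloads the two counts agree, so the distinction only shows up for tools that return accented text, CJK text, emoji, and similar content.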

Related Work

Issue #1844 - Response Caching Middleware:

  • Mentions "max size of item for cache would be good too, maybe 1MB default"
  • Draft PR #1845 implements caching with size limits
  • Shows pattern for handling response size considerations

Issue #1854 - Common Utilities for Serializing/Sizing:

  • Identifies need for consolidated utilities for estimating object size
  • LoggingMiddleware already uses pydantic_core.to_json() for serialization
  • Could potentially reuse this infrastructure

Issue #1988 - Call Tool Middleware:

  • Discusses how exceptions propagate through middleware stack
  • Relevant for understanding error handling in response limiting

PR #1805 - Refactor Logging Middleware:

  • Consolidated logging implementations
  • Shows best practices for middleware configuration
  • Includes payload size/token estimation

Detailed Action Plan

Implementation Steps

1. Create Response Limiting Middleware

File: src/fastmcp/server/middleware/response_limiting.py

Class Structure:

import logging

import mcp.types as mt

from fastmcp.server.middleware import CallNext, Middleware, MiddlewareContext
from fastmcp.tools.tool import ToolResult


class ResponseLimitingMiddleware(Middleware):
    def __init__(
        self,
        *,
        max_size: int = 1_000_000,  # 1MB default
        truncation_suffix: str = "\n\n[Response truncated due to size limit]",
        tools: list[str] | None = None,  # Filter specific tools
        raise_on_structured: bool = True,  # Raise error for structured responses
        logger: logging.Logger | None = None,
    ) -> None:
        ...

    async def on_call_tool(
        self,
        context: MiddlewareContext[mt.CallToolRequestParams],
        call_next: CallNext[mt.CallToolRequestParams, ToolResult],
    ) -> ToolResult:
        # 1. Execute tool via call_next
        # 2. Serialize result to measure size
        # 3. Check if size exceeds limit
        # 4. Handle based on structured vs unstructured
        # 5. Return modified or original result
        ...

Key Implementation Details:

  1. Size Measurement:

    • Use pydantic_core.to_json() to serialize the ToolResult
    • Measure the byte length of serialized content
    • This is close to how LoggingMiddleware estimates size, although that code counts characters (len(payload)) rather than bytes
  2. Structured Response Handling:

    • Check if ToolResult.structured_content is not None
    • If structured AND over limit: Raise ToolError with clear message
    • Rationale: Structured responses shouldn't be corrupted by truncation
  3. Unstructured Response Handling:

    • Extract text from TextContent blocks
    • Truncate to max_size - len(truncation_suffix)
    • Append the truncation suffix
    • Create new ToolResult with truncated content
  4. Tool Filtering:

    • If tools parameter specified, only apply to those tools
    • Use context.message.name to check tool name
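
The four points above can be sketched without FastMCP installed. Everything in this block is a stand-in chosen for illustration: `Result` is a simplified substitute for ToolResult, `ToolError` is a local class rather than the one in fastmcp.exceptions, and `limit_result` and `truncate_utf8` are hypothetical names. Size is measured on UTF-8 text plus JSON-encoded structured data, not on a serialized ToolResult, and truncation is done on bytes so that multi-byte characters cannot push the result over the limit.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional

DEFAULT_SUFFIX = "\n\n[Response truncated due to size limit]"


class ToolError(Exception):
    """Stand-in for fastmcp.exceptions.ToolError (not the real class)."""


@dataclass
class Result:
    """Simplified stand-in for ToolResult: text blocks plus optional structured data."""

    texts: list[str]
    structured: Optional[dict[str, Any]] = None


def truncate_utf8(text: str, max_bytes: int, suffix: str) -> str:
    """Cut text so text + suffix fits in max_bytes of UTF-8 (if the suffix itself fits)."""
    budget = max(max_bytes - len(suffix.encode("utf-8")), 0)
    # errors="ignore" drops a multi-byte character that the cut split in half.
    head = text.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return head + suffix


def limit_result(
    result: Result,
    tool_name: str,
    *,
    max_size: int = 1_000_000,
    suffix: str = DEFAULT_SUFFIX,
    tools: Optional[list[str]] = None,
) -> Result:
    """Apply the tool filter, the size check, and structured/unstructured handling."""
    if tools is not None and tool_name not in tools:
        return result  # tool is not covered by the filter
    size = sum(len(t.encode("utf-8")) for t in result.texts)
    if result.structured is not None:
        size += len(json.dumps(result.structured).encode("utf-8"))
    if size <= max_size:
        return result
    if result.structured is not None:
        # Truncating would corrupt schema-validated data, so fail loudly instead.
        raise ToolError(f"Response from {tool_name!r} is {size} bytes; limit is {max_size}")
    return Result(texts=[truncate_utf8("\n".join(result.texts), max_size, suffix)])
```

In the real middleware this logic would presumably live inside on_call_tool, after awaiting call_next, with the tool name taken from context.message.name.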

2. Add Unit Tests

File: tests/server/middleware/test_response_limiting.py

Test Cases:

  • Test structured response over limit raises ToolError
  • Test unstructured response over limit gets truncated
  • Test response under limit passes unchanged
  • Test tool filtering (only specific tools limited)
  • Test truncation suffix is applied correctly
  • Test with different content types (text, images, etc.)
  • Test integration with ErrorHandlingMiddleware

Example Test Structure:

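from fastmcp import Client, FastMCP

# Proposed module from step 1; it does not exist until this issue is implemented.
from fastmcp.server.middleware.response_limiting import ResponseLimitingMiddleware


async def test_truncate_large_unstructured_response():
    mcp = FastMCP("test")
    mcp.add_middleware(ResponseLimitingMiddleware(max_size=100))

    @mcp.tool()
    def large_response() -> str:
        return "x" * 1000

    async with Client(mcp) as client:
        result = await client.call_tool("large_response", {})
        # Assert truncation occurred and the suffix was appended
        assert len(result.content[0].text) <= 100
        assert "[Response truncated" in result.content[0].text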

3. Update Documentation

File: docs/servers/middleware.mdx

Add section after existing middleware examples:

Content:

  • Overview of response limiting use cases
  • Configuration options
  • Example: Basic usage
  • Example: Different limits for different tools
  • Example: Handling structured vs unstructured responses
  • Best practices for setting size limits

4. Update Exports

File: src/fastmcp/server/middleware/__init__.py

Add:

from .response_limiting import ResponseLimitingMiddleware

5. Integration Points

Consider these integration scenarios:

  1. With LoggingMiddleware:

    • Log when truncation occurs
    • Include original vs final size in logs
  2. With ErrorHandlingMiddleware:

    • Ensure ToolError for structured responses is handled properly
    • Test error transformation and logging
  3. With CachingMiddleware (future):

    • Response size might affect caching decisions
    • Could share size calculation utilities
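
As an illustration of the first integration point, a truncation event could be logged with both sizes in a single record. This is only a sketch with the standard `logging` module: the logger name and the `log_truncation` helper are hypothetical, and it does not use LoggingMiddleware itself.

```python
import logging

# Hypothetical logger name; a real implementation would likely accept a
# logger through the middleware's constructor instead.
logger = logging.getLogger("response_limiting")


def log_truncation(tool_name: str, original_size: int, final_size: int) -> None:
    """Emit one warning describing how much a tool response was cut."""
    logger.warning(
        "Truncated response from tool %r: %d -> %d bytes",
        tool_name,
        original_size,
        final_size,
    )
```

Logging at warning level is a judgment call here; a server with many large responses might prefer info level to keep the logs quiet.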

Testing Strategy

  1. Unit Tests: Test middleware in isolation
  2. Integration Tests: Test with other middleware
  3. Performance Tests: Ensure minimal overhead for small responses
  4. Edge Cases: Empty responses, binary content, mixed content blocks

Alternative Considerations

Token-Based Limiting:

  • Could use length // 4 like LoggingMiddleware
  • More relevant for LLM applications
  • Could be a future enhancement
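
A token-based variant could look roughly like the sketch below, which reuses the `length // 4` heuristic. The function names are hypothetical, and the heuristic is only a coarse approximation of how a real tokenizer would count.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about four characters per token."""
    return len(text) // 4


def truncate_to_tokens(text: str, max_tokens: int, suffix: str = "") -> str:
    """Trim text so that its estimated token count stays within max_tokens."""
    if estimate_tokens(text) <= max_tokens:
        return text
    # Reserve room for the suffix inside the character budget.
    keep = max(max_tokens * 4 - len(suffix), 0)
    return text[:keep] + suffix
```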

Streaming/Chunking:

  • Could implement progressive size checking for streaming responses
  • More complex but could prevent wasted computation
  • Consider for future enhancement
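
Progressive checking might be sketched as a generator wrapper that keeps a running byte total and stops pulling from the producer once the limit is crossed. This assumes a hypothetical producer that yields text chunks; `limited_chunks` and `ResponseTooLarge` are names invented for this example and are not part of FastMCP.

```python
from typing import Iterable, Iterator


class ResponseTooLarge(Exception):
    """Raised when the accumulated chunk size exceeds the limit (hypothetical)."""


def limited_chunks(chunks: Iterable[str], max_bytes: int) -> Iterator[str]:
    """Yield chunks until their combined UTF-8 size would exceed max_bytes."""
    total = 0
    for chunk in chunks:
        total += len(chunk.encode("utf-8"))
        if total > max_bytes:
            # Stop here so the producer is never asked for further chunks.
            raise ResponseTooLarge(f"{total} bytes exceeds limit of {max_bytes}")
        yield chunk
```

Because the wrapper is lazy, the producer does no further work after the limit is hit, which is the "wasted computation" saving mentioned above.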

Utility Function:

  • Could extract size calculation to src/fastmcp/utilities/ per issue #1854
  • Would benefit both this middleware and caching middleware
  • Recommend doing this if implementing both features

Related Items

Related Issues

Repository | Issue | Relevance
jlowin/fastmcp | Response Caching Middleware #1844 | Also needs max response size handling; mentions "1MB default" as a good limit. Draft PR #1845 implements this.
jlowin/fastmcp | Common utilities for serializing/deserializing and estimating size #1854 | Proposes consolidated utility functions for sizing MCP response types; could be leveraged here.
jlowin/fastmcp | Call tool middleware #1988 | Discusses exception propagation through the middleware stack; relevant for error handling.
jlowin/fastmcp | Add request duration to logging middleware #1973 | Recent middleware enhancement showing active middleware development.

Related Pull Requests

Repository | PR | Relevance
jlowin/fastmcp | Add Response Caching Middleware #1845 | Draft PR implementing caching with max item size support; similar size limiting concerns.
jlowin/fastmcp | Refactor Logging and Structured Logging Middleware #1805 | Merged. Shows best practices for middleware implementation; includes payload size estimation.
jlowin/fastmcp | Updates to Logging Middleware #1974 | Open. Adds request duration tracking; shows ongoing middleware improvements.
jlowin/fastmcp | Internal refactor of MCP handlers #2005 | Open. Refactors handler naming conventions; may affect middleware patterns.

Related Files

Repository | File | Relevance | Key Sections
jlowin/fastmcp | middleware.py | High - Base class and patterns | 80-196 - Middleware base class, on_call_tool hook
jlowin/fastmcp | logging.py | High - Size estimation pattern | 32-46 - Payload serialization, 69-77 - Size/token estimation
jlowin/fastmcp | error_handling.py | High - Error transformation | 78-108 - Exception handling and transformation
jlowin/fastmcp | tool.py | High - ToolResult structure | 66-103 - ToolResult class definition
jlowin/fastmcp | exceptions.py | Medium - Error types | 18-19 - ToolError definition
jlowin/fastmcp | middleware.mdx | High - Documentation structure | 1-100 - Middleware documentation patterns

Such middleware could also be used for, e.g., additional logging.

jessealama, Oct 14 '25 06:10