Fix: Prevent session manager shutdown on individual session crash
Previously, an unhandled exception within a single MCP session's MCPServer.run() task could propagate to the StreamableHTTPSessionManager's main task group. This would cause the entire task group to cancel, effectively shutting down the session manager and terminating all active sessions.
This commit addresses the issue by:
- Wrapping the `self.app.run(...)` call within the `run_server` (for stateful requests) and `run_stateless_server` (for stateless requests) inner functions in `StreamableHTTPSessionManager` with a `try...except Exception` block.
- Logging any caught exceptions along with the session ID (for stateful requests) to aid in debugging the crashed session.
This change ensures that if a single session encounters an unexpected error and crashes, it only affects that specific session. The StreamableHTTPSessionManager will continue to run, and other active sessions will remain operational. This significantly improves the robustness and availability of the server.
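The isolation pattern described above can be sketched roughly as follows. This is a minimal illustration and not the SDK's actual code: `run_session`, the `crashed` list, and the session IDs are hypothetical stand-ins, and `asyncio.gather` from the standard library is used here in place of the session manager's real task group.

```python
import asyncio
import logging
from typing import List, Optional

logger = logging.getLogger("session_manager_sketch")


async def run_session(session_id: str, fail: bool) -> str:
    """Hypothetical stand-in for a session's `self.app.run(...)` call."""
    await asyncio.sleep(0)  # yield to the event loop, as a real session would
    if fail:
        raise ConnectionError(f"client for {session_id} disconnected")
    return f"{session_id} finished"


async def run_server(session_id: str, fail: bool, crashed: List[str]) -> Optional[str]:
    """Guarded wrapper: a crash is logged and stays inside this one session."""
    try:
        return await run_session(session_id, fail)
    except Exception:
        # Log the traceback together with the session ID, then swallow the
        # error so it never reaches the code that supervises all sessions.
        logger.exception("Session %s crashed", session_id)
        crashed.append(session_id)
        return None


async def main():
    crashed: List[str] = []
    # One session fails, the other completes normally.
    results = await asyncio.gather(
        run_server("session-a", True, crashed),
        run_server("session-b", False, crashed),
    )
    return results, crashed


results, crashed = asyncio.run(main())
```

Catching `Exception` rather than `BaseException` means `asyncio.CancelledError` still propagates (it has been a `BaseException` subclass since Python 3.8), so a deliberate shutdown of the manager can still cancel sessions.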
Motivation and Context
Unhandled exceptions, such as a network error in a single session, would render the server unusable until it was restarted.
How Has This Been Tested?
Yes. I generated client disconnects and observed that the server logged the unhandled error but remained running and stable.
Breaking Changes
No breaking changes
Types of changes
- [X] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Documentation update
Checklist
- [X] I have read the MCP Documentation
- [X] My code follows the repository's style guidelines
- [X] New and existing tests pass locally
- [X] I have added appropriate error handling
- [X] I have added or updated documentation as needed
Additional context
To address a test failure, I updated the cleanup logic to only remove error sessions and not those that have been explicitly terminated. I see that the terminated sessions are retained and used for the 404 vs 400 return code logic. However, I do not see any place where those are ever removed so I imagine they just accumulate until server restart. It's not within the scope of this PR to address but that's likely not desirable behavior.
any update on this @soby? maybe they implemented this in some other issue perhaps?
@UYousafzai it's ready to merge IMO. I've been running it in a fork and it works as intended to keep the server up and stable. Hopefully it can make it into tomorrow's release. @ihrpr ?
Fairly certain this is a duplicate of https://github.com/modelcontextprotocol/python-sdk/issues/820 and https://github.com/modelcontextprotocol/python-sdk/pull/822 which is associated with a hackerone issue
@Sillocan Those are for handling of specific exception sources within a connection (what @NAVNAV221 was referencing). This PR implements catch-all exception handling for all streams/sessions so that any future unhandled exceptions within a stream/session are logged but do not leave the whole server broken until restart
@NAVNAV221 Is this good to merge?
@ihrpr gtg