Proposal: Session recovery / shared session storage for distributed MCP env
Is your feature request related to a problem? Please describe.
Currently the MCP SDK uses in-memory storage for sessions (https://github.com/modelcontextprotocol/go-sdk/blob/dd49b7039b69498b242fe0ffea5016b734e6181c/mcp/streamable.go#L102), which makes session recovery tricky, particularly in distributed, production-like environments with multiple MCP server instances up and running.
As a result, when a client has a valid session ID but for some reason has lost or partially lost the connection downstream of an entrypoint (client -> entrypoint -> [no connection] downstream), the request may be routed by the load balancer to various MCP instances that may not have the session context in memory.
Additionally, in a scalable multi-instance deployment, session context that lives only in memory can often be lost during re-deployments with no way to recover gracefully.
Describe the solution you'd like
Potentially introduce shared session storage (e.g. Redis) so that session state can be shared, allowing graceful recovery after connection loss or re-deployments.
EDIT(@findleyr): The current concrete proposal is in https://github.com/modelcontextprotocol/go-sdk/issues/148#issuecomment-3156607256.
Funny you should ask: https://github.com/jba/go-sdk/blob/event2/mcp/event.go#L144
That should be a PR on top of #145, but my GitHub-fu isn't strong enough to figure out how to do that.
I saw this one but was not sure it would actually allow extending session storage, thanks for confirming 👍
We are facing the same issue. The official documentation states that:
> When a client receives HTTP 404 in response to a request containing an Mcp-Session-Id,
> it MUST start a new session by sending a new InitializeRequest without a session ID attached.
It seems the inspector we use for testing does not do that (yet). But even if it did, I think the session could still change between initialization and the next request, so it would not help with this issue.
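For illustration, the client behavior the spec quote describes could look roughly like the sketch below. Everything here is a simplified assumption: `fakeServer`, `client`, and the session IDs are hypothetical, a real initialize request carries a JSON-RPC body, and the handler is called directly instead of going over the network.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

const sessionHeader = "Mcp-Session-Id"

// fakeServer stands in for one MCP server instance that only knows the
// session ID "fresh". A request without a session ID is treated as initialize.
func fakeServer() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.Header.Get(sessionHeader) {
		case "":
			w.Header().Set(sessionHeader, "fresh")
			w.WriteHeader(http.StatusOK)
		case "fresh":
			w.WriteHeader(http.StatusOK)
		default:
			http.Error(w, "session not found", http.StatusNotFound)
		}
	})
}

// client is a hypothetical, heavily simplified MCP client.
type client struct {
	server    http.Handler // stand-in for a real HTTP round trip
	sessionID string
	inits     int // number of times initialize ran
}

// post sends one request, attaching the session ID if there is one.
func (c *client) post() *http.Response {
	req := httptest.NewRequest(http.MethodPost, "/mcp", nil)
	if c.sessionID != "" {
		req.Header.Set(sessionHeader, c.sessionID)
	}
	rec := httptest.NewRecorder()
	c.server.ServeHTTP(rec, req)
	return rec.Result()
}

// initialize starts a new session: no session ID is attached.
func (c *client) initialize() {
	c.sessionID = ""
	resp := c.post()
	c.sessionID = resp.Header.Get(sessionHeader)
	c.inits++
}

// call sends a request and, on a 404 for a known session ID, starts a new
// session and retries once.
func (c *client) call() int {
	resp := c.post()
	if resp.StatusCode == http.StatusNotFound && c.sessionID != "" {
		c.initialize()
		resp = c.post()
	}
	return resp.StatusCode
}

func main() {
	c := &client{server: fakeServer(), sessionID: "stale"}
	fmt.Println(c.call(), c.sessionID, c.inits)
}
```

Note that this only recovers the connection. Any state tied to the old session is still gone, which is the concern raised above.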
@jba Sorry if it's a stupid question but how would a session storage help with this issue? If the client initializes with host A and later on communicates with host B (that does not know about the session from host A), would B then create a new session from that id?
@ln-12 Not stupid at all. I misunderstood the question.
As far as I can tell, this is out of scope for the spec.
Alright, then a possible solution for now seems to be to only run the server on one host.
Thanks for the clarification!
@jba From this thread on the typescript sdk I found this documentation that described multi node deployment configurations. For my current use case, it would be enough to turn off the session id completely ("Stateless Mode") as we just forward the tool calls to some backend APIs and merge the results into the tool result.
Is there a way to turn session handling off in the Go sdk as well? In the typescript sdk this is possible via the following parameter as stated in the documentation (the Rust sdk and mcp-go also support it):
> To enable stateless mode, configure the StreamableHTTPServerTransport with:
> sessionIdGenerator: undefined
And would the second approach ("Persistent Storage Mode") be possible with the changes you are currently making?
I played around with the source code and got a seemingly working solution for which I created a draft: https://github.com/modelcontextprotocol/go-sdk/pull/159
For the quick testing I did, it seems to work as intended. However, please let me know what I might need to consider as well (if you want to support it).
@ln-12 is your use case completely stateless?
Thanks for providing the link to the discussion https://github.com/modelcontextprotocol/typescript-sdk/issues/330; it is ultimately the same problem we are trying to solve.
Unfortunately we cannot go with stateless mode, so the options are either persistent sessions (like they did here https://github.com/modelcontextprotocol/typescript-sdk/tree/main/src/examples#persistent-storage-mode) or maintaining a truly open connection, which is quite challenging in typical multi-tenant enterprise environments with load balancers, service meshes, or sticky-session smart routing.
Yes, we currently only have some kind of search tool which means all information needed is provided through the arguments of the tool.
For now, we deployed my fork of this library with my changes from #159 included and it works fine as far as we can see. But in the long term, supporting the persistent storage mode would also be preferred over the distributed message routing by us.
This is also important for distributed MCP servers implementing the SSE transport. If GET /mcp/sse gets routed to mcp-server-instance-a and the subsequent POST /mcp/messages is routed to mcp-server-instance-b, the session lookup will fail in the same way:
https://github.com/modelcontextprotocol/go-sdk/blob/a5aa370ea07c3dc28d353a53f8fc49ed6ae1fd92/mcp/sse.go#L168-L173
This problem could be mitigated with sticky sessions routing, but not all deployments will have that option.
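As a rough sketch of what sticky routing means here, a proxy could pick the backend by hashing the session ID, so every request for a session lands on the same instance. The function name and backend addresses are hypothetical, and real load balancers typically offer this as configuration rather than code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickBackend deterministically maps a session ID to one backend.
// It returns "" when there are no backends.
func pickBackend(sessionID string, backends []string) string {
	if len(backends) == 0 {
		return ""
	}
	h := fnv.New32a()
	h.Write([]byte(sessionID)) // hash.Hash.Write never returns an error
	return backends[h.Sum32()%uint32(len(backends))]
}

func main() {
	backends := []string{"mcp-a:8080", "mcp-b:8080", "mcp-c:8080"}
	for _, id := range []string{"session-1", "session-2", "session-1"} {
		fmt.Println(id, "->", pickBackend(id, backends))
	}
}
```

Plain modulo hashing remaps most sessions whenever the backend list changes (for example during a re-deployment), so it does not remove the need for recoverable session state.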
I think a solution could be a SessionStorage interface. Maybe in-memory would be a good default implementation, and developers could implement their own (leveraging their DB of choice) for distributed scenarios?
I would still prefer the option to turn session validation off. In our scenario, we don't need it at all and also don't want to maintain a DB for it. Currently we use these changes which did not yet lead to any issues.
👋🏻 I've been looking into what's required to move https://github.com/github/github-mcp-server and our hosted offering of that off https://github.com/mark3labs/mcp-go and on to this SDK, and this is the main thing stopping us from doing so.
Exposing a session storage interface gets us most of the way, but there's still the issue of hydrating the underlying ServerSession from the transport with the initialized state, as at the moment that is internal to the ServerSession. Without that, subsequent calls where an existing session is not already registered with the server get rejected for not being initialized.
I opened a draft PR with the changes I needed to get the remote GitHub MCP server working with this, and quickly put together an example repo of using this with a backing Redis store, but the solution of just exporting Initialize and requiring implementing sessions stores to call that feels less than ideal.
@omgitsads your example repo is a 404 for me.
I started on a SessionStorage interface as well -- mine works with SSE and Streamable, but otherwise they look pretty similar: https://github.com/modelcontextprotocol/go-sdk/pull/215.
I think session storage is just the start -- we probably need some changes to enable sending messages between server instances as well.
@joshwlewis whoops, it was private, it's public now. Thanks!
I took a look at your PR and implemented it on a branch in that repo but it suffers from the same issue: When the session store needs to create a new transport for an existing session and not just retrieve an already registered transport, i.e. when you restart the server or a request comes into a server that did not originally create the session, the underlying ServerSession is not marked as Initialized and requests cannot be fulfilled until it is.
> I think session storage is just the start -- we probably need some changes to enable sending messages between server instances as well.
Perhaps I'm wrong, but I think EventStore would allow for messages to be sent from one server to another with a database backed event store.
Thanks @omgitsads and @joshwlewis! We agree that a SessionStorage is necessary here, and would welcome a design for the storage API. We had some discussions about distributed sessions last week, and made the following observations, independent of the storage interface:
- We need a way to tell the transport layer about MCP session state, and don't think the transport should be peeking at messages: the ServerSession should control state. Note that there is more state than just the InitializeParams: log level and resource subscriptions are other examples of state. Our rough idea is to add a serializable SessionState type, and an UpdateSession method to Connection that is called whenever the MCP session state changes. Server.Connect needs to be updated to accept an initial SessionState.
- The only way to identify sessions is by their session ID: even if the streamable handler introduces e.g. cookies, the spec only guarantees that session id must be preserved across requests from the client. For this reason, it might make sense for us to expose a way for server authors to provide their own session ID generator.
- If stateful servers are going to support stream resumption, they'll also need to implement a custom EventSource. We should expose this on handler options.
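To make the session ID generator idea concrete, here is a minimal sketch of what a server author might plug in. The function name and the optional prefix are assumptions for illustration, not part of the SDK. The MCP spec asks for IDs made of visible ASCII characters that are globally unique and cryptographically secure, which hex-encoded random bytes satisfy.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newSessionID returns prefix followed by 128 random bits encoded as
// 32 hex characters. The prefix is a hypothetical hook, e.g. for a
// region or shard hint that a router could inspect.
func newSessionID(prefix string) (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return prefix + hex.EncodeToString(b), nil
}

func main() {
	id, err := newSessionID("eu1-")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(id)
}
```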
I will try to turn this into a formal proposal on Monday.
@ln-12 yes, we definitely need to support 'stateless mode', discussed in #10. That's another way to implement distributed servers, but we still need to allow for stateful distributed servers.
See #232.
@jba's PR contains a sketch of the design. I'll comment there (and you are all welcome to do so as well), and once we've agreed between ourselves, let's promote this to a proposal and write up the new API in the top comment.
Per the contribution guidelines, we'll need to hold the proposal open for a week, but that need not block us from merging the proposed API (though we can't release it).
I like the API in #232, particularly that it avoids the UpdateSession method by delegating session management to the server rather than transport.
Concretely, the proposal is to add the SessionState, SessionStore, and SessionOptions types, add SessionOptions to Server.Connect and to StreamableHTTPOptions, and remove SessionID from the Connection interface, as I don't believe it is necessary anymore:
type StreamableHTTPOptions struct {
	...
	// If set, SessionStore is used to persist and load sessions.
	SessionStore SessionStore
}

// SessionState is the state of a session.
type SessionState struct {
	// InitializeParams are the parameters from the initialize request.
	InitializeParams *InitializeParams `json:"initializeParams"`
	// LogLevel is the logging level for the session.
	LogLevel LoggingLevel `json:"logLevel"`
	// TODO: resource subscriptions
}

// SessionStore is an interface for storing and retrieving session state.
type SessionStore interface {
	// Load retrieves the session state for the given session ID.
	// If there is none, it returns nil, fs.ErrNotExist.
	Load(ctx context.Context, sessionID string) (*SessionState, error)
	// Store saves the session state for the given session ID.
	Store(ctx context.Context, sessionID string, state *SessionState) error
	// Delete removes the session state for the given session ID.
	Delete(ctx context.Context, sessionID string) error
}

type SessionOptions struct {
	// SessionID is the ID for the session.
	SessionID string
	// SessionState is the initial state for the session.
	SessionState *SessionState
	// If set, SessionStore is used to persist any changes to the session.
	SessionStore SessionStore
}

func (Server) Connect(context.Context, Transport, *SessionOptions) (ServerSession, error) // SessionOptions is added

type Connection interface {
	...
	// SessionID is removed
}
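As a sanity check on the interface shape, here is a minimal in-memory implementation sketch. It is not SDK code: `SessionState` is trimmed to a single string field so the example compiles on its own, and `memoryStore`, `newMemoryStore`, and `roundTrip` are hypothetical names. A Redis- or database-backed store would follow the same shape, serializing the state as JSON.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io/fs"
	"sync"
)

// SessionState is a trimmed-down stand-in for the proposed type; the
// proposal also carries InitializeParams.
type SessionState struct {
	LogLevel string `json:"logLevel"`
}

// SessionStore mirrors the proposed interface.
type SessionStore interface {
	Load(ctx context.Context, sessionID string) (*SessionState, error)
	Store(ctx context.Context, sessionID string, state *SessionState) error
	Delete(ctx context.Context, sessionID string) error
}

// memoryStore keeps session state in a mutex-guarded map.
type memoryStore struct {
	mu     sync.Mutex
	states map[string]SessionState
}

var _ SessionStore = (*memoryStore)(nil) // compile-time interface check

func newMemoryStore() *memoryStore {
	return &memoryStore{states: make(map[string]SessionState)}
}

// Load returns a copy of the stored state, or fs.ErrNotExist.
func (s *memoryStore) Load(_ context.Context, id string) (*SessionState, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	st, ok := s.states[id]
	if !ok {
		return nil, fs.ErrNotExist
	}
	return &st, nil
}

// Store saves a copy of state under id.
func (s *memoryStore) Store(_ context.Context, id string, state *SessionState) error {
	if state == nil {
		return errors.New("nil session state")
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.states[id] = *state
	return nil
}

// Delete removes the state for id; deleting a missing ID is not an error.
func (s *memoryStore) Delete(_ context.Context, id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.states, id)
	return nil
}

// roundTrip stores a state, loads it back, deletes it, and reports the
// loaded log level and whether a later Load fails with fs.ErrNotExist.
func roundTrip(store SessionStore, id, level string) (string, bool, error) {
	ctx := context.Background()
	if err := store.Store(ctx, id, &SessionState{LogLevel: level}); err != nil {
		return "", false, err
	}
	got, err := store.Load(ctx, id)
	if err != nil {
		return "", false, err
	}
	if err := store.Delete(ctx, id); err != nil {
		return "", false, err
	}
	_, err = store.Load(ctx, id)
	return got.LogLevel, errors.Is(err, fs.ErrNotExist), nil
}

func main() {
	level, gone, err := roundTrip(newMemoryStore(), "session-1", "debug")
	fmt.Println(level, gone, err)
}
```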
@jba does that sound right?
Yes. Although I was wondering if s/SessionOptions/ConnectOptions/, since maybe one day there will be an option for connecting that's not related to the session. I can't think of one offhand.
That makes complete sense to me as well. With sessions split into a separate interface, keeping the session ID in the connection no longer makes much sense.
We've been thinking about this more, and prototyping distributed sessions, and are concerned about the following two problems:
1. How can a distributed server make server->client requests, using the streamable transport? Since the subsequent POST containing the response may be routed to a different server, this doesn't seem possible.
2. How are distributed sessions supposed to be cleaned up? There's no way for the DELETE to propagate to all server processes with an ongoing connection? Furthermore, server->client pings don't work because of (1), so there's no way for keepalive to work.
Therefore, we suspect that supporting distributed sessions may be a non-starter with the current spec, and distributed services that require stateful sessions will have to do so via something like sticky routing.
Fundamentally, I think we need to present a coherent bidirectional JSON-RPC stream to the server. One way to do this would be via a Connection middleware that routes through a message bus of some sort. I could prototype that.
In the meantime, I think people who need distributed servers should focus on stateless mode (being added in https://github.com/modelcontextprotocol/go-sdk/pull/277).
Thanks for the update
> How can a distributed server make server->client requests, using the streamable transport? Since the subsequent POST containing the response may be routed to a different server, this doesn't seem possible.
We partially discussed it here https://github.com/modelcontextprotocol/go-sdk/pull/232#issuecomment-3176720800. Essentially, according to the current protocol spec, SSE is optional: some servers may not support it at all, while others may find workarounds for managing streams in distributed environments. Fundamentally it requires flexibility and fallbacks. This session storage interface is the same: it does not force anyone to override it, it just provides flexibility, doesn't it?
> How are distributed sessions supposed to be cleaned up? There's no way for the DELETE to propagate to all server processes with an ongoing connection? Furthermore, server->client pings don't work because of (1), so there's no way for keepalive to work.
It makes sense. However, in a distributed environment this problem does not go away even if sessions are not distributed, right? Clients may never send the DELETE call and instead just drop the connection and initiate a new session, which will land in a different container.
> Fundamentally, I think we need to present a coherent bidirectional JSON-RPC stream to the server. One way to do this would be via a Connection middleware that routes through a message bus of some sort. I could prototype that.
Can you please elaborate on this one? Does it mean that the SDK at the moment will support only streamable requests?
@BC-ACherednichenko thanks.
We can perhaps make it an implementation restriction that any server->client calls fail in stateless or distributed mode. We certainly need to implement this for stateless mode. As long as they fail immediately with a meaningful error, that's OK.
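A minimal sketch of the fail-fast restriction described above, under the assumption that the session knows it is in stateless mode. The `session` type, `callClient`, `tryPing`, and the error value are hypothetical, not SDK API. The point is only that the call returns a descriptive error immediately instead of waiting for a response that may be routed elsewhere.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errServerRequestUnsupported is a hypothetical sentinel error.
var errServerRequestUnsupported = errors.New("server->client requests are unavailable in stateless mode")

// session is a hypothetical stand-in for a server-side session.
type session struct {
	stateless bool
	// send performs the actual server->client request in stateful mode.
	send func(ctx context.Context, method string) error
}

// callClient rejects server->client requests up front in stateless mode
// rather than blocking on a reply that will never arrive.
func (s *session) callClient(ctx context.Context, method string) error {
	if s.stateless {
		return fmt.Errorf("%s: %w", method, errServerRequestUnsupported)
	}
	return s.send(ctx, method)
}

// tryPing issues a ping through a session whose transport always succeeds.
func tryPing(stateless bool) error {
	s := &session{
		stateless: stateless,
		send:      func(context.Context, string) error { return nil },
	}
	return s.callClient(context.Background(), "ping")
}

func main() {
	fmt.Println(tryPing(false))
	fmt.Println(tryPing(true))
}
```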
I was thinking about the keepalive problem in the context of https://github.com/modelcontextprotocol/go-sdk/issues/258, where we're talking about how to clean up sessions. I suppose as long as the implementor of a session store is willing to take responsibility for this clean up, it could still work.
Nevertheless, I'm reluctant to be the first SDK to implement this functionality, because there's a risk that the spec formalizes some other mechanism for distribution. That's why I'm proposing that we put this on ice for a little while.
> We can perhaps make it an implementation restriction that any server->client calls fail in stateless or distributed mode. We certainly need to implement this for stateless mode. As long as they fail immediately with a meaningful error, that's OK.
This option sounds reasonable to me. If we could continue with the interface for sessions, I think it would be a reasonable compromise.
I am still a bit unsure whether the protocol specification itself will ever dictate how low-level state handling happens in the SDK; that's more of an implementation detail outside the protocol. Similar discussions have already taken place here as well: https://github.com/modelcontextprotocol/modelcontextprotocol/pull/1003#issuecomment-3120833130
👋🏻 Thanks for the update @findleyr.
> How can a distributed server make server->client requests, using the streamable transport? Since the subsequent POST containing the response may be routed to a different server, this doesn't seem possible.
I feel like this might be conflating two issues here, distributing session state so that it can be hydrated between requests, and coordinating requests & responses between servers.
I agree that it's not super clear how this would be accomplished from the spec level, but wouldn't this be mapped to the request ID of the server->client request. I.e. as part of a tool call, the server sends a Sampling request to a client with ID 123, client POSTs back with id 123 and this message is routed back to the server awaiting messages for this request ID.
I assume this is what you were thinking with the message bus suggestion, where requests coming into one server are placed on the message bus for a specific request/stream id and picked up by the server that has that stream open.
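The request-ID routing idea can be sketched with an in-process stand-in for the message bus. This is only an illustration: `bus`, `await`, and `deliver` are hypothetical names, and a distributed version would replace the map of channels with something like a pub/sub topic per request ID.

```go
package main

import (
	"fmt"
	"sync"
)

// bus routes responses to whoever is waiting on a given request ID.
type bus struct {
	mu      sync.Mutex
	waiters map[string]chan string
}

func newBus() *bus {
	return &bus{waiters: make(map[string]chan string)}
}

// await registers interest in the response to requestID. The server that
// sent the server->client request calls this before the client replies.
func (b *bus) await(requestID string) <-chan string {
	ch := make(chan string, 1) // buffered so deliver never blocks
	b.mu.Lock()
	b.waiters[requestID] = ch
	b.mu.Unlock()
	return ch
}

// deliver hands a response to the waiter for requestID and removes it.
// The server that received the client's POST calls this. It reports
// whether anyone was waiting.
func (b *bus) deliver(requestID, response string) bool {
	b.mu.Lock()
	ch, ok := b.waiters[requestID]
	delete(b.waiters, requestID)
	b.mu.Unlock()
	if ok {
		ch <- response
	}
	return ok
}

func main() {
	b := newBus()
	pending := b.await("123")
	fmt.Println(b.deliver("123", `{"id":123,"result":{}}`))
	fmt.Println(<-pending)
	fmt.Println(b.deliver("999", "orphan"))
}
```

The sketch leaves out timeouts and cleanup for requests whose response never arrives, which ties into the keepalive concern raised earlier in the thread.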
My initial assumption was that EventStore was going to be the request/response message bus to handle sending requests from a receiving server to a server awaiting a response, but looking deeper into that, that seems to be more related to SSE stream resumption?
> In the meantime, I think people who need distributed servers should focus on stateless mode (being added in #277).
For context, in the GitHub MCP server we are primarily using session state for observability, without session state there is no easy way to attribute subsequent requests/tool calls etc from a client within a chat session. The initialize request happens then everything else after that are tool calls/requests from a user with no grouping.
While we could use stateless mode for all our current tools/prompts/resources etc, this observability gap is what we're interested in addressing with this distributed session state.
Given the seemingly separate nature of these two problems, I believe that the distributed session state storage could still move forward independent of the request coordination issues to address this issue, while solving the request/response coordination problem separately.
From https://github.com/modelcontextprotocol/go-sdk/pull/232#issuecomment-3176787124, it was mentioned that you'd like to wait for clarity in the spec for this. Is that discussion happening anywhere publicly that we can follow?
Similar to @BC-ACherednichenko, I'm skeptical that the spec will dictate how session storage & recovery will work in particular, but happy to be proven wrong if that moves us forwards.
> For context, in the GitHub MCP server we are primarily using session state for observability, without session state there is no easy way to attribute subsequent requests/tool calls etc from a client within a chat session. The initialize request happens then everything else after that are tool calls/requests from a user with no grouping.
@omgitsads potentially stateless mode could be leveraged for your goals. We tried the following approach:
- via stateless mode you disable session ID validation inside the SDK
- then wrap func (*StreamableHTTPHandler) ServeHTTP(w http.ResponseWriter, req *http.Request) in a separate middleware around the main handler
- manage session state at the middleware level by manually extracting / attaching mcp-session-id to req/resp headers. It should be possible to sync it to Redis etc. as long as you are able to attach the session id to the headers
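The middleware part of that approach could look roughly like this sketch, which only observes the Mcp-Session-Id header and counts requests per session for attribution. All names (`sessionCounter`, `wrap`, `okHandler`, `send`) are hypothetical, and `okHandler` merely stands in for the SDK's streamable handler. As the rest of this comment explains, the SDK's own handling of the header was the obstacle at the time, so this shows the intent rather than a working integration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

const sessionHeader = "Mcp-Session-Id"

// sessionCounter records how many requests each session ID has made.
// A real deployment might write to Redis or a metrics system instead.
type sessionCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newSessionCounter() *sessionCounter {
	return &sessionCounter{counts: make(map[string]int)}
}

// wrap returns middleware that attributes each request to its session ID
// before passing it to the wrapped handler unchanged.
func (c *sessionCounter) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if id := r.Header.Get(sessionHeader); id != "" {
			c.mu.Lock()
			c.counts[id]++
			c.mu.Unlock()
		}
		next.ServeHTTP(w, r)
	})
}

// count reports the number of requests seen for a session ID.
func (c *sessionCounter) count(id string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[id]
}

// okHandler is a placeholder for the real MCP handler.
func okHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
}

// send simulates one request with the given session ID and returns the
// response status code.
func send(h http.Handler, sessionID string) int {
	req := httptest.NewRequest(http.MethodPost, "/mcp", nil)
	if sessionID != "" {
		req.Header.Set(sessionHeader, sessionID)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	c := newSessionCounter()
	h := c.wrap(okHandler())
	send(h, "session-1")
	send(h, "session-1")
	send(h, "session-2")
	fmt.Println(c.count("session-1"), c.count("session-2"))
}
```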
Nevertheless, right now stateless mode assumes that mcp-session-id is never passed. As a result, once you try to inject this header in the middleware, stateless mode is immediately dismissed (https://github.com/modelcontextprotocol/go-sdk/pull/277#issuecomment-3181460339) and you again run into the 404 session not found error. So in fact stateless mode cannot be leveraged for this purpose, since the SDK is very strict about how mcp-session-id is used.
Essentially I also came to the conclusion that if you want "spec compatible" request coordination in a distributed environment with multiple containers, mcp-session-id is basically the only tool at your disposal that the MCP protocol itself forces the client to preserve. No other custom headers are guaranteed to be echoed at the MCP protocol level, and since the SDK is quite prescriptive about both sessions and header manipulation, there are no options for extending the SDK to make it work for similar cases.
@omgitsads am I right that you also don't maintain a true persistent long-running connection in your case, so each subsequent request in a session may land on any container in the cluster? I really feel that most of the discrepancies come from the fact that the streamable HTTP implementation does not assume such a setup, although according to the protocol itself it may be optional, if I read it correctly.
Thanks @omgitsads and @BC-ACherednichenko -- I hear the need to preserve a session ID across an otherwise-stateless session.
Today, I'm going to prototype this end-to-end. I'm mostly concerned with ensuring that server->client requests fail immediately rather than hang waiting for a response they will never see.
Ok, in https://github.com/modelcontextprotocol/go-sdk/pull/307 I think I have a model that will work for a distributable server, if you don't care about state.
It adds a Stateless option in addition to GetSessionID. You may have a non-trivial session ID and still be in stateless mode. Furthermore, it turns off session validation in stateless mode, and fixes some bugs around initialization.
It still needs to be cleaned up, and more tests are needed, but I think it provides a workable solution for the primary use case described above: a stateless server that still has access to a non-empty session ID across requests.
Ok, with #307, I think there is a solution for people who want to implement distributed servers where the MCP session is stateless, but there may be state in the application, keyed by session ID. Therefore, I'll move this out of the release milestone: we may want to do more here, but IIUC the immediate need is addressed.