adk-go icon indicating copy to clipboard operation
adk-go copied to clipboard

mcptoolset: cached MCP session not validated, causes "connection closed" errors on subsequent requests

Open mikaelhatanpaa opened this issue 1 month ago • 8 comments

Describe the bug

The mcptoolset package caches the *mcp.ClientSession after the first connection and never validates if it's still alive before returning it. For HTTP/SSE-based MCP servers (using StreamableClientTransport), sessions can become stale due to server-side session timeouts, network interruptions, or idle connection cleanup.

When the cached session becomes stale, all subsequent requests fail with errors like:

connection closed: calling "tools/list": client is closing: standalone SSE stream: failed to reconnect (session ID: ...): connection failed after 5 attempts

The first request always works, but subsequent requests fail.

To Reproduce

  1. Install ADK: go get google.golang.org/[email protected]

  2. Create an MCP server using the Go MCP SDK with HTTP transport:

import mcpsdk "github.com/modelcontextprotocol/go-sdk/mcp"

srv := mcpsdk.NewServer(&mcpsdk.Implementation{Name: "test-server", Version: "1.0.0"}, nil)
handler := mcpsdk.NewStreamableHTTPHandler(func(r *http.Request) *mcpsdk.Server {
    return srv
}, &mcpsdk.StreamableHTTPOptions{
    SessionTimeout: 5 * time.Minute,
})
// Serve on http://localhost:8080/mcp
  1. Create an ADK agent that connects to this MCP server via HTTP:
transport := &mcp.StreamableClientTransport{
    Endpoint: "http://localhost:8080/mcp",
}
toolset, _ := mcptoolset.New(mcptoolset.Config{
    Transport: transport,
})
agent, _ := llmagent.New(llmagent.Config{
    Name:     "test-agent",
    Model:    model,
    Toolsets: []tool.Toolset{toolset},
})
  1. Send a request to the agent → works
  2. Wait a moment (or let the session timeout)
  3. Send another request → fails with "connection closed" error

Root cause

From tool/mcptoolset/set.go:

func (s *set) getSession(ctx context.Context) (*mcp.ClientSession, error) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.session != nil {
        return s.session, nil  // ← Always returns cached session without checking if still valid
    }
    session, err := s.client.Connect(ctx, s.transport, nil)
    // ...
    s.session = session
    return s.session, nil
}

Expected behavior

The mcptoolset should handle stale sessions gracefully by checking if the session is still valid before returning the cached session, and clearing it to trigger reconnection if it's stale.

Suggested fix

func (s *set) getSession(ctx context.Context) (*mcp.ClientSession, error) {
    s.mu.Lock()
    defer s.mu.Unlock()

    if s.session != nil {
        // TODO: Check if session is still valid before returning
        // isSessionValid() is to be implemented - could use Ping, check connection state, etc.
        // if !isSessionValid(s.session) {
        //     s.session.Close()
        //     s.session = nil
        // } else {
        //     return s.session, nil
        // }
        return s.session, nil
    }

    session, err := s.client.Connect(ctx, s.transport, nil)
    if err != nil {
        return nil, fmt.Errorf("failed to init MCP session: %w", err)
    }

    s.session = session
    return s.session, nil
}

Screenshots

N/A

Desktop (please complete the following information):

  • OS: macOS (Apple Silicon)
  • Go version: 1.23
  • ADK version: v0.2.0

Model Information:

  • gemini-2.5-flash (but the issue is transport-related, not model-related)

Additional context

  • This issue only affects HTTP-based MCP servers (using StreamableClientTransport).
  • Workaround: For same-process scenarios, using in-memory transport (mcp.NewInMemoryTransports()) avoids the issue entirely.

mikaelhatanpaa avatar Dec 05 '25 15:12 mikaelhatanpaa

@dpasiukevich , could you confirm if the proposed approach (checking session validity/clearing stale sessions) and let us know if is the right direction for the PR?

anish123subedi avatar Dec 06 '25 06:12 anish123subedi

I ran into the same issue over the weekend, and I believe the underlying problem may be that the context passed to connect is being cancelled. See PR https://github.com/google/adk-go/pull/406

@dpasiukevich could you have a look?

Nerja avatar Dec 08 '25 20:12 Nerja

Thanks for raising this!

I believe we should solve 2 issues here:

  1. As pointed in https://github.com/google/adk-go/pull/406 PR, the session is created using the request-scoped context passed to Tools(ctx).
  2. Server may close the connection for any other reason so we'd still need to verify if the session is valid.

dpasiukevich avatar Dec 11 '25 12:12 dpasiukevich

@dpasiukevich if you want, then I can add a Ping implementation to my PR as well or it can come as another PR. Up to you.

Nerja avatar Dec 11 '25 16:12 Nerja

@dpasiukevich I just realized that https://github.com/modelcontextprotocol/go-sdk/pull/655 will let us get the error code from a Ping request, which is important since Ping implementation is optional. To keep things moving, we could merge the current PR now and add Ping in a follow-up once 655 is released. What do you think?

Nerja avatar Dec 11 '25 22:12 Nerja

Yes, there's also https://github.com/modelcontextprotocol/go-sdk/pull/719 as well which exposes ErrSessionMissing.

And I see there's already exposed mcp.ErrConnectionClosed.

Folks WDYT if we'd follow this approach:

MCPToolset right now calls session.ListTools and session.CallTool. If they succeed -> we're good If err -> we check if it's mcp.ErrConnectionClosed -> recreate a session -> run ListTools/CallTool again. Maybe even to add some retry count to avoid potential loop when it will always recreate a session.

dpasiukevich avatar Dec 11 '25 22:12 dpasiukevich

I did some experiments. Seems like errors.Is(err, mcp.ErrConnectionClosed) is handling this (either when client's context is cancelled or server terminates the session).

dpasiukevich avatar Dec 12 '25 14:12 dpasiukevich

Hi @dpasiukevich — I dug a bit deeper into the issues:

Regarding (1), this is already addressed on the main branch of modelcontextprotocol/go-sdk (see PR #677). The fix is not yet released, and since we’re currently on v0.7.0, we’ll pick it up once a new version is published and we use it.

Regarding (2), This still requires an explicit health check (e.g., a Ping). I suggest following the simple Ping approach from google/adk-go#417.

Behavior once the new go-sdk version is released and used: The updated SDK will allow us to detect when Ping is not implemented (via the CodeMethodNotFound handling in the JSON-RPC layer) and skip reconnecting in that case. Until PR #677 is included in a release and used in adk-go, Ping will continue to fail due to issue (1), so we’ll reconnect. Once the fix is available and we update, we should stop reconnecting for that scenario.

If we go with that approach, then I think that we can skip https://github.com/google/adk-go/pull/406 and go with https://github.com/google/adk-go/pull/417 + updating the modelcontextprotocol/go-sdk dependency when a new version is out.

Let me know what you think. If you think that it sounds good, then I'll do some minor fixes in https://github.com/google/adk-go/pull/417.

A bit interesting issue 🙂

Nerja avatar Dec 12 '25 16:12 Nerja