ClientSession Error Handling

Open Unshure opened this issue 3 months ago • 2 comments

Question

I work on the Strands SDK, and we have an integration with MCP where we invoke tools through an MCP client.

Our design spins up a thread, creates an asyncio event loop, and schedules a task to create a session with an MCP server using the python-sdk mcp ClientSession: https://github.com/strands-agents/sdk-python/blob/main/src/strands/tools/mcp/mcp_client.py#L391

Then, when an LLM decides to call a tool, or multiple tools, we schedule additional tasks on this thread's event loop to call the mcp server: https://github.com/strands-agents/sdk-python/blob/main/src/strands/tools/mcp/mcp_client.py#L324-L330
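
For readers unfamiliar with this pattern, here is a minimal sketch of a background thread that owns an event loop and accepts work from other threads. It is not the Strands implementation; `BackgroundLoop` and its methods are hypothetical names used only for illustration. The `timeout` passed to `future.result` is an extra guard, so a call that never completes surfaces as a timeout on the calling thread instead of blocking it forever.

```python
import asyncio
import concurrent.futures
import threading
from typing import Any, Coroutine, Optional


class BackgroundLoop:
    """Hypothetical helper: runs an asyncio event loop on a dedicated thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        # The loop lives on this thread until stop() is called.
        asyncio.set_event_loop(self._loop)
        self._loop.run_forever()

    def call(self, coro: Coroutine[Any, Any, Any], timeout: Optional[float] = None) -> Any:
        """Schedule a coroutine on the background loop and wait for its result."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            # Ask the loop to cancel the task so it does not linger.
            future.cancel()
            raise

    def stop(self) -> None:
        """Stop the loop from the calling thread and release its resources."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()
```

A tool invocation could then be scheduled with something like `bg.call(session.call_tool(name, arguments), timeout=30)`, assuming `session` is a `ClientSession` that was created on the same loop.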

We ran into an issue: when the mcp streamablehttp_client sse_read_timeout was lower than the time a tool took to return, the tool invocation tasks would hang because an exception was not propagated out of the ClientSession. We see an exception message (included in the Additional Context section), but the stack trace never enters our code; it stops in the mcp code.

After investigating further, I found that this exception is sent through the MCP ClientSession message_handler, and in the default implementation of this message handler, an exception is never raised: https://github.com/modelcontextprotocol/python-sdk/blob/71889d7387f070cd872cab7c9aa3d1ff1fa5a5d2/src/mcp/client/session.py#L57-L60

To work around this issue, I have introduced my own message_handler to raise an exception passed to it: https://github.com/strands-agents/sdk-python/pull/922
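
A minimal sketch of what such a handler can look like (the actual change is in the PR linked above; this is only an illustration, and `raising_message_handler` is a hypothetical name). It assumes the session passes either a regular message or an `Exception` instance to the handler, which is what I observed. The message type is left as `Any` so the sketch does not depend on the SDK's type definitions.

```python
import asyncio
from typing import Any


async def raising_message_handler(message: Any) -> None:
    """Re-raise exceptions delivered to the handler; ignore all other messages."""
    if isinstance(message, Exception):
        # Surface transport errors (e.g. an SSE read timeout) instead of dropping them.
        raise message
    # Yield control once so other tasks on the loop can run.
    await asyncio.sleep(0)


# Assumed wiring, based on the ClientSession message_handler parameter:
#   ClientSession(read_stream, write_stream, message_handler=raising_message_handler)
```

Whether raising here unblocks pending tool calls or tears down the session may depend on the SDK version, so treat the wiring comment as an assumption rather than documented behavior.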

Why are these exceptions not raised in the MCP ClientSession? This silent failure took a long time to debug, and I was not able to find any documentation on this behavior. As a user of the ClientSession, I would expect these exceptions to be raised so that my code can handle them.

Additional Context

Error stack trace:

Error reading SSE stream:
Traceback (most recent call last):
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpx/_transports/default.py", line 271, in __aiter__
    async for part in self._httpcore_stream:
        yield part
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpcore/_async/connection_pool.py", line 407, in __aiter__
    raise exc from None
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpcore/_async/connection_pool.py", line 403, in __aiter__
    async for part in self._stream:
        yield part
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpcore/_async/http11.py", line 342, in __aiter__
    raise exc
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpcore/_async/http11.py", line 334, in __aiter__
    async for chunk in self._connection._receive_response_body(**kwargs):
        yield chunk
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpcore/_async/http11.py", line 203, in _receive_response_body
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpcore/_async/http11.py", line 217, in _receive_event
    data = await self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        self.READ_NUM_BYTES, timeout=timeout
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpcore/_backends/anyio.py", line 32, in read
    with map_exceptions(exc_map):
         ~~~~~~~~~~~~~~^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.13/3.13.5/Frameworks/Python.framework/Versions/3.13/lib/python3.13/contextlib.py", line 162, in __exit__
    self.gen.throw(value)
    ~~~~~~~~~~~~~~^^^^^^^
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ReadTimeout

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/mcp/client/streamable_http.py", line 326, in _handle_sse_response
    async for sse in event_source.aiter_sse():
    ...<10 lines>...
            break
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpx_sse/_api.py", line 42, in aiter_sse
    async for line in lines:
    ...<3 lines>...
            yield sse
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpx/_models.py", line 1031, in aiter_lines
    async for text in self.aiter_text():
        for line in decoder.decode(text):
            yield line
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpx/_models.py", line 1018, in aiter_text
    async for byte_content in self.aiter_bytes():
    ...<2 lines>...
            yield chunk
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpx/_models.py", line 997, in aiter_bytes
    async for raw_bytes in self.aiter_raw():
    ...<2 lines>...
            yield chunk
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpx/_models.py", line 1055, in aiter_raw
    async for raw_stream_bytes in self.stream:
    ...<2 lines>...
            yield chunk
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpx/_client.py", line 176, in __aiter__
    async for chunk in self._stream:
        yield chunk
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpx/_transports/default.py", line 270, in __aiter__
    with map_httpcore_exceptions():
         ~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/opt/homebrew/Cellar/python@3.13/3.13.5/Frameworks/Python.framework/Versions/3.13/lib/python3.13/contextlib.py", line 162, in __exit__
    self.gen.throw(value)
    ~~~~~~~~~~~~~~^^^^^^^
  File "/Users/ncclegg/Library/Application Support/hatch/env/virtual/strands-agents/X7vsTQrp/strands-agents/lib/python3.13/site-packages/httpx/_transports/default.py", line 118, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadTimeout

Unshure, Sep 25 '25 14:09

Thanks for this report - this seems like something we should fix to raise the exception for more predictable behavior.

felixweinberger, Oct 03 '25 14:10

This is a great issue to surface — from a broader perspective, the SDK could benefit from a more consistent error taxonomy and cross-layer behavior, not just in ClientSession.

A few related pain points:

  • Different layers (transports, sessions, FastMCP, low-level server) use different patterns:
    • some raise exceptions directly,
    • others wrap in ToolError or similar,
    • others convert to ErrorData with limited structure.
  • It’s hard for clients (and LLM agents) to distinguish:
    • transient / retryable errors (network, timeouts),
    • validation errors (bad input),
    • permission / auth problems,
    • “tool blew up internally” bugs.

It might be worth expanding this issue (or creating a related one) to define:

  1. A small error taxonomy, e.g.:

    • TransientError – retryable
    • ValidationError – bad input, non-retryable
    • PermissionError – authn/authz issues
    • NotFoundError, etc.

    and how they map into ErrorData.code / data.

  2. A consistent propagation rule:

    • At protocol boundaries, always return an ErrorData response rather than raising.
    • Preserve the original exception type and message in a structured way in ErrorData.data.
    • Ensure clients can reliably inspect the error category and decide whether to retry.

  3. A minimal error code registry:

    • A short list of documented error codes with clear semantics and retry guidance.
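
To make the three points above a bit more concrete, here is a rough sketch of how a category registry and a mapping into an ErrorData-shaped payload might look. Everything in it is an assumption for discussion: the names (`ErrorCategory`, `classify`, `to_error_payload`) are hypothetical, the payload is a plain dict with `code`, `message`, and `data` keys rather than the SDK's own type, and the numeric codes other than the standard JSON-RPC -32602 and -32603 are placeholders, not values defined by the SDK or the spec.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ErrorCategory:
    name: str
    code: int        # placeholder unless noted otherwise
    retryable: bool


# Hypothetical registry; only -32602 and -32603 are standard JSON-RPC codes.
TRANSIENT = ErrorCategory("transient", -32001, True)
VALIDATION = ErrorCategory("validation", -32602, False)  # JSON-RPC "Invalid params"
PERMISSION = ErrorCategory("permission", -32003, False)
NOT_FOUND = ErrorCategory("not_found", -32004, False)
INTERNAL = ErrorCategory("internal", -32603, False)      # JSON-RPC "Internal error"


def classify(exc: BaseException) -> ErrorCategory:
    """Map a Python exception to a category using built-in exception types."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return TRANSIENT
    if isinstance(exc, PermissionError):
        return PERMISSION
    if isinstance(exc, (FileNotFoundError, LookupError)):
        return NOT_FOUND
    if isinstance(exc, (ValueError, TypeError)):
        return VALIDATION
    return INTERNAL


def to_error_payload(exc: BaseException) -> Dict[str, Any]:
    """Build an ErrorData-shaped dict that keeps the original type and message."""
    category = classify(exc)
    return {
        "code": category.code,
        "message": str(exc) or type(exc).__name__,
        "data": {
            "category": category.name,
            "retryable": category.retryable,
            "exception_type": type(exc).__name__,
        },
    }
```

With something like this, a client could decide whether to retry by checking `payload["data"]["retryable"]` instead of parsing the message string.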

If maintainers are interested, I’d be happy to help sketch a concrete proposal for an error taxonomy and how it would map across server, client, and transports, building on the work you’re doing here for ClientSession.

dgenio, Nov 28 '25 14:11