python-sdk icon indicating copy to clipboard operation
python-sdk copied to clipboard

fix: handle internal notifications during session cleanup

Open donghao1393 opened this issue 1 year ago • 1 comments

fix: handle internal notifications during session cleanup

Addresses an issue where internal notifications (e.g. 'cancelled') during session cleanup would trigger validation errors and crash the server.

Changes

  • Add session state tracking via _closed flag
  • Skip validation for internal notifications during cleanup phase
  • Improve progress context cleanup with proper async context manager
  • Add final_progress notification to ensure completion state is reported

Testing

Successfully tested with multiple concurrent requests with progress tracking:

  • Cleanup proceeds without validation errors
  • Progress notifications complete properly
  • No server crashes during shutdown

donghao1393 avatar Dec 02 '24 17:12 donghao1393

Hi @dsp-ant , thank you for your advice. I have made commit. Would you help to check again?

donghao1393 avatar Dec 17 '24 12:12 donghao1393

Technical Details

  • The issue occurred when internal notifications were received during cleanup
  • Added _closed flag to track session state accurately
  • Modified notification handling to be cleanup-aware
  • Ensures graceful shutdown even with pending notifications

Impact

  • Prevents validation errors during shutdown
  • Maintains session stability during cleanup
  • No interference with normal operation
  • Compatible with existing notification patterns

This approach keeps the fix focused on the core issue while maintaining backward compatibility.

donghao1393 avatar Jan 03 '25 16:01 donghao1393

@dsp-ant @donghao1393 - if I may suggest a more straight-forward fix for this inline with codebase patterns? - https://github.com/txbm/python-sdk/commit/19185626d57fdd670219784d10859c7d0a302c2e

txbm avatar Jan 06 '25 06:01 txbm

@dsp-ant @donghao1393 - if I may suggest a more straight-forward fix for this inline with codebase patterns? - txbm@1918562

I have tested this commit on my local and it's working. I think this is a better solution. It fixed type definition for cancelled notifications to match protocol implementation. This allows proper handling of cancellation messages without requiring changes to the notification processing logic. What do you think @dsp-ant ?

donghao1393 avatar Jan 06 '25 15:01 donghao1393

Please can you merge this fix in, it's breaking my server on claude desktop app - thank you!

orliesaurus avatar Jan 07 '25 07:01 orliesaurus

Hey @donghao1393 - It seems to me that this can be simplified a lot now that there is the canceled notification types.

I'm going to be doing a loop on notifications in this sdk soon, and for my understanding (and to write some tests) I'd love it if you could also provide an example that reproduces this issue? would it just be a server with a long running task and the client sending of a cancel notification?

Yes, you're right. A reproduction case would be a server with a long-running task (like image generation) that typically takes more than a few seconds, and the client sending a cancel notification during the process. I can provide a minimal example:

  1. add this of my mcp service
  2. A client that initiates the task and sends a cancel notification, like:
    • 1st prompt: "create an image using dall-e model to depict a dog standing on the moon watching the earth"
    • 2nd prompt (in the same conversation): "create another image that the dog was running"
    • expected: the conversation should return both 2 images
    • problem: if the cancel notification made the service down in 1st prompt, the 2nd prompt would be failed to send request
  3. inspect the log file mcp-server-openai.log so that we can observe the cancellation behavior

This would help validate both the current behavior and any proposed changes.

donghao1393 avatar Jan 18 '25 16:01 donghao1393

Fixed on my side using this patch + extra glue on MCP 1.2.0. We need the cancellation notification. Because some time Claude is a bit wild and I noticed if I press cancel and a tool is triggered, it not working. We need to implement properly cancellation that would stop the tool in a graceful way and don't crash server.

Mehdi-Bl avatar Jan 20 '25 21:01 Mehdi-Bl

@dsp-ant @donghao1393 - if I may suggest a more straight-forward fix for this inline with codebase patterns? - txbm@1918562

I'm going to roll with this until the issue is closed, thanks @txbm!

EduPonz avatar Jan 29 '25 15:01 EduPonz

Hi @jerome3o-anthropic , I have updated accordingly and passed my test. Would you help to take a review for the next?

donghao1393 avatar Feb 05 '25 17:02 donghao1393

I am curious if you could merge this? @dsp-ant Maybe it will fix my issues with The read operation timed out

orliesaurus avatar Feb 11 '25 18:02 orliesaurus

It seem the issue is fixed, either Claude Desktop update or using the latest main branch. As already I see cancelled got added as valid type, so should no more throw panic error and crash.

Mehdi-Bl avatar Feb 11 '25 19:02 Mehdi-Bl

It seem the issue is fixed, either Claude Desktop update or using the latest main branch. As already I see cancelled got added as valid type, so should no more throw panic error and crash.

I have tested the latest SDK without this PR, but it failed in my case.

donghao1393 avatar Feb 11 '25 19:02 donghao1393

I have used the main branch. It's not in 1.2.1. For me this seem solving it. https://github.com/modelcontextprotocol/python-sdk/blob/f10665db4c2f676da1131617ad67715952258712/src/mcp/types.py#L995 I testing 1 min task withtout issue when I moved using the new main branch. Previously would crash. I had even similar patch as your and removed it. Can you test the main branch not 1.2.1.

Mehdi-Bl avatar Feb 11 '25 20:02 Mehdi-Bl

I have used the main branch. It's not in 1.2.1. For me this seem solving it.

https://github.com/modelcontextprotocol/python-sdk/blob/f10665db4c2f676da1131617ad67715952258712/src/mcp/types.py#L995

I testing 1 min task withtout issue when I moved using the new main branch. Previously would crash. I had even similar patch as your and removed it. Can you test the main branch not 1.2.1.

Thank you for your advice. I tried with

# disable old mcp sdk
uv add git+https://github.com/modelcontextprotocol/python-sdk --branch main
# restart claude desktop

And then it works.

donghao1393 avatar Feb 11 '25 20:02 donghao1393

I have used the main branch. It's not in 1.2.1. For me this seem solving it.

https://github.com/modelcontextprotocol/python-sdk/blob/f10665db4c2f676da1131617ad67715952258712/src/mcp/types.py#L995

I testing 1 min task withtout issue when I moved using the new main branch. Previously would crash. I had even similar patch as your and removed it. Can you test the main branch not 1.2.1.

if it's not in 1.2.1 is it going to be out in 1.2.2 or what's the logic you think? EDIT: Are they going straight to 1.3?

orliesaurus avatar Feb 12 '25 00:02 orliesaurus

I'm not in the team. So @dsp-ant or @jerome3o-anthropic may help here. But this issue was a bigfpain for using MCP. When MCP crash mid conversation, you have to close Claude Desktop. And restart that step, or wait for it to timeout to let it see the issues. Some time it in the middle of modifications. Aside from the fact that some commands are impossible to run or blow out randomly when they reach the 10s. I rewrote my MCP more than once thinking the issue on my side. So it should had been higher priority and released in 1.2. or 1.2.1. I feel MCP SDK will remain not mature in a while as it's an ambitious angoing work.

Mehdi-Bl avatar Feb 12 '25 16:02 Mehdi-Bl