websockets icon indicating copy to clipboard operation
websockets copied to clipboard

Sometimes ClientConnection.send() never returns with the asyncio client

Open parmentelat opened this issue 1 year ago • 8 comments

hiya there

some background for starters recently as part of an upgrade to fedora41 I have come across release 14 of websockets so I have taken that chance to upgrade my code to use this new API turns out it is used in a monitoring process, that acts as a ws client to push its results upstream; and I am now facing a nasty sneaky issue with that, because after some unpredicatable amount of time - typically hours - the monitor stops looking at some nodes in a seemingly random way; that is to say, its focus shrinks overtime, and it reports the status of fewer and fewer nodes.

admittedly there have been many moving pieces since last time where this code was working fine however after a week patiently adding debugging instructions, I have been able to nail it down a bit and I have evidence that in my application, it is the call to websockets.asyncio.client.ClientConnection.send() that essentially never returns; like in, never ever even after hours specifically I am using 14.1, and the websockets.asyncio.client flavour

so, that's all I have as hard evidence for now I am still miles away from being able to reproduce on a simple example - I can't even do that reliably in my complex app, I only know to wait for the bug to manifest itself...

at this point I have opted for rolling back to using websockets.legacy instead, still with 14.1 and so far this has been performing all right for a full day, so I'm optimistic my issue is fixed with that older API

other than that: as far as I can tell - but that's probably not that far - the first suspect could be this pair of lines here

https://github.com/python-websockets/websockets/blob/0403823185b272ae389da1c9182773932b4df950/src/websockets/asyncio/connection.py#L451-L452

but of course I don't know that for sure, I reckon this is just a wild guess

so, that's where I am at right now, I thought this was still worth reporting, so here I am

thanks for any hint or insight wrt to this situation

parmentelat avatar Dec 08 '24 16:12 parmentelat

Oh, how annoying :-(

Let's rule out unlikely causes:

  1. You aren't sending fragmented messages i.e. calling send() with an iterable or async iterable, are you?
  2. You kept close_timeout set to the default value or set a reasonable value, didn't you?
  3. You aren't sending large enough amounts of data to create a risk of clogging the correction, are you?

Then, if I had to bet:

  1. async with self.send_context(): ... could be blocking during the exit of the context manager. In that case, when sending a text message, this line would execute but you'd never exit the context manager and jump to the next line.
  2. You could be losing the TCP connection in a way that websockets doesn't handle well. In order to prevent leaking TCP connection, websockets waits for the TCP connection to be lost before returning. It took a long time to iron out all potential issues in the original implementation; I wouldn't be surprised if the rewrite had some bugs; the I/O layer is brand new and more difficult to get right than the Sans-I/O layer.

aaugustin avatar Dec 08 '24 21:12 aaugustin

Thank you for reporting the issue BTW.

aaugustin avatar Dec 08 '24 21:12 aaugustin

hiya there again

many thanks for the quick feedback :)

none of the 3 unlikely causes seem to apply to me; I pass send() a good-old str object; I don't mess with the timeouts; my chunks are pretty small, in the order of 1k chars or similar between the 2 other I could not say which is the most likely in my case; it feels however like my issues might be somehow related to the server end being restarted in some way, I have no hard evidence about that but it's possible

I'm unsure how to go about helping further, insofar as I have kind of several times overburnt my anticipated time budget on this topic :) the service based on legacy has been running for 36 hours now without a glitch, so I guess for now I'm going to stay put and leave things as they are

let me know if you do any potential progress around this issue, I will try to make it simple for me to try out any future version down the line

thanks again for the insights, much appreciated !

parmentelat avatar Dec 09 '24 09:12 parmentelat

Thank you for the information you provided. I'm going to look into it. Probably this will get ironed out long before the deprecation of the legacy implementation completes (5 years). I'm sorry for the rocky migration experience.

aaugustin avatar Dec 09 '24 17:12 aaugustin

There a small chance that this is a different symptom of #1555, although I don't have a full theory supporting that.

aaugustin avatar Feb 08 '25 12:02 aaugustin

thanks for the heads up - I'm of course still interested in any further progress :)

parmentelat avatar Feb 08 '25 12:02 parmentelat

I'm not sure if this is part of the problem, but I noticed that in the legacy version, WebSocketCommonProtocol.send first checks whether the connection state is still open before performing type matching, whereas Connection.send does not (could check self.protocol.state, right?). That said, I might be completely wrong here, so feel free to disregard this if I’ve missed some underlying implementation detail.

Electron4444 avatar Jul 01 '25 11:07 Electron4444

Protocol state is checked at a lower level, in the Sans-I/O layer, specifically here:

https://github.com/python-websockets/websockets/blob/fc7cafea01ebe3ec1738160ac62889fd80653255/src/websockets/protocol.py#L315-L316

aaugustin avatar Jul 01 '25 11:07 aaugustin