flask-uwsgi-websocket
Remove `handler.join` call from closed gevent WebSocket tidyup
It's possible that there's more to do here, but I'm unclear on the overall picture. Concretely, I've found that as currently implemented in master, over time all greenlets become tied up with closed connections. That is: if I run uWSGI with two greenlets, open a websocket, close it, open another, close it, I can't open a third: the async queue is full.
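For reference, a minimal setup of the kind I'm describing (a sketch only: the echo route is made up, and I'm assuming the `GeventWebSocket` class and the `gevent=` argument to `app.run` shown in this project's README):

```python
from flask import Flask
from flask_uwsgi_websocket import GeventWebSocket

app = Flask(__name__)
websocket = GeventWebSocket(app)

@websocket.route('/echo')
def echo(ws):
    # Naive handler: once the client disconnects, receive() blocks forever
    # and this greenlet is never returned to the pool.
    while True:
        msg = ws.receive()
        ws.send(msg)

if __name__ == '__main__':
    # With two async cores, two connect/disconnect cycles exhaust the pool:
    # a third client can't connect because the async queue is full.
    app.run(gevent=2)
```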
I've identified this line as where things get stuck: the `handler.join` call simply never returns. As a consequence, we never reach the following `return ''`, which means that the greenlet servicing the WebSocket is never released, and so over time all of one's greenlets become tied up with dead connections.

Now, I'm unclear precisely what the `join` is doing: even with the most trivial test case I can devise, it just never gets hit, so I'm not sure what the consequences of removing it are. I assume that the idea is to ensure that the handler greenlet doesn't float about like a little green zombie? Perhaps an alternative would be to call `handler.kill`?
In any case, would love some guidance on what ought to be done to ultimately remove this line: I'm pretty certain that the problem I'm running into is a general one.
OK, I looked at this some more and thought about it some more and understood the problem: concretely, unless you manually do

```python
if not connection.connected:
    break
```

in the `while True` loop in your handler, `join` will inevitably never return, because, well, you'll never break out of the loop.
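Spelled out in a full handler (same assumed API as the sketch above), that looks something like:

```python
@websocket.route('/echo')
def echo(ws):
    while True:
        if not ws.connected:  # bail out once the client has gone away,
            break             # so join() on this greenlet can return
        msg = ws.receive()
        if msg is not None:
            ws.send(msg)
```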
Now, `handler.kill` is kinda gross, because it means that if someone has some sort of cleanup code that they want to have executed after breaking out of the while loop, then that code won't get executed... but it seems like it is always going to be problematic to guarantee that any such code will be executed, so I'm not sure that should be a concern: I mean, we could set a timeout on the `join` call, but then even if you were manually checking for `connection.connected` and then breaking out to hit the cleanup code, it is still possible that you wouldn't reach the cleanup section, so there's no guarantee there.
So I think that (calling `handler.kill`) would be the best approach. If, for reasons I'm not aware of, you'd prefer not to change this line, then it would be a very good idea to add something to the readme that makes it clear that you need to continually check the status of the connection, lest your greenlets be used up on closed connections...
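One middle ground, sketched below, would be a bounded join followed by a kill. This is purely an illustration of the idea, not the actual code in `_gevent.py` (the `tidy_up` name and the timeout value are made up):

```python
import gevent

def tidy_up(handler, timeout=5):
    # Hypothetical tidy-up: give the handler greenlet a bounded window to
    # exit on its own, then forcibly reap it so the servicing greenlet is
    # always released back to the async queue.
    handler.join(timeout=timeout)  # gevent.Greenlet.join accepts a timeout
    if not handler.dead:           # the handler never broke out of its loop
        handler.kill(block=True)   # raises GreenletExit inside the handler
    return ''

# For example, reaping a handler that would otherwise block for an hour:
g = gevent.spawn(gevent.sleep, 3600)
tidy_up(g, timeout=0.1)
print(g.dead)  # True
```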
I'm still not sure I understand the consequences of removing this, and am considering other alternate solutions to allow for more robust handling of things. I kind of like manually checking `if not connection.connected`, as it's very explicit and allows you a lot of flexibility about how to handle that situation.
What I do in most of the examples is wait for `None` to get added to the queue, which I use as a sentinel:
```python
while True:
    msg = ws.receive()
    if msg is None:  # sentinel: a None is put on the receive queue on close
        break
```
I'm surprised `Greenlet.join` just hangs forever; it should only hang as long as `client.timeout`.
At any rate I need to play around with things a bit, thanks for bringing this to my attention and documenting your problem :)
This isn't the easiest project to test, sadly, but if you were interested in setting up a test case for your problem that would be amazingly helpful.
Sure, I'll have a think about it. FWIW, I also discovered that checking `connection.connected` in the application-level code is not entirely sufficient, because the following sequence of events can happen:
- Within the handler greenlet, we call `msg = connection.receive()`. This blocks execution on the handler greenlet until there's a message in the queue, and yields to allow other greenlets to make progress.
- At this point, we resume execution at `ready = wait([handler, send_event, recv_event], None, 1)`.
- We attempt to send a message. We get an `IOError`, so we set `connection.connected` to `False`. At this point, since the handler greenlet yielded because it was waiting for a message, and we were just trying to send a message, it is still going to be blocked: there isn't a message for it to get on the queue.
- We go back to the top of the `while True` loop, notice that the connection isn't connected, and `join` the handler greenlet. But that handler greenlet is stuck waiting for a `receive` event that will never come, because we've killed the listener thread (and even if we hadn't, it's never going to receive a message, since the connection is closed). And so we never return a value :(.
I think this is a correct summation, but I could be missing something...
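To convince myself of the mechanics, here's a minimal standalone sketch of the same trap, using nothing but gevent (the names are made up; this is not the library's code):

```python
import gevent
from gevent.queue import Queue

recv_queue = Queue()

def handler():
    # Stands in for a user's websocket loop: blocks on an empty queue,
    # exactly like a receive() on a connection that has gone away.
    while True:
        msg = recv_queue.get()   # blocks forever: nothing feeds the queue
        if msg is None:
            break

g = gevent.spawn(handler)
gevent.sleep(0)                  # let the handler run until it blocks
g.join(timeout=2)                # an untimed join() here would never return
print('handler dead?', g.dead)   # -> handler dead? False
```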
Based on this, it is possible that there's a way to judiciously use `yield` in `_gevent.py` to avoid this... but I'm not 100% sure right now: there might be additional similar edge cases that get us into trouble that I haven't thought of yet.
More generally, even if there's a solution which prevents this happening when making calls to methods defined in this library, it's still going to be possible for users to write code which blocks indefinitely. Of course, one might reasonably say that users ought not to do so, but I suspect that many users might not fully grasp the implications of potentially unbounded blocking inside a `while True` loop: they might (perhaps irrationally) actually expect the loop to be forcibly disrupted in the event of a broken connection.
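If nothing changes in the library, one defensive pattern at the application level would be to bound every blocking call yourself. A sketch (the 30-second figure and the `ws.connected` attribute are assumptions, per the discussion above):

```python
import gevent

def guarded_handler(ws):
    while True:
        try:
            with gevent.Timeout(30):  # never block on receive indefinitely
                msg = ws.receive()
        except gevent.Timeout:
            if not ws.connected:      # connection died while we waited
                break
            continue                  # still connected: keep listening
        if msg is None:               # sentinel enqueued on close
            break
```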
All that being said, you're right that the timeout ought to mitigate against this, so maybe I'm going crazy and it is something else. And of course, you do put a `None` on the receive queue, so I suppose my summation can't be quite right, either...
All the more reason we need a test suite to verify anything works as expected :)