
Remove `handler.join` call from closed gevent WebSocket tidyup

Open · matthewowen opened this issue on Jan 20 '15 · 8 comments


It's possible that there's more to do here, but I'm unclear on the overall picture. Concretely, I've found that as currently implemented in master, over time all greenlets become tied up with closed connections. That is: if I run uWSGI with two greenlets, open a websocket, close it, open another, close it, I can't open a third: the async queue is full.

I've identified this line as where things get stuck: the `handler.join` call simply never returns. As a consequence, we never reach the following `return ''`, which means the greenlet servicing the WebSocket is never released, and so over time all of one's greenlets become tied up with dead connections.

Now, I'm unclear precisely what the join is doing: even with the most trivial test case I can devise, it simply never completes, so I'm not sure what the consequences of removing it would be. I assume the idea is to ensure that the handler greenlet doesn't float about like a little green zombie? Perhaps an alternative would be to call `handler.kill`?

In any case, would love some guidance on what ought to be done to ultimately remove this line: I'm pretty certain that the problem I'm running into is a general one.
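
To make the join/kill difference concrete, here's a tiny gevent-only sketch (nothing from this library, just how I understand the underlying gevent behaviour):

    # gevent-only sketch: joining a greenlet that blocks forever never
    # returns, whereas kill() tears it down immediately.
    import gevent
    from gevent.event import Event

    def handler():
        Event().wait()  # blocks forever, like receive() on a dead connection

    g = gevent.spawn(handler)
    gevent.sleep(0)          # let the handler start and park itself

    g.join(timeout=1)        # returns after the timeout, greenlet still alive
    print('dead after join with timeout?', g.dead)   # False

    g.kill()                 # raises GreenletExit inside the handler
    print('dead after kill?', g.dead)                # True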

matthewowen · Jan 20 '15 21:01

OK, I looked at this some more, thought about it some more, and I think I understand the problem now: unless you manually do

if not connection.connected:
    break

in the `while True` loop in your handler, `join` will inevitably never return, because you never break out of the loop.

Now, `handler.kill` is kinda gross, because if someone has cleanup code that they want executed after breaking out of the while loop, that code won't run. But it seems like it will always be hard to guarantee that such code runs, so I'm not sure that should be a concern: we could set a timeout on the join call, but even if you were manually checking `connection.connected` and breaking out to reach the cleanup code, it's still possible you wouldn't get there, so there's no guarantee either way.

So I think calling `handler.kill` would be the best approach. If for reasons I'm not aware of you'd prefer not to change this line, then it would be a very good idea to add something to the readme making it clear that you need to continually check the status of the connection, lest your greenlets be used up on closed connections.
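
For concreteness, here's roughly what a handler with the manual check (and some cleanup afterwards) would look like; the route setup follows the style of the project's README examples, and I'm assuming the connection object handed to the handler is the same one that exposes `connected`:

    from flask import Flask
    from flask_uwsgi_websocket import GeventWebSocket

    app = Flask(__name__)
    websocket = GeventWebSocket(app)

    @websocket.route('/echo')
    def echo(ws):
        while True:
            if not ws.connected:   # bail out once the client has gone away
                break
            msg = ws.receive()
            if msg is not None:
                ws.send(msg)
        # only reached if we actually break out of the loop; this is exactly
        # the kind of cleanup an unconditional handler.kill() would skip
        print('connection closed, cleaning up')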

matthewowen · Jan 21 '15 00:01

I'm still not sure I understand the consequences of removing this, and am considering other solutions to allow for more robust handling of things. I kind of like manually checking `if not connection.connected`, as it's very explicit and gives you a lot of flexibility in how to handle that situation.

zeekay · Jan 24 '15 06:01

What I do in most of the examples is wait for None to get added to the queue, which I use as a sentinel.

while True:
    msg = ws.receive()
    if msg is None:
        break

I'm surprised `Greenlet.join` just hangs forever; it should only block for as long as `client.timeout`.

At any rate I need to play around with things a bit, thanks for bringing this to my attention and documenting your problem :)

zeekay · Jan 24 '15 07:01

This isn't the easiest project to test, sadly, but if you were interested in setting up a test case for your problem that would be amazingly helpful.

zeekay · Jan 24 '15 07:01

Sure, I'll have a think about it. FWIW, I also discovered that checking `connection.connected` in the application-level code is not entirely sufficient, because the following sequence of events can happen:

  1. Within the handler greenlet, we call `msg = connection.receive()`. This blocks the handler greenlet until there's a message in the queue, yielding so that other greenlets can make progress.
  2. At this point, we resume execution at ready = wait([handler, send_event, recv_event], None, 1).
  3. We attempt to send a message. We get an IOError, so we set `connection.connected` to False. Since the handler greenlet yielded because it was waiting for a message, and we were only trying to send one, it is still blocked: there isn't a message for it to get off the queue.
  4. We go back to the top of the while True loop, notice that the connection isn't connected, and join to the handler greenlet. But that handler greenlet is stuck waiting for a receive event that will never come, because we've killed the listener thread (and even if we hadn't, it's never going to receive a message, since the connection is closed). And so we never return a value :(.

I think this is a correct summation, but I could be missing something...
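
As a sanity check, here's a self-contained gevent sketch of that sequence with all the connection machinery stripped out (hypothetical names, but it shows why the join can never finish):

    import gevent
    from gevent.queue import Queue

    recv_queue = Queue()   # stands in for the per-connection receive queue

    def handler():
        msg = recv_queue.get()   # step 1: blocks waiting for a message

    h = gevent.spawn(handler)
    gevent.sleep(0)              # the handler is now parked inside get()

    # steps 2-3: the supervisor notices the send failed and marks the
    # connection closed, but never puts anything (not even None) on the queue

    # step 4: joining the handler blocks forever, because nothing will ever
    # wake it up; with a timeout the join returns but the greenlet lives on
    h.join(timeout=1)
    print('handler still alive?', not h.dead)   # True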

Based on this, it is possible that there's a way to judiciously use yield in _gevent.py to avoid this... but I'm not 100% sure right now: there might be additional similar edge cases that get us into trouble that I haven't thought of yet.

More generally, even if there's a solution which prevents this when calling methods defined in this library, it will still be possible for users to write code that blocks forever. One might reasonably say that users ought not to do so, but I suspect many users won't fully grasp the implications of potentially infinitely blocking code in a `while True` loop: they might (perhaps irrationally) expect the loop to be forcibly interrupted when the connection breaks.
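
For what it's worth, one defensive pattern application code could adopt (this isn't something the library provides, and it assumes `receive()` cooperates with gevent timeouts) is to bound each blocking call so the loop always gets a chance to re-check the connection:

    import gevent
    from flask import Flask
    from flask_uwsgi_websocket import GeventWebSocket

    app = Flask(__name__)
    websocket = GeventWebSocket(app)

    @websocket.route('/guarded')          # hypothetical route, for illustration
    def guarded(ws):
        while ws.connected:
            try:
                with gevent.Timeout(5):   # give receive() five seconds at most
                    msg = ws.receive()
            except gevent.Timeout:
                continue                  # nothing arrived; loop and re-check
            if msg is None:
                break                     # sentinel: the connection was closed
            ws.send(msg)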

matthewowen · Jan 26 '15 23:01

All that being said, you're right that the timeout ought to mitigate against this, so maybe I'm going crazy and it's something else. And of course, you do put a None on the receive queue, so I suppose that can't be quite right, either...

matthewowen · Jan 26 '15 23:01

All the more reason we need a test suite to verify anything works as expected :)

zeekay · Jan 27 '15 12:01