channels_redis
"took too long to disconnect" issue is back
With no deployment or code changes this just started happening today:
Looking for more logs because I know this isn't helpful as-is; sorry!
Was working fine all morning, then it just started spewing tons of "took too long to disconnect" errors; more stats:
Also seeing a ton of "Exception inside application: Lock is not acquired." at the same time
Lock is not acquired errors are likely happening because you hit the application close timeout and are being force-killed. I'd look into why the app isn't closing.
Last time this state happened it was something in the redis db itself, as a `flushdb` fixed it--working on getting a dump of the db and will then `flushdb` and see if that fixes things; if so will try to figure out why.

`flushdb` did indeed fix the issue; I'm looking through the dump now for clues.
OK, one weird thing is that my `zset`s for groups are huge, meaning they are not ever being expired... e.g. "asgi::group:account.<censored>". I'd expect channels_redis to fall over if these numbers get too large, which happened earlier today.
I wrote a script to generate these stats:
```python
#!/usr/bin/env python
import json

import jsonlines


def main():
    # production.json is generated with redis-dump (`gem install redis-dump`)
    with jsonlines.open('./production.json') as reader:
        for line in reader:
            if line['key'].startswith('asgi::group:account.'):
                print(f'asgi::group:account.<censored>: {len(line["value"])}')
            elif line['key'].startswith('asgi::group:organization.'):
                print(f'asgi::group:organization.<censored>: {len(line["value"])}')
            elif line['key'].startswith('asgi:specific.'):
                print(f'asgi::specific.<censored>: {len(line["value"])}')
            else:
                print(json.dumps(line, indent=2))


if __name__ == '__main__':
    main()
```
```
asgi::group:account.<censored>: 107
asgi::group:account.<censored>: 127
asgi::group:account.<censored>: 127
asgi::group:account.<censored>: 135
asgi::group:account.<censored>: 136
asgi::group:account.<censored>: 14
asgi::group:account.<censored>: 140
asgi::group:account.<censored>: 174
asgi::group:account.<censored>: 188
asgi::group:account.<censored>: 19
asgi::group:account.<censored>: 190
asgi::group:account.<censored>: 193
asgi::group:account.<censored>: 197
asgi::group:account.<censored>: 198
asgi::group:account.<censored>: 23
asgi::group:account.<censored>: 26
asgi::group:account.<censored>: 266
asgi::group:account.<censored>: 269
asgi::group:account.<censored>: 327
asgi::group:account.<censored>: 4
asgi::group:account.<censored>: 442
asgi::group:account.<censored>: 49
asgi::group:account.<censored>: 49
asgi::group:account.<censored>: 54
asgi::group:account.<censored>: 654
asgi::group:account.<censored>: 72
asgi::group:account.<censored>: 90
asgi::group:organization.<censored>: 125
asgi::group:organization.<censored>: 1681
asgi::group:organization.<censored>: 2
asgi::group:organization.<censored>: 2017
asgi::group:organization.<censored>: 90
asgi::specific.<censored>: 100
asgi::specific.<censored>: 102
```
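(For live inspection without a dump, here's a minimal sketch using redis-py; the connection details and key pattern are illustrative and assume the `asgi::group:` prefix seen above:)

```python
# Sketch: inspect group zset sizes live with redis-py instead of a dump.
# Assumes `pip install redis`; host/port and key pattern are illustrative.
import redis

r = redis.Redis(host='localhost', port=6379)

for key in r.scan_iter(match='asgi::group:*'):
    # Each group key is a sorted set of channel names; ZCARD is its size.
    print(f'{key.decode()}: {r.zcard(key)}')
```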
Well, group membership takes a day to expire by default, but the `discard` handler should be removing these (unless something is locking up your application on disconnect, which it sounds like is happening) - so this might be a symptom, not a cause?
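(That one-day default is the channel layer's `group_expiry` option; a minimal sketch of tuning it in Django settings, with illustrative host values:)

```python
# settings.py -- sketch; group_expiry is the relevant knob here.
CHANNEL_LAYERS = {
    'default': {
        'BACKEND': 'channels_redis.core.RedisChannelLayer',
        'CONFIG': {
            'hosts': [('localhost', 6379)],  # illustrative
            # Seconds before an un-refreshed group membership expires;
            # the channels_redis default is 86400 (one day).
            'group_expiry': 86400,
        },
    },
}
```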
This is my `websocket_disconnect` handler:
```python
def websocket_disconnect(self, message):
    logger.info(f'{self.account_key}: websocket_disconnect called\nmessage: {message}')

    if hasattr(self, 'presence_groups'):
        for group in self.presence_groups:
            PresenceGroup.objects.remove(group, self.channel_name)

    raise StopConsumer()
```
I'm using a custom implementation of `PresenceGroup` that keeps them in the database. I don't see any exceptions occurring from this code, and the only thing that could cause it to take longer than ~500ms is a deadlock, which I also don't see evidence of... I'm adding additional logging though, and I'll have to audit that code and look at how the discard handler works in vanilla channels.
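(For reference, the vanilla handler is roughly the following -- paraphrased from channels 2.x's `AsyncWebsocketConsumer`, so check your installed version for the authoritative code:)

```python
from channels.exceptions import StopConsumer


async def websocket_disconnect(self, message):
    # Discard this channel from every group it joined on connect,
    # then call the user disconnect hook and halt the consumer.
    for group in self.groups:
        await self.channel_layer.group_discard(group, self.channel_name)
    await self.disconnect(message['code'])
    raise StopConsumer()
```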
(Also noting that after the `flushdb` yesterday I hit this again this morning and it was fixed by another `flushdb`.)
I can't see anything obvious there - even if you had an exception that raised and skipped over the `StopConsumer` it would still halt the coroutine and log to console. Hm. It's especially weird that Redis is blocking this up since you don't appear to be talking to Redis inside it.
I do call `async_to_sync(group_discard)()` inside that call to `remove()`...
Ah, curious, calling async inside sync inside async. That would make me a little suspicious.
To be precise - when you call sync code, it uses a fixed-size threadpool, and when you then call async code, it blocks that thread while the async loop does its thing. May I suggest setting the environment variable `ASGI_THREADS` to something like `1000` or `2` and seeing if that makes it better or worse?
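(A minimal sketch of the failure mode described here, with a deliberately tiny executor standing in for the `ASGI_THREADS`-sized pool -- all names are illustrative:)

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

from asgiref.sync import async_to_sync

POOL = ThreadPoolExecutor(max_workers=2)  # stand-in for the ASGI_THREADS pool


async def slow_layer_op():
    await asyncio.sleep(1)  # stand-in for a Redis round trip


def sync_disconnect_handler():
    # Blocks its pool thread for the full duration of the coroutine.
    async_to_sync(slow_layer_op)()


async def main():
    loop = asyncio.get_running_loop()
    # Ten "disconnects" contend for two threads; the queue backs up, and in
    # a real server the slowest ones would hit the application close timeout.
    await asyncio.gather(
        *(loop.run_in_executor(POOL, sync_disconnect_handler) for _ in range(10))
    )


asyncio.run(main())
```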
hmm interesting! still wrapping my head around all of the idiosyncrasies of asyncio development; will try that and report back
ok so I looked at the `async_to_sync` code and I didn't realize it created a threadpool every time it was called... when I first implemented it I threw this in a `utilities.py` so all of my calls are using the same threadpool:
```python
from asgiref.sync import async_to_sync

from channels.layers import get_channel_layer

channel_layer = get_channel_layer()

sync_group_add = async_to_sync(channel_layer.group_add)
sync_group_discard = async_to_sync(channel_layer.group_discard)
sync_group_send = async_to_sync(channel_layer.group_send)
```
(I import `sync_group_add` etc. wherever it's needed...) bad pattern?
`async_to_sync` doesn't make a threadpool - `sync_to_async` does, so none of those calls are using a threadpool (`async_to_sync` will instead try to find an event loop to run it on). I don't see a problem with what you're doing.
ah, sorry, misunderstood
@andrewgodwin so is `websocket_disconnect` called in an async context? (trying to understand your comment re: async inside sync inside async)
If you're using `SyncConsumer`, which I presume you are from the code, then `websocket_disconnect` is being run inside `sync_to_async` for you already. This means that when you run something inside `async_to_sync` it's double-nesting. This should be fine, and we have a test suite that makes sure it is, but testing it once is one thing versus potential threading leaks in the real world!
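(A minimal sketch of that double-nesting, with stub names standing in for the consumer machinery:)

```python
import asyncio

from asgiref.sync import async_to_sync, sync_to_async


async def group_discard_stub():
    await asyncio.sleep(0.1)  # stand-in for channel_layer.group_discard(...)


def websocket_disconnect_stub():
    # Runs on a sync_to_async threadpool thread, like a SyncConsumer handler;
    # async_to_sync hops back into async land and blocks this thread until
    # the coroutine finishes.
    async_to_sync(group_discard_stub)()


async def main():
    # Roughly what the server does when dispatching an event to a SyncConsumer.
    await sync_to_async(websocket_disconnect_stub)()


asyncio.run(main())
```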
I would not be surprised if something in there was the cause, but until we can get slightly more info I won't be able to even start trying to replicate and test this. This is why I suggested playing with `ASGI_THREADS`, as that may provide useful datapoints.
I set `ASGI_THREADS=1000` yesterday and things appeared to be going well until a few minutes ago:
(these are the occurrences of the "took too long" errors)
A redis `flushdb` fixed the issue again.
Happy to add as much logging as needed to pin this down, any pointers on what I'm looking for?
It's more timing I'm interested in. If you add more threads, is it longer until the error happens? If you reduce them, does it happen faster?
been working on a repro:
https://github.com/beaugunderson/message-repro
this reliably produces the "took too long to shut down and was killed" error. not sure if it's the same mechanism but seems like a valid issue...
note that the python part of the repro may not even be important; i imagine pointing the `connect.js` in that repro at any channels 2 app would probably trigger this issue
re: `ASGI_THREADS` timing, setting it to 1000 did not seem to have an effect; it still happened after 20-26 hours or so (which may correspond to some busy time of day for the people using the app)
Hmm OK, I'll try to take a look in the next week or two. Thanks for investigating the threads setting - that rules out it being a thread deadlock.
I have the same issue: channels_redis = 2.1.1, daphne = 2.1.1, channels = 2.1.1
I have the same issue too, using `JsonWebsocketConsumer`; I thought something was happening between `async_to_sync` and `sync_to_async`.
Any news on this one? Getting the same problem. Setting `ASGI_THREADS=1000` effectively delays the problem... still waiting to see how many hours it takes before it hangs again.