channels_redis
"took too long to disconnect" issue is back
With no deployment or code changes this just started happening today:
Looking for more logs because I know this isn't helpful as-is; sorry!
Was working fine all morning, then it just started spewing tons of "took too long to disconnect" errors; more stats:
Also seeing a ton of "Exception inside application: Lock is not acquired." at the same time
Lock is not acquired errors are likely happening because you hit the application close timeout and are being force-killed. I'd look into why the app isn't closing.
Last time this state happened it was something in the redis db itself, as a `flushdb` fixed it--working on getting a dump of the db and will then `flushdb` and see if that fixes things; if so will try to figure out why.

`flushdb` did indeed fix the issue; I'm looking through the dump now for clues.
OK, one weird thing is that my `zset`s for groups are huge, meaning they are not ever being expired... e.g. "asgi::group:account.<censored>". I'd expect channels_redis to fall over if these numbers get too large, which happened earlier today.
I wrote a script to generate these stats:
```python
#!/usr/bin/env python
import json

import jsonlines


def main():
    # production.json is generated with redis-dump (`gem install redis-dump`)
    with jsonlines.open('./production.json') as reader:
        for line in reader:
            if line['key'].startswith('asgi::group:account.'):
                print(f'asgi::group:account.<censored>: {len(line["value"])}')
            elif line['key'].startswith('asgi::group:organization.'):
                print(f'asgi::group:organization.<censored>: {len(line["value"])}')
            elif line['key'].startswith('asgi:specific.'):
                print(f'asgi::specific.<censored>: {len(line["value"])}')
            else:
                print(json.dumps(line, indent=2))


if __name__ == '__main__':
    main()
```
```
asgi::group:account.<censored>: 107
asgi::group:account.<censored>: 127
asgi::group:account.<censored>: 127
asgi::group:account.<censored>: 135
asgi::group:account.<censored>: 136
asgi::group:account.<censored>: 14
asgi::group:account.<censored>: 140
asgi::group:account.<censored>: 174
asgi::group:account.<censored>: 188
asgi::group:account.<censored>: 19
asgi::group:account.<censored>: 190
asgi::group:account.<censored>: 193
asgi::group:account.<censored>: 197
asgi::group:account.<censored>: 198
asgi::group:account.<censored>: 23
asgi::group:account.<censored>: 26
asgi::group:account.<censored>: 266
asgi::group:account.<censored>: 269
asgi::group:account.<censored>: 327
asgi::group:account.<censored>: 4
asgi::group:account.<censored>: 442
asgi::group:account.<censored>: 49
asgi::group:account.<censored>: 49
asgi::group:account.<censored>: 54
asgi::group:account.<censored>: 654
asgi::group:account.<censored>: 72
asgi::group:account.<censored>: 90
asgi::group:organization.<censored>: 125
asgi::group:organization.<censored>: 1681
asgi::group:organization.<censored>: 2
asgi::group:organization.<censored>: 2017
asgi::group:organization.<censored>: 90
asgi::specific.<censored>: 100
asgi::specific.<censored>: 102
```
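(For live inspection without a dump, here's a minimal sketch using redis-py; the connection details and key pattern are illustrative and assume the `asgi::group:` prefix seen above:)

```python
# Sketch: inspect group zset sizes live with redis-py instead of a dump.
# Assumes `pip install redis`; host/port and key pattern are illustrative.
import redis

r = redis.Redis(host='localhost', port=6379)

for key in r.scan_iter(match='asgi::group:*'):
    # Each group key is a sorted set of channel names; ZCARD is its size.
    print(f'{key.decode()}: {r.zcard(key)}')
```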
Well, group membership takes a day to expire by default, but the `discard` handler should be removing these (unless something is locking up your application on disconnect, which it sounds like is happening) - so this might be a symptom, not a cause?
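(That one-day default is the channel layer's `group_expiry` option; a minimal sketch of tuning it in Django settings, with illustrative host values:)

```python
# settings.py -- sketch; group_expiry is the relevant knob here.
CHANNEL_LAYERS = {
    'default': {
        'BACKEND': 'channels_redis.core.RedisChannelLayer',
        'CONFIG': {
            'hosts': [('localhost', 6379)],  # illustrative
            # Seconds before an un-refreshed group membership expires;
            # the channels_redis default is 86400 (one day).
            'group_expiry': 86400,
        },
    },
}
```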
This is my `websocket_disconnect` handler:
```python
def websocket_disconnect(self, message):
    logger.info(f'{self.account_key}: websocket_disconnect called\nmessage: {message}')

    if hasattr(self, 'presence_groups'):
        for group in self.presence_groups:
            PresenceGroup.objects.remove(group, self.channel_name)

    raise StopConsumer()
```
I'm using a custom implementation of `PresenceGroup` that keeps them in the database. I don't see any exceptions occurring from this code, and the only thing that could cause it to take longer than ~500ms is a deadlock, which I also don't see evidence of... I'm adding additional logging though, and I'll have to audit that code and look at how the discard handler works in vanilla channels.
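(For reference, the vanilla handler is roughly the following -- paraphrased from channels 2.x's `AsyncWebsocketConsumer`, so check your installed version for the authoritative code:)

```python
from channels.exceptions import StopConsumer


async def websocket_disconnect(self, message):
    # Discard this channel from every group it joined on connect,
    # then call the user disconnect hook and halt the consumer.
    for group in self.groups:
        await self.channel_layer.group_discard(group, self.channel_name)
    await self.disconnect(message['code'])
    raise StopConsumer()
```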
(Also noting that after the `flushdb` yesterday I hit this again this morning and it was fixed by another `flushdb`.)
I can't see anything obvious there - even if you had an exception that raised and skipped over the `StopConsumer` it would still halt the coroutine and log to console. Hm. It's especially weird that Redis is blocking this up since you don't appear to be talking to Redis inside it.
I do call `async_to_sync(group_discard)()` inside that call to `remove()`...
Ah, curious, calling async inside sync inside async. That would make me a little suspicious.
To be precise - when you call sync code, it uses a fixed-size threadpool, and when you then call async code, it blocks that thread while the async loop does its thing. May I suggest setting the environment variable `ASGI_THREADS` to something like `1000` or `2` and seeing if that makes it better or worse?
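(A minimal sketch of the failure mode described here, with a deliberately tiny executor standing in for the `ASGI_THREADS`-sized pool -- all names are illustrative:)

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

from asgiref.sync import async_to_sync

POOL = ThreadPoolExecutor(max_workers=2)  # stand-in for the ASGI_THREADS pool


async def slow_layer_op():
    await asyncio.sleep(1)  # stand-in for a Redis round trip


def sync_disconnect_handler():
    # Blocks its pool thread for the full duration of the coroutine.
    async_to_sync(slow_layer_op)()


async def main():
    loop = asyncio.get_running_loop()
    # Ten "disconnects" contend for two threads; the queue backs up, and in
    # a real server the slowest ones would hit the application close timeout.
    await asyncio.gather(
        *(loop.run_in_executor(POOL, sync_disconnect_handler) for _ in range(10))
    )


asyncio.run(main())
```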
hmm interesting! still wrapping my head around all of the idiosyncrasies of asyncio development; will try that and report back
ok so I looked at the `async_to_sync` code and I didn't realize it created a threadpool every time it was called... when I first implemented it I threw this in a `utilities.py` so all of my calls are using the same threadpool:
```python
from asgiref.sync import async_to_sync

from channels.layers import get_channel_layer

channel_layer = get_channel_layer()

sync_group_add = async_to_sync(channel_layer.group_add)
sync_group_discard = async_to_sync(channel_layer.group_discard)
sync_group_send = async_to_sync(channel_layer.group_send)
```
(I import `sync_group_add` etc. wherever it's needed...) bad pattern?
`async_to_sync` doesn't make a threadpool - `sync_to_async` does, so none of those calls are using a threadpool (`async_to_sync` will instead try to find an event loop to run it on). I don't see a problem with what you're doing.
ah, sorry, misunderstood
@andrewgodwin so is `websocket_disconnect` called in an async context? (trying to understand your comment re: async inside sync inside async)
If you're using `SyncConsumer`, which I presume you are from the code, then `websocket_disconnect` is being run inside `sync_to_async` for you already. This means that when you run something inside `async_to_sync` it's double-nesting. This should be fine, and we have a test suite that makes sure it is, but testing it once is one thing versus potential threading leaks in the real world!
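(A minimal sketch of that double-nesting, with stub names standing in for the consumer machinery:)

```python
import asyncio

from asgiref.sync import async_to_sync, sync_to_async


async def group_discard_stub():
    await asyncio.sleep(0.1)  # stand-in for channel_layer.group_discard(...)


def websocket_disconnect_stub():
    # Runs on a sync_to_async threadpool thread, like a SyncConsumer handler;
    # async_to_sync hops back into async land and blocks this thread until
    # the coroutine finishes.
    async_to_sync(group_discard_stub)()


async def main():
    # Roughly what the server does when dispatching an event to a SyncConsumer.
    await sync_to_async(websocket_disconnect_stub)()


asyncio.run(main())
```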
I would not be surprised if something in there was the cause, but until we can get slightly more info I won't be able to even start trying to replicate and test this. This is why I suggested playing with `ASGI_THREADS`, as that may provide useful datapoints.
I set `ASGI_THREADS=1000` yesterday and things appeared to be going well until a few minutes ago:
(these are the occurrences of the "took too long" errors)
A redis `flushdb` fixed the issue again.
Happy to add as much logging as needed to pin this down, any pointers on what I'm looking for?
It's more timing I'm interested in. If you add more threads, is it longer until the error happens? If you reduce them, does it happen faster?
been working on a repro:
https://github.com/beaugunderson/message-repro
this reliably produces the "took too long to shut down and was killed" error. not sure if it's the same mechanism but seems like a valid issue...
note that the python part of the repro may not even be important; i imagine pointing the `connect.js` in that repro at any channels 2 app would probably trigger this issue
re: `ASGI_THREADS` timing, setting it to 1000 did not seem to have an effect; it still happened after 20-26 hours or so (which may correspond to some busy time of day for the people using the app)
Hmm OK, I'll try to take a look in the next week or two. Thanks for investigating the threads setting - that rules out it being a thread deadlock.
I have the same issue: channels_redis = 2.1.1, daphne = 2.1.1, channels = 2.1.1
I have the same issue too, using `JsonWebsocketConsumer`; I thought something was happening between `async_to_sync` and `sync_to_async`.
Any news on this one? Getting the same problem. Setting `ASGI_THREADS=1000` effectively delays the problem... still waiting to see how many hours it takes before it hangs again.