Newly connected channels may fail to receive messages from group_send()
It looks like occasionally some channels won't receive messages sent via group_send(), although they all connected to our server successfully. Disconnecting and reconnecting may resolve the issue.
Here is a sample: two clients connected via websocket and joined the same group, partner_orders_1531:
127.0.0.1:6379[1]> zrange asgi::group:partner_orders_1531 0 -1 withscores
1) "specific.mrOZIufu!GAaMMeISqdry"
2) "1549585254.4294083"
3) "specific.qUSiCEQD!TGlphabkczyd"
4) "1549585307.3211055"
Then we call async_to_sync(channel_layer.group_send):
from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

channel_layer = get_channel_layer()
async_to_sync(channel_layer.group_send)(partner_order_group_name, order_data)
but only one channel, with channel name qUSiCEQD!TGlphabkczyd, was able to receive the message and trigger:
class PartnerOrderConsumer(WebsocketConsumer):
    ...
    def order_receive(self, event):
        self.send(text_data=event["text"])
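For the group_send() call to dispatch to order_receive, the message dict has to carry a "type" key whose dotted name maps to that handler, and each consumer has to join the group on connect. A minimal sketch of the full round trip, using the group name and payload shape from the snippets above (the rest is illustrative, not the poster's actual code):

from asgiref.sync import async_to_sync
from channels.generic.websocket import WebsocketConsumer
from channels.layers import get_channel_layer


class PartnerOrderConsumer(WebsocketConsumer):
    group_name = "partner_orders_1531"  # illustrative; normally derived per partner

    def connect(self):
        # Join the group so group_send() to it reaches this consumer's channel.
        async_to_sync(self.channel_layer.group_add)(self.group_name, self.channel_name)
        self.accept()

    def disconnect(self, close_code):
        async_to_sync(self.channel_layer.group_discard)(self.group_name, self.channel_name)

    def order_receive(self, event):
        # Dispatched for messages whose "type" is "order.receive".
        self.send(text_data=event["text"])


# From a view, signal handler, task, etc.:
def notify_partner(order_json):
    channel_layer = get_channel_layer()
    async_to_sync(channel_layer.group_send)(
        "partner_orders_1531",
        {"type": "order.receive", "text": order_json},  # "type" selects order_receive
    )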
Related redis log:
Pip & Environment:
Python 3.6.6
channels==2.1.5
channels-redis==2.3.1
daphne==2.2.3
Django==1.11.10
The server is hosted on Ubuntu 16.04.5 LTS with 2 vCPUs / 4 GB memory on DigitalOcean. We have two daphne instances and use nginx for load balancing.
same issue here, please help, thanks!
I noticed the same behaviour a couple of weeks ago. My setup was python manage.py runserver
behind nginx for SSL offloading. It did not appear to happen when the websocket connection was established to the runserver process without nginx in front.
It did not seem to appear when the software was running via daphne, but I'll have to check it in more detail to be sure.
We experience this issue as well. At first we thought it was an issue with the channels-redis backend, so we moved over to a rabbitmq backend, but this has not solved the problem. It seems that daphne workers time out to their backend and stop receiving subscribed group_sends.
> It seems that daphne workers time out to their backend and stop receiving subscribed group_sends
So, can we pin that down to a specific issue with Daphne? (If so, a ticket there with a reproduce would be wonderful. If not, some kind of reproduce would still be helpful...)
> ...behind nginx for SSL offloading.
So is it an nginx config issue?
This isn't really actionable as is: there's nothing pointing to channels. Likely the issue is elsewhere in the stack.
I'm going to close for now. If someone can put together a reproduce showing channels (or Daphne) is at fault, I'm happy to re-open.
@carltongibson I'm not sure why you'd close it. Surely the fact that there's nothing actionable means it warrants further investigation. It's not an nginx issue, because simply issuing a reload (kill -HUP) to daphne immediately fixes the problem. Also, anything that isn't group_send based (e.g. application-level ping-pong) still works just fine.
It's not an addressable issue as it stands. Is it DigitalOcean? Is it Nginx? Is it Daphne? Is it Channels? If you read the comments here, there is nothing in common... so pending a reproduce there's nothing to do. We don't have capacity to chase down phantom issues with no details. needsinfo
is a legitimate closing reason.
> If someone can put together a reproduce showing channels (or Daphne) is at fault, I'm happy to re-open.
> If you read the comments here, there is nothing in common
Channels/Daphne/group_send are the common denominator.
> It did not seem to appear when the software was running via daphne
As I said, I'm very happy to investigate if anyone can provide a reproduce.
So it would seem I owe @carltongibson an apology. After further debugging, this appears to be an odd behavior with nginx buffering.
@DrChai @xuehaoze try adding "proxy_buffering off;" to the nginx config block that handles websocket proxying?
EDIT: see below, this does not fix the problem.
@qeternity Thanks for the follow-up and the pointer. Glad you made progress!
@qeternity @carltongibson Thanks for your reply, we will give it a try!
After running this for a few days in production, this does NOT resolve the issue. group_send is still broken, but everything else works. Restarting daphne is the only solution. I have tried both redis and rabbitmq backends, so this suggests it's internal to daphne/channels.
@qeternity I have to upgrade an old project to this right now, and this issue is going to be a problem for me, as I'm using this same setup. Now, here's something from layers.py that concerns me, in async def group_send(self, group, message):
for channel in self.groups.get(group, set()):
    try:
        await self.send(channel, message)
    except ChannelFull:
        pass
There is no logging in self.send if the channel is full, either. It just raises the exception, which is then swallowed by this code. If this condition happens, you'll never know it.
So, my first thought is that the group messages are never completely cleaned up from the channel; possibly only in some specific use case, or possibly always. They'd get handled by the worker, but not removed from the channel. This would explain why it takes time for the scenario to happen, because it takes time for the channel to fill up.
So, if you already have a setup where you can make this bug happen, could you add a debug line instead of "pass" on the final line, or perhaps just re-raise the exception? If we know the problem is this, then the search for the bug should be much easier. Also, if this is the reason, I'd expect raising the frequency of group_send calls in your test code to make it happen faster, unless the problem is caused by a group_send happening simultaneously with some other method call, and only then leaving the channel with one slot less space in it.
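For illustration, that debug line could be as simple as subclassing the in-memory layer whose group_send is quoted above and logging the drop (class and logger names are mine, not from the project):

import logging

from channels.exceptions import ChannelFull
from channels.layers import InMemoryChannelLayer

logger = logging.getLogger(__name__)


class LoggingInMemoryChannelLayer(InMemoryChannelLayer):
    """Same as the stock layer, but logs instead of silently dropping."""

    async def group_send(self, group, message):
        for channel in self.groups.get(group, set()):
            try:
                await self.send(channel, message)
            except ChannelFull:
                logger.warning(
                    "group_send to %r dropped a message: channel %s is at capacity",
                    group, channel,
                )

Pointing CHANNEL_LAYERS["default"]["BACKEND"] at a class like this would show whether the warning ever fires, though as noted further down this only covers the in-memory layer.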
This is all pure guesswork. This is the first time I have ever even looked at this project's code.
Hi @mkoponen. That's interesting. (Gives an avenue to work on at least...) I have some time to work on Channels at DjangoCon Europe next week, so I'll re-open to investigate. (If anyone wants to jump in before then, please do! 🙂)
@mkoponen @carltongibson I'll debug this week, but we've used two separate backends (redis and rabbitmq) and neither has an implementation like that. That's the InMemory layer, which isn't what we're using.
Ah, so it is. However, then my next question is, does this condition ever return true in the problem case?
if redis.call('LLEN', KEYS[i]) < tonumber(ARGV[i + #KEYS]) then
I don't understand all the concepts well enough yet to see what this line means, but I do notice that again the method would just fail silently if it didn't.
The first goal should be to figure out that when the problem starts, is the problem in pushing the information into Redis, or receiving the information from it.
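One way to check that directly is to exercise the channel layer from a Django shell on the affected host, outside Daphne. A sketch (the group name is the one from this thread and the timeout is arbitrary; note that any live consumers in the group you probe will also receive the test message and will log a missing-handler error for it):

import asyncio

from channels.layers import get_channel_layer


async def probe(group="partner_orders_1531"):
    layer = get_channel_layer()
    test_channel = await layer.new_channel()   # fresh channel, not owned by Daphne
    await layer.group_add(group, test_channel)
    try:
        await layer.group_send(group, {"type": "probe.test", "text": "ping"})
        # If this times out, the push into Redis (or the group membership) is
        # broken; if it returns, the layer round trip is fine and the stuck
        # consumer processes are the more likely culprit.
        message = await asyncio.wait_for(layer.receive(test_channel), timeout=5)
        print("received:", message)
    except asyncio.TimeoutError:
        print("nothing received within 5 seconds")
    finally:
        await layer.group_discard(group, test_channel)


asyncio.get_event_loop().run_until_complete(probe())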
EDIT: Another suggestion: Could there be a scenario in the two awaits at the top that seek to remove expired channels, such that they would simply await forever?
EDIT2: I see. This is from the "simple" send() method:
# Check the length of the list before send
# This can allow the list to leak slightly over capacity, but that's fine.
if await connection.llen(channel_key) >= self.get_capacity(channel):
    raise ChannelFull()
Capacity is also given as an argument to the Lua code, so it is clearly trying to do this same check for exceeded capacity. Only it does it in Lua, whereas send() does it in Python. But I'm back to my original hypothesis: the channel gets full, and send starts failing silently.
EDIT3: A further hypothesis: group messages get cleaned up only through expiry, and never through successful reception and handling of the message. Those users who send them very infrequently never notice the problem, because the slow expiry is enough to leave room for the new message whenever they need it. Those who send them frequently hit capacity.
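If that hypothesis holds, one cheap experiment is to raise the per-channel capacity (the default is 100 messages) and shorten the expiry, and see whether the lock-up takes correspondingly longer to appear. A sketch of the settings, assuming the channels_redis backend used earlier in this thread (the values are arbitrary):

# settings.py (sketch): bump per-channel capacity and shorten message expiry
# to test the "channel fills up and group_send silently drops" hypothesis.
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            "hosts": [("127.0.0.1", 6379)],
            "capacity": 1000,  # default is 100 messages per channel
            "expiry": 10,      # default is 60 seconds
        },
    },
}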
@qeternity Excuse me, has it been solved now?
This problem is still not resolved. We are using the rabbitmq backend. Our server locked up again and I went into the rabbitmq admin panel and saw that the queues all had 100 messages waiting. This is definitely the problem. It would appear that the asgi websocket workers disconnect from the queue and stop receiving messages, which then pile up to the channel limit.
> We are using the rabbitmq backend
So it's not redis... — still looking for a decent reproduce.
I need to get DRF v3.10 out and then I'll have some time to dig into this and other channels issues. Following up @mkoponen's lead seems most promising currently. If anyone has time to look, that'd be great.
Indeed, as I wrote above, we switched backends attempting to mitigate this but it had no impact.
Was anyone able to make any progress on this? Or find a way to reproduce it reliably? Having the same issue.
I am having the same issue with this. I've tried to create a different test app with the same env specs and everything is working, but my current app is failing to send correct messages to redis via channel layers.
Any update?
This still needs digging into, I'm afraid.
I'm currently looking into issues around AsyncHttpConsumer (in the time I have 😝). Maybe this will turn out to be related to that. If not, this is on my list for afterwards.
If someone could pin it down to an example project that reliably reproduces the issue, that would be more than half way to a solution.
Hi guys, I've managed to fix my issue regarding this problem. The cause was rather silly: I needed to open my WebSocket connection from the frontend app. If I don't open the WS connection, Django never receives the connection notification and completes the handshake, so Channels and Redis cannot send any messages via channel layers (tested and confirmed in my case). I don't quite see the logic behind this, but it removed my problem.
I hope this will help someone.
Cheers!
Seems like I've been looking for the culprit in the wrong place as well, at least for some cases.
For the Googlers coming here: the issue was that there wasn't actually a websocket connection anymore. We did listen to close and error events on the websocket connection in the frontend, and reacted by automatically trying to reconnect. Thus, I assumed, there would always be an active WS connection to django channels. It turns out that in some cases (such as just disconnecting your wifi) no close or error event will be thrown (tested in Chrome 77) and the browser will happily report the connection as active forever. The backend will close the connection from its side after ~50s (no idea where that number is coming from... Channels / Daphne / load balancer?) and therefore also send no further notifications to that channel.
The solution for us was implementing manual heartbeat messages from the frontend to the backend every few seconds. Contrary to my understanding of websockets, this is not handled by the browser (whatever use the Ping/Pong frames have, then...).
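A rough sketch of what such an application-level heartbeat can look like on the consumer side (the message shape and names here are made up for illustration; the browser sends the ping on a timer and tears the socket down and reconnects if a few pongs in a row go missing):

import json

from channels.generic.websocket import WebsocketConsumer


class HeartbeatConsumer(WebsocketConsumer):
    """Answers application-level pings so the client can detect dead sockets."""

    def connect(self):
        self.accept()

    def receive(self, text_data=None, bytes_data=None):
        data = json.loads(text_data)
        if data.get("type") == "heartbeat.ping":
            # Echo straight back; the browser counts missed pongs and reconnects.
            self.send(text_data=json.dumps({"type": "heartbeat.pong"}))
            return
        # ...handle normal application messages here...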
@maurice-g Sorry, I am a bit new to this whole websockets thing. How do you implement a manual heartbeat from the frontend to the backend?
I am using the following:
celery[redis]==4.3.0
channels==2.2.0
channels-redis==2.4.0
I faced this issue and found two culprits so far:
- Client silently disconnects. If you're using django-channels to communicate with a websocket running in a browser, then this is the most likely cause. I noticed that if my users lock their device and come back to my web app after a certain amount of time has passed, the connection will not be there anymore, but it will not have thrown any errors either. The solution is to do what @maurice-g suggested and have a manual heartbeat between the browser and the backend.
- group_send in channels_redis silently failing if the channel is over capacity. Following the lead from @mkoponen, I noticed that when you use group_send() with channels_redis as the backend, there is the potential for group_send() to fail silently. When using group_send, before sending the current message the Lua script will check the number of messages in the channel. If the channel is not yet at full capacity, it will send the message. However, if the channel has hit full capacity then this channel will be skipped silently. I wrote a PR to fix this so that the number of channels at full capacity will be logged, which will help with debugging.
Here is an always-reproducible example of a similar issue: https://stackoverflow.com/questions/59195607/django-channels-daphne-and-docker-clients-stop-receiving-messages-when-the-fir