
Sending messages is slow when using groups

Open remusmp opened this issue 6 years ago • 28 comments

It seems that groups slow down sending messages by roughly 100x. I've created 3 different echo consumers in order to compare them:

  1. AsyncJsonWebsocketConsumer with groups
  2. AsyncConsumer with groups
  3. AsyncConsumer without groups

I got these timing results:

1.
Elapsed time 0.026857614517211914.
Elapsed time 0.026637554168701172.
Elapsed time 0.03318667411804199.
Elapsed time 0.03062891960144043.
Elapsed time 0.026413679122924805.
Elapsed time 0.028857707977294922.
Elapsed time 0.03323197364807129.
Elapsed time 0.026538610458374023.
Elapsed time 0.036324501037597656.
Elapsed time 0.0322566032409668.
Elapsed time 0.0385587215423584.
Elapsed time 0.02066636085510254.
Elapsed time 0.03473186492919922.
Elapsed time 0.03095531463623047.
Elapsed time 0.022891759872436523.

2.
Elapsed time 0.02672290802001953.
Elapsed time 0.022529125213623047.
Elapsed time 0.042157888412475586.
Elapsed time 0.028131961822509766.
Elapsed time 0.035176753997802734.
Elapsed time 0.03284716606140137.
Elapsed time 0.03023362159729004.
Elapsed time 0.03440999984741211.
Elapsed time 0.029916763305664062.
Elapsed time 0.03461766242980957.

3. 
Elapsed time 0.00019168853759765625.
Elapsed time 9.107589721679688e-05.
Elapsed time 0.00020265579223632812.
Elapsed time 0.0004429817199707031.
Elapsed time 0.0002741813659667969.
Elapsed time 0.00019693374633789062.
Elapsed time 0.00026679039001464844.
Elapsed time 0.0003097057342529297.
Elapsed time 0.00017261505126953125.
Elapsed time 0.0003085136413574219.
Elapsed time 0.00048232078552246094.
Elapsed time 0.0002384185791015625.

Here is the consumers code:

from channels.generic.websocket import AsyncJsonWebsocketConsumer, AsyncConsumer
from time import time
from channels.layers import get_channel_layer

# 1. AsyncJsonWebsocketConsumer echoing via group_send (groups)
class EchoConsumer1(AsyncJsonWebsocketConsumer):

    async def connect(self):
        await self.accept()
        await self.channel_layer.group_add("test", self.channel_name)

    async def disconnect(self, close_code):
        await self.channel_layer.group_discard("test", self.channel_name)

    async def receive_json(self, content):
        start = time()
        await self.channel_layer.group_send(
            "test",
            {
                "type": "chat.message",
                "json": content
            }
        )
        print("Elapsed time {}.".format(time()-start))

    async def chat_message(self, event):
        await self.send_json(event["json"])


# 2. AsyncConsumer echoing via group_send (groups)
class EchoConsumer2(AsyncConsumer):

    async def websocket_connect(self, event):
        await self.send({
            "type": "websocket.accept",
        })
        await self.channel_layer.group_add("test", self.channel_name)

    async def websocket_send(self, event):
        await self.send(event)

    async def websocket_receive(self, event):
        start = time()
        channel_layer = get_channel_layer()
        await channel_layer.group_send(
            "test",
            {
                "type": "websocket.send",
                "text": event["text"],
            }
        )
        print("Elapsed time {}.".format(time()-start))


# 3. AsyncConsumer echoing directly to the socket (no channel layer involved)
class EchoConsumer3(AsyncConsumer):

    async def websocket_connect(self, event):
        await self.send({
            "type": "websocket.accept",
        })

    async def websocket_receive(self, event):
        start = time()
        await self.send({
            "type": "websocket.send",
            "text": event["text"],
        })
        print("Elapsed time {}.".format(time()-start))

remusmp avatar Mar 01 '18 19:03 remusmp

I'll try to look into this in the next couple of weeks.

andrewgodwin avatar Mar 03 '18 23:03 andrewgodwin

Is Redis even involved in the third consumer?

agronick avatar Mar 23 '18 02:03 agronick

I'm also experiencing very slow behavior when using groups.

SHxKM avatar Jul 09 '18 18:07 SHxKM

Any update on this? Or is there a solution that hasn't come up here? (We have this issue, but we don't use the latest versions. I would switch to the latest/working version if I could see that it's worth it.)

naszta avatar Oct 03 '18 11:10 naszta

any update

Zach-Edge avatar Nov 27 '18 16:11 Zach-Edge

Many thanks to @andrewgodwin for all your hard work on this great project! Are there any updates regarding this bug?

I am considering whether to use Django Channels on our site in production in the long term (we were hit by the same problem in testing), but I see that this bug has been open for over a year.

physicsai avatar Apr 10 '19 00:04 physicsai

any update

These comments don't add anything. You can see there's no update, otherwise there'd be an update... 🙂

This needs someone experiencing the issue to spend the time to dig into it. What we don't have here is an answer to why it's happening.

Short of that, it'll just have to sit for a while.

carltongibson avatar May 09 '19 06:05 carltongibson

I have re-tested to check whether the issue is still present (also checking differences between versions of channels-redis and channels).

These are my results:

## channels 2.2.0  + channel_redis 2.2.1
consumer, mean, variance, max, min
 v1, 0.003962, 4.2704E-7, 0.011806, 0.003338
 v2, 0.004009, 2.4160E-7, 0.006387, 0.003317
 v3, 0.000079, 1.1771E-9, 0.000275, 0.000055

## channels 2.3.1  + channel_redis 2.3.3
consumer, mean, variance, max, min
 v1, 0.000831, 1.8120E-8, 0.002146, 0.000706
 v2, 0.000826, 3.7920E-8, 0.002566, 0.000683
 v3, 0.000069, 8.9345E-10, 0.000288, 0.000053

## channels 2.4.0  + channel_redis 2.4.2
consumer, mean, variance, max, min
 v1, 0.000814, 6.0222E-9, 0.001563, 0.000715
 v2, 0.000810, 2.2231E-8, 0.002422, 0.000684
 v3, 0.000069, 6.4959E-10, 0.000269, 0.000054

Each test was performed with 200 requests to the same echo consumer. It seems that between v2.2 and v2.3 there is a performance improvement (~5x faster), but even with just 1 group performance is quite poor compared to a plain send (~12x slower).

Some good work has been done, but I believe slowing down by an order of magnitude with a single group is still bad.

However, for the sake of completeness, I will try a more complex test with more than 1 group to see whether there are drawbacks.

sevdog avatar Apr 13 '20 15:04 sevdog

Hi @sevdog. Thanks. If you can work out where the time is being spent, that should help... 🙂

carltongibson avatar Apr 13 '20 15:04 carltongibson

I ran a test logging times inside the group_send function at 3 different points:

  • after connection (line 614)
  • after map-channels-to-connection (line ~628)
  • at the end

I calculated the elapsed time from the moment the function is called (a rough sketch of this kind of checkpoint instrumentation is at the end of this comment).

These are times stats:

# connection
0.000024, 4.6381E-11, 0.000082, 0.000020
# map channels 
0.000459, 1.2593E-8, 0.001299, 0.000402
# whole function
0.000903, 3.4705E-8, 0.002230, 0.000780

From these data it seems that _map_channel_keys_to_connection takes about the same time as the following for loop.

This was still tested with just 1 group.

I will do some other checks with 2 and 10 groups as soon as I can.
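For reference, the checkpoint logging described above looks roughly like the following sketch (stand-in awaits instead of the real core.py code; the names and sleep durations are purely illustrative):

import asyncio
import time


async def group_send_instrumented():
    start = time.monotonic()

    await asyncio.sleep(0.001)  # stand-in for "get connection"
    print("connection: {:.6f}".format(time.monotonic() - start))

    await asyncio.sleep(0.002)  # stand-in for "map channels to connection"
    print("map channels: {:.6f}".format(time.monotonic() - start))

    await asyncio.sleep(0.003)  # stand-in for the per-channel send loop
    print("whole function: {:.6f}".format(time.monotonic() - start))


asyncio.run(group_send_instrumented())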

sevdog avatar Apr 16 '20 19:04 sevdog

I have to apologize for my previous statement. It seems that what takes about 0.000210 seconds (with a single client connected) is this part: https://github.com/django/channels_redis/blob/0c27648b60a1918c867b96a5ad3f67ef58ccceda/channels_redis/core.py#L614-L622

I have also performed another test with 3 different clients connecting to my 3 test consumers; these are the times:

## channels 2.2.0  + channel_redis 2.2.1
 v1, 0.002098, 1.0687E-7, 0.003900, 0.001585
 v2, 0.002735, 1.2245E-6, 0.010026, 0.001693
 v3, 0.000074, 3.8980E-10, 0.000122, 0.000052
## channels 2.3.1  + channel_redis 2.3.3
 v1, 0.002400, 1.2855E-6, 0.010583, 0.001343
 v2, 0.002093, 1.0505E-7, 0.002888, 0.001016
 v3, 0.000078, 6.7096E-10, 0.000119, 0.000048
## channels 2.4.0  + channel_redis 2.4.2
 v1, 0.001904, 5.5975E-8, 0.003261, 0.001259
 v2, 0.001906, 5.8758E-8, 0.003391, 0.001102
 v3, 0.000065, 2.4156E-10, 0.000095, 0.000050

I will try to add more automation to my current tests to get a better view of how the situation evolves with more clients (maybe with a graph or something similar).

sevdog avatar Apr 24 '20 06:04 sevdog

Any updates on this Issue?

nhondong avatar May 11 '20 13:05 nhondong

I have done some other tests with a growing number of clients connecting to these test consumers (1, 6, 11, 16, 21, 26 and 31), with these results:

(Screenshot of the timing results, 2020-05-16)

I doubt these results prove much, since I still need a comparison with channels v1 and this should be tested in a multi-host environment. My tests were performed on my machine with Docker, but they should be done in a semi-realistic environment (server and clients separated).

Also, a better way to collect time-related data should be used (instead of print/log).

sevdog avatar May 16 '20 10:05 sevdog

It seems that Django Channels is not really used in production environments. This problem is a no-go.

nhondong avatar May 17 '20 16:05 nhondong

I have not tested against the previous versions (channels 1.x / asgi-redis 1.x), so this is not a full comparison between major versions.

Also, as I stated before, an accurate test requires different hosts for clients and server. This requires more time (and money).

Moreover, my tests were only performed with Python 3.6; other versions should be tested as well, because this may be related to how asyncio is handled.

sevdog avatar May 20 '20 06:05 sevdog

Is there a way to send directly between consumers in the meantime? It seems like calling

 #add people to a group, then 
 await self.channel_layer.group_send("targetgroup", functionstuff) 

is the current problem, but what about:

 # manually create a group or know about it beforehand
 for channel_name in group:
     await self.channel_layer.send(channel_name, functionstuff)

I'll try to test this soon; however, managing groups ourselves seems like a pain.

EDIT: reading the docs, they say you can maintain the groups yourself. I'll have a look at just using channel_layer.send() vs channel_layer.group_send() with 1-3 consumers.
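A minimal sketch of that "maintain the group yourself" idea (names like group_channels and manual_group_send are hypothetical, and a module-level set only tracks consumers in the current process, unlike real groups which live in Redis):

import asyncio

from channels.layers import get_channel_layer

# e.g. populated in websocket_connect with group_channels.add(self.channel_name)
# and cleaned up in websocket_disconnect
group_channels = set()


async def manual_group_send(message):
    channel_layer = get_channel_layer()
    # fan out plain sends instead of one group_send
    await asyncio.gather(*[
        channel_layer.send(channel_name, message)
        for channel_name in group_channels
    ])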

alex-ong avatar Jun 04 '20 04:06 alex-ong

Running a quick test, I compared self.channel_layer.send(self.channel_name) vs self.channel_layer.group_send() on a group that only includes myself.

I got 1ms and 2ms respectively. I also noticed that the v3 test doesn't even use Redis? Running v3 gives me the ~0.000005 (5 thousandths of a ms) that the previous testers got.

What does self.send() do? It seems like it's a websocket send, not a Redis send?

Is the goal to get inter-process communication to be as fast as a websocket send? That makes sense on some level; but which part (Python handing the message off, pickling, sending to Redis, then receiving from Redis and unpickling) is the bottleneck?

alex-ong avatar Jun 05 '20 08:06 alex-ong

#223 was related. (A speed-up was reported after switching the Redis connection to unix sockets.)
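For anyone wanting to try that, switching to a unix socket is a channel-layer configuration change along these lines (the socket path and the exact "hosts" string format are assumptions; check the channels_redis README for the form your version accepts):

CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            # Redis listening on a local unix socket instead of TCP (path is hypothetical)
            "hosts": ["unix:///var/run/redis/redis.sock"],
        },
    },
}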

carltongibson avatar Sep 18 '20 06:09 carltongibson

Channels appears to create a number of Lua scripts that get executed. Is it worth considering pre-loading those scripts into Redis at start-up and then just calling them during processing? I'm wondering whether some of these delays are caused by the scripts being loaded and parsed before being executed. (Unfortunately, I don't know of any way of measuring that.)
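For illustration, "pre-loading" amounts to a SCRIPT LOAD once plus EVALSHA on later calls; a minimal redis-py sketch of that pattern (the script and names are hypothetical, not what channels_redis actually runs):

import redis

r = redis.Redis()

# SCRIPT LOAD happens once here; later calls only send the SHA1 digest.
incr_script = r.register_script("return redis.call('INCR', KEYS[1])")

print(incr_script(keys=["example:counter"]))  # executed via EVALSHA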

KenWhitesell avatar Sep 29 '20 15:09 KenWhitesell

Hey @KenWhitesell. :wave: I don't think that's the issue. They're cached on first load. https://github.com/django/channels_redis/commit/572aa946b58b8187da542efb375d73bf526030d7 was added to make the most of this.

carltongibson avatar Sep 29 '20 16:09 carltongibson

I just used yappi to do a more accurate profiling of this call, using this code.

Using 32 clients (which were processes on an other host) I got the following results:

name                         ncall  tsub      ttot      tavg     
RedisChannelLayer.group_send  1      0.003629  0.005724  0.005724  
RedisChannelLayer.group_send  1      0.004395  0.006690  0.006690 
RedisChannelLayer.group_send  1      0.006053  0.009821  0.009821 
RedisChannelLayer.group_send  1      0.004937  0.007409  0.007409 
RedisChannelLayer.group_send  1      0.004702  0.008454  0.008454 

I then tried to figure out how the internal code of the method could be improved and got interested in this piece of code:

https://github.com/django/channels_redis/blob/243eb7e4ecfc2e922508e50210884d3c9bae5403/channels_redis/core.py#L665-L668

Since these tasks have no dependency that prevents concurrent execution, this can be improved by dispatching them concurrently with asyncio.gather. So I tried this code:

await asyncio.gather(*[
    connection.zremrangebyscore(
        key, min=0, max=int(time.time()) - int(self.expiry)
    ) for key in channel_redis_keys
])

Which gives the following time results:

name                         ncall  tsub      ttot      tavg     
RedisChannelLayer.group_send  1      0.002697  0.005851  0.005851 
RedisChannelLayer.group_send  1      0.001918  0.003774  0.003774  
RedisChannelLayer.group_send  1      0.004098  0.008801  0.008801 
RedisChannelLayer.group_send  1      0.002748  0.005214  0.005214
RedisChannelLayer.group_send  1      0.003148  0.006863  0.006863 

In my test this reduced both the overall and the internal time.

sevdog avatar Jan 12 '21 07:01 sevdog

Hey sevdog, thanks for the update. Once Django 3.2a1 is out of the way I will be swinging back to Channels, so I will take a look at this. 👍

carltongibson avatar Jan 12 '21 08:01 carltongibson

@carltongibson any updates here?

pbylina avatar Feb 11 '21 14:02 pbylina

@carltongibson @andrewgodwin anything new? I would like to use channels for a scalable application (expecting 1000+ users in a group). If that won't be feasible in the near future, should I consider implementing my own channel layer?

Are there any alternatives? What service should I use for the layer?

aryaniyaps avatar Apr 26 '21 04:04 aryaniyaps

Hi both.

No, this is still pending investigation. I've not had bandwidth as yet. If you'd like to dig in that would be super!

carltongibson avatar Apr 26 '21 05:04 carltongibson

I have a question, @carltongibson. Here Andrew says that channels should not be used for heavy messaging in the first place, and that we need to roll out our own solutions for sending messages between multiple nodes. If this issue is fixed, does that mean we can use channels to send messages to hundreds of nodes per group (assuming group sending becomes faster)?

aryaniyaps avatar Apr 26 '21 05:04 aryaniyaps

@aryan340 Without testing there's no real way to say. I would suspect, though, that at scale you'd want to be looking into more fully featured and battle-tested message brokers. There's a lot in this space that's simply out of scope for us here.

carltongibson avatar Apr 26 '21 08:04 carltongibson

For those following this issue, it might be interesting for you to know there is a new channel layer implementation which is a lightweight layer on top of Redis Pub/Sub.

Redis Pub/Sub is efficient & fast. This new channel layer thinly wraps it so we can borrow its speed/efficiency. If you run a ton of messages through your app's channel layer (particularly if using a lot of group_send()) then I expect you'll see a big performance boost by switching to this new channel layer (we did - we've been running it in production for 6 months and it's been great).

It's called RedisPubSubChannelLayer. See the channels_redis README for info on how to use it. It's still considered alpha-stage, so proceed with caution and try it first in your dev/staging environment. Please file an issue if you find one.
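Enabling it is just a settings change, roughly the following, assuming a local Redis (the host/port are placeholders; the README remains the authoritative reference):

CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.pubsub.RedisPubSubChannelLayer",
        "CONFIG": {
            "hosts": [("localhost", 6379)],
        },
    },
}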

Hope it helps someone!

Disclaimer: I started the RedisPubSubChannelLayer, so I'm biased toward positive feelings about it. But, it has already had other awesome people contribute fixes and improvements (biggest shoutout to @qeternity). :+1:

acu192 avatar Oct 24 '21 03:10 acu192