Interprocess alerts in transit grows. Publishing hangs.

Open gazugafan opened this issue 4 years ago • 4 comments

Running into some strange behavior with a new server setup. Everything seemed fine at first, but publishing a message sometimes hangs, and after that the "interprocess alerts in transit" count keeps growing. Once this happens, it becomes impossible to publish to any channel. For example...

sudo systemctl restart nginx
curl --request POST --data "testing" http://127.0.0.1:8080/nchan_stub_status
total published messages: 0
stored messages: 0
shared memory used: 20K
shared memory limit: 1048576K
channels: 3
subscribers: 8
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 14
interprocess alerts in transit: 3
interprocess queued alerts: 3
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.7
curl --request POST --data "testing" http://127.0.0.1:8080/pub/test
***no response here. Need to CTRL+C

curl --request POST --data "testing" http://127.0.0.1:8080/nchan_stub_status
total published messages: 1
stored messages: 0
shared memory used: 24K
shared memory limit: 1048576K
channels: 3
subscribers: 8
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 14
interprocess alerts in transit: 5
interprocess queued alerts: 5
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.7
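
To make the comparison between the two status dumps above less error-prone, here is a minimal Python sketch that parses the `key: value` text returned by `nchan_stub_status` and reports how much "interprocess alerts in transit" grew between two samples. The function names `parse_stub_status` and `in_transit_growth` are hypothetical helpers written for this illustration, not part of nchan, and the sample strings are trimmed copies of the output shown above.

```python
def parse_stub_status(text):
    """Parse nchan_stub_status output into a dict of raw string values."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines or anything that is not "key: value"
        stats[key.strip()] = value.strip()
    return stats


def in_transit_growth(before_text, after_text):
    """Return how much 'interprocess alerts in transit' grew between two samples."""
    key = "interprocess alerts in transit"
    before = int(parse_stub_status(before_text)[key])
    after = int(parse_stub_status(after_text)[key])
    return after - before


# Trimmed copies of the two status dumps from this report.
BEFORE = """total published messages: 0
interprocess alerts in transit: 3
interprocess queued alerts: 3
nchan version: 1.2.7"""

AFTER = """total published messages: 1
interprocess alerts in transit: 5
interprocess queued alerts: 5
nchan version: 1.2.7"""

growth = in_transit_growth(BEFORE, AFTER)  # 5 - 3
```

A positive `growth` across samples taken while no publish is in flight would match the stuck state described here; on a healthy worker the count would be expected to drain back down.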

Here's the NCHAN portion of the NGINX config...

nchan_shared_memory_size 1G;
nchan_message_buffer_length 500;

server {
        listen 127.0.0.1:8080;
        location ~ /pub/(.*)$ {
                nchan_publisher;
                nchan_channel_id "$1";
                nchan_channel_id_split_delimiter ",";
        }

        location /nchan_stub_status {
                nchan_stub_status;
        }
}

server {
    listen 443 ssl; # managed by Certbot
    ssl_certificate /etc/letsencrypt/live/mydomain.com/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/mydomain.com/privkey.pem; # managed by Certbot
    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot

    server_name  mydomain.com;
    root /var/www/public;
    index index.php index.html index.htm;

    location ~ /sub/(.*)$ {
        nchan_subscriber;
        #nchan_authorize_request /_nchan_auth;
        nchan_channel_id "$1";
        nchan_channel_id_split_delimiter ",";
        nchan_subscriber_first_message oldest;
    }
}

NGINX Version: nginx/1.17.10 (CentOS)

I think this might have something to do with the fact that we've migrated to a completely new server with a fresh new install of NGINX and NCHAN. The issue seems to only happen after first subscribing to a channel using a ?last_event_id= query parameter and then trying to publish a message on that channel. I suspect we're sending event IDs saved from the OLD server, which do not exist at all in the new server's NCHAN store.
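
The suspected trigger could be scripted roughly as follows. This is only a sketch under stated assumptions: the `/sub/` and `/pub/` paths and hosts are the ones from the config above, the `timestamp:tag` shape of the stale ID is assumed rather than confirmed, and `build_subscribe_url` and `publish` are hypothetical helper names. It uses only the Python standard library.

```python
import socket
import urllib.error
import urllib.parse
import urllib.request


def build_subscribe_url(base_url, channel, last_event_id):
    """Build a /sub/<channel> URL carrying a (possibly stale) last_event_id."""
    path = urllib.parse.quote(channel, safe=",")  # keep "," as the channel delimiter
    query = urllib.parse.urlencode({"last_event_id": last_event_id})
    return f"{base_url}/sub/{path}?{query}"


def publish(base_url, channel, data, timeout=5.0):
    """POST to /pub/<channel>; return the HTTP status, or None if the request times out."""
    path = urllib.parse.quote(channel, safe=",")
    req = urllib.request.Request(
        f"{base_url}/pub/{path}", data=data.encode("utf-8"), method="POST"
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except (socket.timeout, TimeoutError):
        return None  # read timed out: the hang described in this issue
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return None  # connect timed out
        raise


# Hypothetical stale ID, assumed to be in "timestamp:tag" form.
stale_url = build_subscribe_url("https://mydomain.com", "test", "1594072800:0")
```

To try it, open `stale_url` with any subscriber client, then call `publish("http://127.0.0.1:8080", "test", "testing")`. A `None` return would correspond to the curl request above that had to be interrupted with CTRL+C.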

Do you think this could lead to the issue I'm describing? I can't imagine that's really it, as that would mean the whole pub/sub system could be brought down by one bad subscription request. Any thoughts?

gazugafan, Jul 06 '20 22:07

I've got the same issue. Does anyone have any thoughts?

himulawang, Aug 17 '21 06:08

I've since migrated from NCHAN to Centrifugo, which hasn't given me any such trouble. It was a mostly painless migration. Pretty close to a drop-in replacement. https://github.com/centrifugal/centrifugo

gazugafan, Aug 18 '21 03:08

... it was missing one feature, which I've decided to live without for now. Looks like they're implementing it, though! https://github.com/centrifugal/centrifugo/issues/446

gazugafan, Aug 18 '21 03:08

Same issue. We pass a last_event_id that does not exist because we want to start BEFORE our first known message ID, so we can get all the data from the start. If you pass the first actual last_event_id, it skips over that message. But doing this causes channels to eventually lock up, while some other channels keep going. Doing this also produces a warning: "Missed message for websocket subscriber". Please fix.

tpneumat, Jan 12 '22 18:01