
Workers stop heartbeating and lose connection to broker

Open alugowski opened this issue 5 years ago • 16 comments

Problem

My workflow is about 2000 tasks that take about an hour each on an 8-core VM. Over time, the workers disconnect from the broker, stop fetching new tasks, and go idle. Once one worker does this, the others follow immediately.

With RabbitMQ this would happen after roughly three hours. With Redis it happens about once a day on this workload.

I'm using 17 workers; the more workers I run, the more frequently this happens.

I also have Flower set up, and it continues working normally. If I notice a worker die, I can watch the others go idle in real time on the Flower dashboard.

Restarting workers has no effect. Restarting worker containers individually has no effect. Killing the worker containers, waiting a minute, and restarting them has no effect. Recovery means killing everything, restarting the cluster, and reseeding it with tasks. Luckily I have an external task management system that makes this possible.

The full tail of logs from one worker is in the comment below, but the gist looks like this:

2018-09-02 21:13:30,126 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (0/20) now.
2018-09-02 21:13:30,127 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (1/20) in 1.00 second.
....
2018-09-02 21:13:49,173 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (0/20) now.
2018-09-02 21:13:49,173 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (1/20) in 1.00 second.
....
2018-09-02 21:14:27,271 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-t4c4z
2018-09-02 21:14:27,271 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-v5rgr
2018-09-02 21:14:27,271 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-bnlx7
.....
2018-09-02 21:14:27,274 state.py:113 - _warn_drift() WARNING: Substantial drift from celery@celery-controller-vw62p may mean clocks are out of sync. Current drift is 18 seconds. [orig: 2018-09-02 21:14:27.274446 recv: 2018-09-02 21:14:09.920693]

My setup

Celery 4.2.1 running in Kubernetes Engine on Ubuntu base images. Same results with Celery 4.1.1.

Currently using a Redis 4 broker. Same results with RabbitMQ 3.6 and 3.7. Contrary to other reports, in my case Redis is about an order of magnitude more reliable than RabbitMQ.

I'm using --concurrency=1. My tasks do their own threading.

I tried a few arguments out of desperation; neither seems to have any effect (a sketch of the invocation follows the list):

  • -Ofair. A Stack Overflow post suggested this as a way to work around possible scheduling issues.
  • --max-tasks-per-child=7. This does appear to make workers restart, without fixing the core issue, but note that Flower shows succeeded-task counts over 20. So honestly I'm not sure what this setting is doing.
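
To be concrete, this is roughly the invocation I'm describing; the app name and log level here are placeholders, not my real module:

celery worker --app=myapp
              --loglevel=INFO
              --concurrency=1
              -Ofair
              --max-tasks-per-child=7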

alugowski avatar Sep 05 '18 17:09 alugowski

Full tail of logs from one worker as it was going idle:

2018-09-02 21:13:30,126 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (0/20) now.
2018-09-02 21:13:30,127 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (1/20) in 1.00 second.
2018-09-02 21:13:31,129 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (2/20) in 1.00 second.
2018-09-02 21:13:32,131 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (3/20) in 1.00 second.
2018-09-02 21:13:33,133 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (4/20) in 1.00 second.
2018-09-02 21:13:34,136 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (5/20) in 1.00 second.
2018-09-02 21:13:35,138 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (6/20) in 1.00 second.
2018-09-02 21:13:36,140 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (7/20) in 1.00 second.
2018-09-02 21:13:37,143 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (8/20) in 1.00 second.
2018-09-02 21:13:38,145 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (9/20) in 1.00 second.
2018-09-02 21:13:39,147 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (10/20) in 1.00 second.
2018-09-02 21:13:40,150 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (11/20) in 1.00 second.
2018-09-02 21:13:41,152 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (12/20) in 1.00 second.
2018-09-02 21:13:42,154 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (13/20) in 1.00 second.
2018-09-02 21:13:43,157 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (14/20) in 1.00 second.
2018-09-02 21:13:44,159 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (15/20) in 1.00 second.
2018-09-02 21:13:45,161 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (16/20) in 1.00 second.
2018-09-02 21:13:46,163 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (17/20) in 1.00 second.
2018-09-02 21:13:47,166 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (18/20) in 1.00 second.
2018-09-02 21:13:48,168 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (19/20) in 1.00 second.
2018-09-02 21:13:49,173 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (0/20) now.
2018-09-02 21:13:49,173 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (1/20) in 1.00 second.
2018-09-02 21:13:50,176 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (2/20) in 1.00 second.
2018-09-02 21:13:51,178 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (3/20) in 1.00 second.
2018-09-02 21:13:52,180 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (4/20) in 1.00 second.
2018-09-02 21:13:53,183 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (5/20) in 1.00 second.
2018-09-02 21:13:54,185 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (6/20) in 1.00 second.
2018-09-02 21:13:55,188 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (7/20) in 1.00 second.
2018-09-02 21:13:56,190 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (8/20) in 1.00 second.
2018-09-02 21:13:57,193 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (9/20) in 1.00 second.
2018-09-02 21:13:58,195 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (10/20) in 1.00 second.
2018-09-02 21:13:59,198 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (11/20) in 1.00 second.
2018-09-02 21:14:00,200 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (12/20) in 1.00 second.
2018-09-02 21:14:01,203 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (13/20) in 1.00 second.
2018-09-02 21:14:02,205 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (14/20) in 1.00 second.
2018-09-02 21:14:02,470 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-k2lnb
2018-09-02 21:14:03,208 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (15/20) in 1.00 second.
2018-09-02 21:14:04,210 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (16/20) in 1.00 second.
2018-09-02 21:14:05,212 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (17/20) in 1.00 second.
2018-09-02 21:14:06,215 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (18/20) in 1.00 second.
2018-09-02 21:14:07,217 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (19/20) in 1.00 second.
2018-09-02 21:14:08,223 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (0/20) now.
2018-09-02 21:14:08,224 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (1/20) in 1.00 second.
2018-09-02 21:14:09,226 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (2/20) in 1.00 second.
2018-09-02 21:14:10,228 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (3/20) in 1.00 second.
2018-09-02 21:14:11,231 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (4/20) in 1.00 second.
2018-09-02 21:14:12,233 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (5/20) in 1.00 second.
2018-09-02 21:14:13,235 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (6/20) in 1.00 second.
2018-09-02 21:14:14,238 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (7/20) in 1.00 second.
2018-09-02 21:14:15,240 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (8/20) in 1.00 second.
2018-09-02 21:14:16,242 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (9/20) in 1.00 second.
2018-09-02 21:14:17,245 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (10/20) in 1.00 second.
2018-09-02 21:14:18,247 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (11/20) in 1.00 second.
2018-09-02 21:14:19,249 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (12/20) in 1.00 second.
2018-09-02 21:14:20,252 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (13/20) in 1.00 second.
2018-09-02 21:14:21,254 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (14/20) in 1.00 second.
2018-09-02 21:14:22,256 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (15/20) in 1.00 second.
2018-09-02 21:14:23,259 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (16/20) in 1.00 second.
2018-09-02 21:14:24,261 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (17/20) in 1.00 second.
2018-09-02 21:14:25,264 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (18/20) in 1.00 second.
2018-09-02 21:14:26,266 redis.py:285 - on_connection_error() ERROR: Connection to Redis lost: Retry (19/20) in 1.00 second.
Pool callback raised exception: ConnectionError('Error 111 connecting to localhost:6379. Connection refused.',)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 484, in connect
    sock = self._connect()
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 541, in _connect
    raise err
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 529, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/redis/client.py", line 2879, in execute
    return execute(conn, stack, raise_on_error)
  File "/usr/local/lib/python3.6/dist-packages/redis/client.py", line 2749, in _execute_transaction
    connection.send_packed_command(all_cmds)
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 489, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to localhost:6379. Connection refused.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 484, in connect
    sock = self._connect()
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 541, in _connect
    raise err
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 529, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/billiard/pool.py", line 1747, in safe_apply_callback
    fun(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/celery/worker/request.py", line 367, in on_failure
    self.id, exc, request=self, store_result=self.store_errors,
  File "/usr/local/lib/python3.6/dist-packages/celery/backends/base.py", line 157, in mark_as_failure
    traceback=traceback, request=request)
  File "/usr/local/lib/python3.6/dist-packages/celery/backends/base.py", line 322, in store_result
    request=request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/celery/backends/base.py", line 664, in _store_result
    self.set(self.get_key_for_task(task_id), self.encode(meta))
  File "/usr/local/lib/python3.6/dist-packages/celery/backends/redis.py", line 289, in set
    return self.ensure(self._set, (key, value), **retry_policy)
  File "/usr/local/lib/python3.6/dist-packages/celery/backends/redis.py", line 279, in ensure
    **retry_policy)
  File "/usr/local/lib/python3.6/dist-packages/kombu/utils/functional.py", line 332, in retry_over_time
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/celery/backends/redis.py", line 298, in _set
    pipe.execute()
  File "/usr/local/lib/python3.6/dist-packages/redis/client.py", line 2894, in execute
    return execute(conn, stack, raise_on_error)
  File "/usr/local/lib/python3.6/dist-packages/redis/client.py", line 2749, in _execute_transaction
    connection.send_packed_command(all_cmds)
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 489, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to localhost:6379. Connection refused.
2018-09-02 21:14:27,271 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-xlcnd
2018-09-02 21:14:27,271 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-v5rgr
2018-09-02 21:14:27,271 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-t4c4z
2018-09-02 21:14:27,271 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-bnlx7
2018-09-02 21:14:27,271 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-vhjt9
2018-09-02 21:14:27,271 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-pz55n
2018-09-02 21:14:27,271 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-sg7g5
2018-09-02 21:14:27,272 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-9xjlt
2018-09-02 21:14:27,272 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-95sd5
2018-09-02 21:14:27,272 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-852hl
2018-09-02 21:14:27,272 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-f24mj
2018-09-02 21:14:27,272 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-sc89j
2018-09-02 21:14:27,272 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-nfswr
2018-09-02 21:14:27,272 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-9fjjk
2018-09-02 21:14:27,272 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-vw62p
2018-09-02 21:14:27,274 state.py:113 - _warn_drift() WARNING: Substantial drift from celery@celery-controller-vw62p may mean clocks are out of sync. Current drift is 18 seconds. [orig: 2018-09-02 21:14:27.274446 recv: 2018-09-02 21:14:09.920693]
2018-09-02 21:19:52,339 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-bnlx7
2018-09-02 21:23:47,391 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-f24mj
2018-09-02 21:24:37,403 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-nfswr
2018-09-02 21:29:42,462 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-t4c4z
2018-09-02 21:31:52,551 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-xlcnd
2018-09-02 21:32:32,561 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-sc89j
2018-09-02 21:37:07,649 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-vhjt9
2018-09-02 21:40:27,686 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-v5rgr
2018-09-02 21:44:22,760 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-sg7g5
2018-09-02 21:44:27,761 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-852hl
2018-09-02 21:52:22,875 gossip.py:147 - on_node_lost() INFO: missed heartbeat from celery@celery-controller-pz55n

alugowski avatar Sep 05 '18 17:09 alugowski

I've seen other issues that show a similar pattern: high concurrency causes disconnected workers. Others also reference heartbeat issues.

The high concurrency and the intermittent nature of the problem make this sound like a race condition, most likely in the heartbeat mechanism.

alugowski avatar Sep 05 '18 17:09 alugowski

I would love a way to turn heartbeats off. My use case of long-running tasks doesn't benefit from them; a simple dead-man's switch to the broker should be enough. Even without that, just manually redoing unfinished tasks once at the end would be fine.
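
To be concrete, what I have in mind is being able to run something like the following (the app name is a placeholder, and I haven't verified that these flags actually prevent the stalls):

celery worker --app=myapp
              --concurrency=1
              --without-heartbeat
              --without-gossip
              --without-mingle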

alugowski avatar Sep 05 '18 17:09 alugowski

Following are the last logs from one of my workers, which keeps stalling every ~6 hours. The other workers are active, consuming and executing tasks without any issues. All the workers are hosted on a single EC2 instance and connect to an ElastiCache Redis node. They all use different queues and exchanges configured via settings.

[2018-09-05 10:36:30,007: INFO/MainProcess] missed heartbeat from worker1@ip-######
[2018-09-05 10:36:30,007: INFO/MainProcess] missed heartbeat from worker2@ip-######
[2018-09-05 10:36:30,007: INFO/MainProcess] missed heartbeat from worker3@ip-######
[2018-09-05 10:36:30,007: INFO/MainProcess] missed heartbeat from worker4@ip-######

Celery worker configuration

/opt/python/run/venv/bin/celery worker
                  --app=myapp.celery_green
                  --queues=queue5
                  --loglevel=INFO
                  --pool=gevent
                  --concurrency=100
                  --hostname=worker5@%%h
                  --task-events
                  -Ofair

where celery_green is nothing but a gevent-patched Celery app module, and all the other workers use the same configuration except for different queues and hostnames.

Python 3.6, Celery 4.2.0 (tried 4.2.1 too), ElastiCache Redis 4 broker.

babus avatar Sep 05 '18 18:09 babus

@babu-upcomer Are you also seeing a clock drift log message?

alugowski avatar Sep 05 '18 19:09 alugowski

@alugowski No, I am not seeing such messages. I have added --without-gossip to see if it happens again.

babus avatar Sep 06 '18 10:09 babus

Update: adding --without-gossip doesn't help. @alugowski Got any update? Did you try --without-mingle so that we can avoid the heartbeat? Or, if it is about concurrency, does it improve when the number of concurrent tasks is lowered?

babus avatar Sep 07 '18 19:09 babus

@babu-upcomer I'm trying --without-mingle now. Immediately upon startup I see missed-heartbeat messages, so I don't know whether this setting works. I also see many Redis disconnection and timeout messages, but eventually it seems to pick up a task. Let's see if it continues to function over time.

I'm using --concurrency=1 with my own threading.

alugowski avatar Sep 07 '18 21:09 alugowski

@alugowski Try strace -p on your main worker process PID.
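
Roughly like this; the PID below is a placeholder for whatever your worker's main process is:

ps aux | grep 'celery worker'                            # find the main process PID
strace -p <main-process-pid> -f -tt -e trace=network     # follow forks, timestamp, show network syscalls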

In my similar case I see:

open("/proc/loadavg", O_RDONLY) = 22 read(22, "2.98 2.67 3.03 7/3453 20895\n", 64) = 28 close(22) = 0 clock_gettime(CLOCK_REALTIME, {1542964748, 327718695}) = 0 *getrandom("w}~\360\3X\356\30\216u\361D>\r}0", 16, 0) = 16 sendto(18, "3\r\n$7\r\nPUBLISH\r\n$28\r\n/4.celerye"..., 820, 0, NULL, 0) = 820 recvfrom(18, 0x2afcd60, 65536, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) epoll_ctl(4, EPOLL_CTL_ADD, 18, {EPOLLIN, {u32=18, u64=12884901906}}) = -1 EEXIST (File exists) clock_gettime(CLOCK_MONOTONIC, {2185098, 971929426}) = 0

Some folks say the Redis socket blocks inside the Celery process.

yuklin avatar Nov 23 '18 10:11 yuklin

Hi there. I ran into the same problem with Celery + RabbitMQ, which kept me stuck for a whole day. I tried the three options "--without-gossip --without-mingle --without-heartbeat" together and it actually worked on my machines. Afterwards I found out that the heartbeat between Celery workers is transmitted by the broker (RabbitMQ in my case, Redis in yours). If a worker fails to receive the others' heartbeats a few times in a row, it assumes there is a connection issue between itself and the broker and tries to reconnect.

I didn't figure out why the workers miss heartbeats; it could be caused by the broker not transmitting them. But --without-heartbeat actually worked here.

zyn-sec avatar Jul 03 '19 09:07 zyn-sec

Was there any resolution on this other than adding --without-heartbeat? Currently hitting this issue with Celery workers in EKS + ElastiCache (Redis).

rahulsmehta avatar Mar 12 '21 16:03 rahulsmehta

I am having the same problem with ElastiCache Redis: the worker randomly stops receiving tasks. My worker is started with the following args: celery -A myapp worker -l info --without-heartbeat

farahats9 avatar Sep 06 '21 14:09 farahats9

I faced the same issue, and I want to understand whether there is any negative impact if we disable the heartbeat with --without-heartbeat.

ericazhao-china avatar Dec 29 '22 10:12 ericazhao-china

I have tried --without-gossip --without-mingle --without-heartbeat, which does not work in Kubernetes. RabbitMQ still kicks out my worker. This is so frustrating.

SuperMasterBlasterLaser avatar Jan 27 '23 15:01 SuperMasterBlasterLaser

Any updates?

WellingtonNico avatar Mar 17 '23 20:03 WellingtonNico

I'd like to add that we were not experiencing this issue until we switched to using gevent.

We're using Redis.

otterbotter avatar Jul 19 '23 14:07 otterbotter