
ETA tasks lost on worker TERM restart

Open safwanebrahim-uc opened this issue 9 years ago • 13 comments

I have a cluster of workers listening to a Redis server. When I restart a worker (using SaltStack), the running tasks complete their execution since it's a TERM signal, but all the scheduled tasks (with an ETA set) are lost.

safwanebrahim-uc avatar Oct 27 '15 11:10 safwanebrahim-uc

Do you have any steps to reproduce this?

What does the worker log say at shutdown?

ask avatar Oct 27 '15 17:10 ask

I have around 20 instances running as a cluster, consuming from a Redis server located on a separate instance. I use SaltStack to deploy to these 20 instances, and worker restarts happen along with deploys.

Here is the Celery log from one instance:

[2015-10-27 11:25:11,889: INFO/MainProcess] mingle: searching for neighbors
[2015-10-27 11:25:12,896: INFO/MainProcess] mingle: all alone

It seems the instances do not sync with each other, since the worker restarts happen at the same point in time.

The problem is that when we restart a worker during the execution of a task, all the scheduled tasks are lost from the queue.

I use Supervisor to daemonize the worker process.

For example: Test1 and Test2 are two consumer instances of a queue. I have added two tasks, T1 (countdown=60) and T2 (countdown=86400), roughly as in the sketch below.
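A minimal sketch of how two such tasks might be enqueued (the task name add is illustrative, not the actual task from my project):

# Hypothetical reproduction: two ETA tasks created via countdown (in seconds).
# `add` stands in for any registered Celery task.
t1 = add.apply_async(args=(1, 2), countdown=60)     # T1: due in 1 minute
t2 = add.apply_async(args=(3, 4), countdown=86400)  # T2: due in 24 hours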

I just restarted the worker using SaltStack, and now both tasks are there on Test2:


worker: Warm shutdown (MainProcess)
[2015-10-27 11:25:09,598: WARNING/MainProcess] Restoring 1 unacknowledged message(s).
[2015-10-27 11:25:11,861: WARNING/MainProcess] /usr/local/lib/python2.7/dist-packages/celery/apps/worker.py:161: CDeprecationWarning: Starting from version 3.2 Celery will refuse to accept pickle by default.

The pickle serializer is a security concern as it may give attackers the ability to execute any command. It's important to secure your broker from unauthorized access when using pickle, so we think that enabling pickle should require a deliberate action and not be the default choice.

If you depend on pickle then you should set a setting to disable this warning and to be sure that everything will continue working when you upgrade to Celery 3.2::

CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']

You must only enable the serializers that you will actually use.

warnings.warn(CDeprecationWarning(W_PICKLE_DEPRECATED))
[2015-10-27 11:25:11,881: INFO/MainProcess] Connected to redis://10.83.73.169:9875/0
[2015-10-27 11:25:11,889: INFO/MainProcess] mingle: searching for neighbors
[2015-10-27 11:25:12,896: INFO/MainProcess] mingle: all alone
[2015-10-27 11:25:12,913: WARNING/MainProcess] [email protected] ready.
[2015-10-27 11:25:13,127: INFO/MainProcess] Received task: taskXXX.Task_XXX[134feb23-b82f-4cd2-8974-2a6b150a9386] eta:[2016-02-04 12:25:58.399643+00:00]
[2015-10-27 11:25:13,129: INFO/MainProcess] Received task: task_XXX.Task_XXX[80c608c3-b7fa-4318-8df9-73e435df514d] eta:[2015-10-27 11:26:19.881562+00:00]


I restarted Supervisor while task T1 (scheduled at 2015-10-27 11:26:19.881562+00:00) was running.

The running task finished its execution, as I can see from my database (it updates some of my tables), but task T2 (scheduled to run at 2016-02-04 12:25:58.399643+00:00) is not found in the queue (not restored by either Test1 or Test2).

safwanebrahim-uc avatar Oct 28 '15 13:10 safwanebrahim-uc

Please start the worker with -l debug to get more logging.
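For example (proj is a placeholder for your app module):

celery -A proj worker -l debug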

ask avatar Oct 28 '15 19:10 ask

Same here in 2019

scythargon avatar Dec 23 '19 20:12 scythargon

Having this issue with RabbitMQ as well

nathanielastudillo avatar Jan 21 '20 01:01 nathanielastudillo

celery and kombu versions?

auvipy avatar Jan 28 '20 13:01 auvipy

Also having this issue with a task whose ETA is scheduled to occur after a process restart.

kombu==4.6.4
celery==4.3.0

mkuchen avatar Mar 06 '20 16:03 mkuchen

@auvipy so just to check, is this solved in some version? Or is it an ongoing issue?

gdvalderrama avatar Dec 12 '22 11:12 gdvalderrama

I don't know! We need someone to verify it in production using the latest development branch, but most probably not.

auvipy avatar Dec 14 '22 12:12 auvipy

I’m observing the same issue in production with a very simple setup: single worker, single task, Redis backend, unlimited auto-retry, acks_late. Versions are:

kombu==5.2.4
celery==5.2.7
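For concreteness, a minimal sketch of that setup (module, URL, and task names are illustrative, not from my actual project):

from celery import Celery

# Hypothetical app mirroring the setup above: Redis broker/backend,
# late acks, unlimited auto-retry.
app = Celery("proj", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")
app.conf.task_acks_late = True  # acknowledge only after the task returns

@app.task(bind=True, acks_late=True,
          autoretry_for=(Exception,), max_retries=None)  # None = retry forever
def send_event(self, payload):
    ...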

kombu said

Restoring xxx unacknowledged message(s)

in the logs. But the Redis DB is empty now.
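One way to verify this might be to look at the kombu Redis transport's bookkeeping directly (the key names below are kombu's defaults and the URL is illustrative; adjust them if unacked_key or the queue name is customized):

import redis

r = redis.Redis.from_url("redis://localhost:6379/0")

# kombu's Redis transport keeps unacked messages in a hash plus a sorted-set index.
print(r.hgetall("unacked"))                               # unacked payloads
print(r.zrange("unacked_index", 0, -1, withscores=True))  # delivery tags + timestamps
print(r.lrange("celery", 0, -1))                          # default queue contents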

torotil avatar May 15 '23 13:05 torotil

I can reproduce this very easily in my local dev setup (a Vagrant box), so this might help to track it down.

In [9]: worker.APP.send_task("logcrm_event_bus.send_event", [{}, "impact-stack>test"], countdown=2*3600)
Out[9]: <AsyncResult: 63e995fe-012f-497d-a523-c02211e48362>

In [10]: i.scheduled()
Out[10]: 
{'celery@impactbox': [{'eta': '2023-05-26T10:35:45.630970+00:00',
   'priority': 6,
   'request': {'id': '63e995fe-012f-497d-a523-c02211e48362',
    'name': 'logcrm_event_bus.send_event',
    'args': [{}, 'impact-stack>test'],
    'kwargs': {},
    'type': 'logcrm_event_bus.send_event',
    'hostname': 'celery@impactbox',
    'time_start': None,
    'acknowledged': False,
    'delivery_info': {'exchange': '',
     'routing_key': 'celery',
     'priority': 0,
     'redelivered': None},
    'worker_pid': None}}]}

In another shell I restart the worker service, and the task is gone:

In [11]: i.scheduled()
Out[11]: {'celery@impactbox': []}
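For reference, i in the session above is an inspect handle on the same app; a minimal sketch, assuming my worker.APP object:

# `i` is assumed to be a Celery inspect handle on the same app.
i = worker.APP.control.inspect()

# scheduled() lists ETA/countdown tasks the worker holds in memory
# (received but not yet due); these are the ones that vanish on restart.
i.scheduled()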

Here is the output of the celery worker service with debug log-level: systemctl-debug.txt

torotil avatar May 26 '23 08:05 torotil

Here is the service file used to launch the worker: logcrm-event-bus.service. Maybe that’s where the problem lies.

torotil avatar May 26 '23 13:05 torotil

journal-20230530-rabbitmq.txt

I’ve tried reproducing this with the same setup but using RabbitMQ, and the problem did not occur. Note that the service shuts down successfully in this case too (with the same service file).
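Possibly related: the Celery docs warn that with the Redis transport, redelivery of unacknowledged messages is governed by the visibility timeout, and ETA/countdown tasks scheduled further ahead than that window can misbehave. A config sketch (the value is illustrative):

# Redis-transport caveat from the Celery docs: the visibility timeout should
# exceed the longest ETA/countdown you plan to use (seconds; value illustrative).
app.conf.broker_transport_options = {"visibility_timeout": 7 * 86400}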

torotil avatar May 30 '23 15:05 torotil