kombu
ETA tasks lost on worker TERM restart
I have clusters of workers listening to a Redis server. When I restart the workers (using SaltStack), the running tasks complete execution since it is a TERM signal, but all the scheduled tasks (with an ETA set) are lost.
Do you have any steps to reproduce this?
What does the worker log say at shutdown?
I have around 20 instances running as a cluster, consuming from a Redis server located on a separate instance. I use SaltStack to deploy things on these 20 instances, and worker restarts happen along with the deploys.
This is the Celery log from one instance:
[2015-10-27 11:25:11,889: INFO/MainProcess] mingle: searching for neighbors
[2015-10-27 11:25:12,896: INFO/MainProcess] mingle: all alone
It seems the instances do not sync with each other, since the worker restarts happen at the same point in time.
The problem is that when we restart a worker during the execution of a task, all the scheduled (ETA) tasks are lost from the queue.
I use supervisor to daemonize the worker process.
For example, Test1 and Test2 are two consumer instances of a queue. I have added two tasks: T1 (countdown=60) and T2 (countdown=86400).
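The two tasks are enqueued roughly like this (the task module and name are illustrative, not the actual code):

from tasks import some_task  # illustrative task module and name

t1 = some_task.apply_async(countdown=60)     # T1: due in one minute
t2 = some_task.apply_async(countdown=86400)  # T2: due in 24 hours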
I just restarted the worker using SaltStack; now both tasks are on Test2.
worker: Warm shutdown (MainProcess)
[2015-10-27 11:25:09,598: WARNING/MainProcess] Restoring 1 unacknowledged message(s).
[2015-10-27 11:25:11,861: WARNING/MainProcess] /usr/local/lib/python2.7/dist-packages/celery/apps/worker.py:161: CDeprecationWarning: Starting from version 3.2 Celery will refuse to accept pickle by default.
The pickle serializer is a security concern as it may give attackers the ability to execute any command. It's important to secure your broker from unauthorized access when using pickle, so we think that enabling pickle should require a deliberate action and not be the default choice.
If you depend on pickle then you should set a setting to disable this warning and to be sure that everything will continue working when you upgrade to Celery 3.2::
CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']
You must only enable the serializers that you will actually use.
warnings.warn(CDeprecationWarning(W_PICKLE_DEPRECATED))
[2015-10-27 11:25:11,881: INFO/MainProcess] Connected to redis://10.83.73.169:9875/0
[2015-10-27 11:25:11,889: INFO/MainProcess] mingle: searching for neighbors
[2015-10-27 11:25:12,896: INFO/MainProcess] mingle: all alone
[2015-10-27 11:25:12,913: WARNING/MainProcess] [email protected] ready.
[2015-10-27 11:25:13,127: INFO/MainProcess] Received task: taskXXX.Task_XXX[134feb23-b82f-4cd2-8974-2a6b150a9386] eta:[2016-02-04 12:25:58.399643+00:00]
[2015-10-27 11:25:13,129: INFO/MainProcess] Received task: task_XXX.Task_XXX[80c608c3-b7fa-4318-8df9-73e435df514d] eta:[2015-10-27 11:26:19.881562+00:00]
I restarted supervisor while task T1 (scheduled at 2015-10-27 11:26:19.881562+00:00) was running.
The running task finished its execution, as I can see in my database since it updates some of my tables, but task T2 (scheduled to run at 2016-02-04 12:25:58.399643+00:00) is not found in the queue (not restored by either Test1 or Test2).
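One quick way to check whether T2 ever made it back to the broker is to look at the queue list directly in Redis, using the broker settings from the log above and assuming the default queue name celery:

redis-cli -h 10.83.73.169 -p 9875 -n 0 llen celery

If the message had been restored by the warm shutdown, the list length would go up by one.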
Please start the worker with -l debug to get more logging.
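For example (yourapp is a placeholder for your application module):

celery -A yourapp worker -l debug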
Same here in 2019
Having this issue with RabbitMQ as well
Which celery and kombu versions are you on?
Also having this issue with a task whose ETA is scheduled to occur after a process restart.
kombu==4.6.4
celery==4.3.0
@auvipy so just to check, is this solved in some version? Or is it an ongoing issue?
I don't know! Someone needs to verify it in production using the latest development branch, but most probably not.
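For anyone who wants to try, the development versions can be installed straight from GitHub with something like (the default branch names may differ):

pip install git+https://github.com/celery/celery.git
pip install git+https://github.com/celery/kombu.git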
I’m observing the same issue in production with a very simple setup: single worker, single task, Redis backend, unlimited auto-retry, acks_late. Versions are:
kombu==5.2.4 celery==5.2.7
kombu said "Restoring xxx unacknowledged message(s)" in the logs, but the Redis DB is empty now.
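For context, the task is defined roughly like this (module, broker URL and body are simplified; the relevant bits are acks_late and the unlimited retries):

from celery import Celery

app = Celery("logcrm_event_bus", broker="redis://localhost:6379/0")

# acks_late so the message is only acknowledged after the task finishes;
# max_retries=None gives unlimited automatic retries on any exception.
@app.task(acks_late=True, autoretry_for=(Exception,), max_retries=None)
def send_event(payload, channel):
    ...  # forwards the event to the external event bus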
I can reproduce this very easily in my local dev setup (a Vagrant box), which might help to track it down.
In [9]: worker.APP.send_task("logcrm_event_bus.send_event", [{}, "impact-stack>test"], countdown=2*3600)
Out[9]: <AsyncResult: 63e995fe-012f-497d-a523-c02211e48362>
In [10]: i.scheduled()
Out[10]:
{'celery@impactbox': [{'eta': '2023-05-26T10:35:45.630970+00:00',
'priority': 6,
'request': {'id': '63e995fe-012f-497d-a523-c02211e48362',
'name': 'logcrm_event_bus.send_event',
'args': [{}, 'impact-stack>test'],
'kwargs': {},
'type': 'logcrm_event_bus.send_event',
'hostname': 'celery@impactbox',
'time_start': None,
'acknowledged': False,
'delivery_info': {'exchange': '',
'routing_key': 'celery',
'priority': 0,
'redelivered': None},
'worker_pid': None}}]}
In another shell I restart the worker service and the task is gone.
In [11]: i.scheduled()
Out[11]: {'celery@impactbox': []}
Here is the output of the celery worker service with debug log-level: systemctl-debug.txt
Here is the service file used to launch the worker: logcrm-event-bus.service. Maybe that’s where the problem lies.
I’ve tried reproducing this with the same setup but using RabbitMQ and the problem did not occur. Note that the service shuts down successfully in this case too (with the same service file).
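For that comparison, the only change was the broker URL, roughly like this (hosts and credentials are placeholders):

from celery import Celery

# Redis broker: the ETA task was gone after the warm shutdown
app = Celery("logcrm_event_bus", broker="redis://localhost:6379/0")

# RabbitMQ broker: the same ETA task survived the restart
# app = Celery("logcrm_event_bus", broker="amqp://guest:guest@localhost:5672//")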