kombu
Tasks lost in redis
My original question was posted in celery, but I believe this to be specifically related to kombu and the redis transport.
I have an issue where tasks (scheduled by celery to run in the future) are not being run. I believe the issue is that the messages in redis are being dropped. For example, I saw a message in my logs:
09a20a96-0c1e-478f-b620-4c9404e3c2fc sent to queue.
Looking in the `unacked` HASH in redis, I can see that a message with `"correlation_id": "09a20a96-0c1e-478f-b620-4c9404e3c2fc"` exists - everything is good.
Later in the day, however, looking in the same `unacked` HASH, I no longer see the message for 09a20a96-0c1e-478f-b620-4c9404e3c2fc.
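For reference, this is roughly how I check whether a given correlation_id is still present (a sketch using redis-py; the key name `unacked` is kombu's default, and the HASH values are treated as opaque JSON blobs rather than assuming a particular payload layout):

```python
import redis

r = redis.Redis()  # assumes the broker's default database


def find_unacked(correlation_id, unacked_key="unacked"):
    """Return the delivery tag of the unacked entry containing correlation_id, if any."""
    # The unacked HASH maps delivery_tag -> serialized message; a plain
    # substring search avoids depending on the exact JSON structure.
    for delivery_tag, raw in r.hgetall(unacked_key).items():
        if correlation_id.encode() in raw:
            return delivery_tag
    return None


print(find_unacked("09a20a96-0c1e-478f-b620-4c9404e3c2fc"))
```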
It should be mentioned that during the day the celery workers do get SIGTERM'd, and I see that several unacknowledged messages are restored into the `celery` queue in redis.
I believe the issue lies in the window when celery is shutting down and the messages from `unacked` are being rewritten to `celery`. I see two points of failure here (a sketch of the restore path follows the list):

- The messages from `unacked` -> `celery` are never committed.
- The messages from `unacked` -> `celery` are committed cleanly, but when celery boots up the messages from `celery` -> `unacked` are not committed.
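For context, my understanding of the restore path is roughly the following (a simplified sketch, not kombu's actual implementation; the key names and the assumption that each `unacked` value is a JSON list of `[message, exchange, routing_key]` are based on the transport's defaults and may differ by version):

```python
import json
import redis

r = redis.Redis()


def restore_unacked(unacked_key="unacked", unacked_index_key="unacked_index"):
    """Push every reserved message back onto its originating queue."""
    with r.pipeline() as pipe:
        for tag, raw in r.hgetall(unacked_key).items():
            message, exchange, routing_key = json.loads(raw)
            # For simple direct routing the routing key is the queue name,
            # e.g. "celery".
            pipe.rpush(routing_key, json.dumps(message))
            pipe.hdel(unacked_key, tag)
            pipe.zrem(unacked_index_key, tag)
        # Failure point 1 above: if the worker dies before execute(),
        # nothing has been moved back to the queue.
        pipe.execute()
```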
BTW, I am seeing `WorkerLostError` errors and `verify_process_alive` errors from `celery.concurrency.asynpool` in my logs, but I'm not sure whether they're related to my question here.
It is a known issue that the worker may lose up to one message if abruptly terminated.
With ack emulation it will reserve one message and then add it to the backup hash, but these operations happen in different transactions. I don't think it can lose more than one message, and it will only do so if 1) shutdown does not complete, or 2) the redis server goes offline before the second operation completes.
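Conceptually the reserve step is ordered like this (a sketch of the ordering only, not the actual transport code; the key names and the assumption that the queued payload carries `properties.delivery_tag` are illustrative):

```python
import json
import time
import redis

r = redis.Redis()


def reserve_one(queue="celery", unacked_key="unacked",
                unacked_index_key="unacked_index"):
    """Illustrates the ordering: pop first, record in the backup hash second."""
    # Step 1: the message leaves the queue.
    popped = r.brpop(queue, timeout=1)
    if popped is None:
        return None
    _, message = popped
    delivery_tag = json.loads(message)["properties"]["delivery_tag"]

    # Window: if the process dies or redis becomes unreachable here,
    # the message exists in neither the queue nor the backup hash.

    # Step 2: the message is recorded as unacked, in a separate transaction.
    with r.pipeline() as pipe:
        pipe.zadd(unacked_index_key, {delivery_tag: time.time()})
        pipe.hset(unacked_key, delivery_tag, message)
        pipe.execute()
    return message
```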
I imagine there is a solution to this problem, but I don't have much time to work on the redis transport. There is a command in the redis API that is designed for this problem (http://redis.io/commands/rpoplpush), but it's useless for us as it does not let us consume from multiple keys.
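For reference, RPOPLPUSH does make the pop-and-record step atomic, but only for a single source key, which is why it doesn't fit a consumer watching several queues at once (a sketch with redis-py; the `celery.processing` backup key is hypothetical):

```python
import redis

r = redis.Redis()

# Atomic for one queue: the message is moved to the processing list in a
# single server-side operation, so it is never "in flight" nowhere.
message = r.rpoplpush("celery", "celery.processing")

# A worker consuming from several queues (e.g. "celery", "high", "low") would
# need one RPOPLPUSH per key, losing the blocking wait that BRPOP provides
# across many keys as well as the single atomic hand-off.
```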
I hope this bug will be fixed soon... Any long-running task will be rescheduled to run again if the worker gets a shutdown signal before the task finishes. After a graceful shutdown the task goes back to the queue because it was not acknowledged...
May I know if there are any updates on this issue?