machinery
When a worker pod is killed, there is no mechanism for retrying the task
We're using the Redis broker with the DynamoDB backend, and we've noticed that when a worker pod is terminated ungracefully while a task is still running, the task stays in the STARTED state. It seems that Machinery has no timeout after which tasks that have been in the STARTED state for a long time are re-queued. This seems like a critical feature for fault tolerance.
I face the same issue here.
Are there any updates or workarounds for this?