docker tasks occasionally hang when worker goes down
We've seen docker service tasks sporadically get stuck in the 'preparing' or 'assigned' states indefinitely. This occurred during stress testing, with rapid task turnover in a multi-node swarm and occasional random worker shutdowns.
One thing that happens in those states is `docker pull`, which has a long history of sporadic hanging problems, especially in the code path that coalesces concurrent requests for the same image on a host. And/or maybe Swarm doesn't recover properly if a worker node dies while tasks are waiting on an image pull or otherwise sitting in those states.
- [ ] try with latest Docker
- [ ] add our own configurable timeout on tasks remaining in these states
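The second checklist item could be sketched as a small helper that miniwdl's polling loop might call: given the task's current state and how long it has sat there, decide whether the configurable timeout has been exceeded. All names here (`STUCK_STATES`, `task_is_stuck`) are hypothetical, not miniwdl's actual API.

```python
import time

# Swarm task states that precede "running"; a task lingering in one of these
# past the timeout is presumed stuck (e.g. on a hung docker pull).
STUCK_STATES = {"new", "pending", "assigned", "accepted", "preparing", "starting"}

def task_is_stuck(state, entered_state_at, timeout, now=None):
    """True if the task has sat in a pre-running state longer than `timeout`.

    state            -- lowercase swarm task state string
    entered_state_at -- epoch seconds when we first observed this state
    timeout          -- configurable limit in seconds
    """
    if state not in STUCK_STATES:
        return False
    now = time.time() if now is None else now
    return (now - entered_state_at) > timeout
```

The caller would reset `entered_state_at` whenever the observed state changes, so the timeout applies per state rather than to the task's whole lifetime.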
Some issues specific to worker node shutdown:

- https://github.com/moby/moby/issues/34280
- https://github.com/moby/moby/issues/34122
Also notable: docker moves a task to the "orphaned" state only after a node has been down for 24h (!): https://github.com/docker/swarmkit/blob/ebe39a32e3ed4c3a3783a02c11cccf388818694c/manager/dispatcher/dispatcher.go#L50-L53
Possible fix #375. Per the above-linked moby issues, we're hoping the zombie tasks can be recognized by their terminal "desired state" even with a non-terminal (current) state.
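The zombie-recognition idea can be expressed as a small predicate: the manager's *desired* state for the task is terminal (it wants the task gone), but the *observed* state never progressed. The dict shape below mirrors the task objects returned by Docker's task-inspect API; the helper name and exact state set are illustrative.

```python
# Terminal swarm task states, per docker's task lifecycle.
TERMINAL_STATES = {"complete", "shutdown", "failed", "rejected", "orphaned", "remove"}

def is_zombie(task):
    """True if the manager wants this task terminated but its observed state
    is still non-terminal -- the signature of a task abandoned on a dead node."""
    desired = task.get("DesiredState", "").lower()
    current = task.get("Status", {}).get("State", "").lower()
    return desired in TERMINAL_STATES and current not in TERMINAL_STATES
```

A monitoring loop applying this check could fail the WDL task promptly instead of waiting out docker's own (24h) orphan timer.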
The stress tests pass with #375 now merged; however, bespoke timeout logic around the `docker pull`-equivalent step may still be a good idea in the future. It's possible our stress-test failures were also associated with transient Docker Hub performance problems, making the pull stage take much longer and thus more likely to coincide with the induced worker shutdowns.
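One way the bespoke pull timeout could look: run the (potentially hanging) pull call in a worker thread and abandon it after a deadline. `pull_fn` stands in for the real pull invocation; this is a sketch, not miniwdl's implementation, and note the abandoned thread may still complete the pull in the background.

```python
import concurrent.futures

def pull_with_timeout(pull_fn, timeout):
    """Run pull_fn() in a worker thread; raise TimeoutError after `timeout` seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(pull_fn)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            # The hung thread is abandoned; the docker daemon may finish the
            # pull anyway, so a retry could find the image already present.
            raise TimeoutError("image pull exceeded %.0fs" % timeout)
    finally:
        pool.shutdown(wait=False)
```

This doesn't kill the underlying pull (Python can't interrupt a blocked thread), but it lets the task runner fail fast and retry or surface an error instead of hanging indefinitely.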