
docker tasks occasionally hang when worker goes down

Open mlin opened this issue 4 years ago • 3 comments

We've seen docker service tasks sporadically get stuck in the 'preparing' or 'assigned' states indefinitely. This was observed under stress testing with rapid task turnover in a multi-node swarm and occasional random worker shutdowns.

One thing that happens in those states is the docker pull, which has a long history of sporadic hanging problems, especially the code path that coalesces concurrent requests for the same image on a host. And/or maybe Swarm doesn't recover properly if a worker node dies while tasks are waiting on an image pull or are otherwise in those states.

  • [ ] try with latest Docker
  • [ ] add our own configurable timeout on tasks remaining in these states (rough sketch after this list)
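
For the timeout idea, a minimal sketch of what the polling could look like with the docker Python SDK; the state set, function name, and `stuck_timeout` parameter are assumptions for illustration, not existing miniwdl configuration:

```python
import time
import docker

# Swarm task states a task may legitimately sit in for a while, but not forever.
STUCK_STATES = {"new", "pending", "assigned", "accepted", "preparing"}

def wait_for_task_start(service, stuck_timeout=600.0, poll_interval=2.0):
    """Poll a docker Swarm service until its task leaves the pre-running states,
    raising if it stays stuck longer than stuck_timeout seconds (sketch only)."""
    deadline = time.monotonic() + stuck_timeout
    while True:
        tasks = service.tasks()  # list of task dicts from the Swarm API
        states = [t["Status"]["State"] for t in tasks]
        if any(s not in STUCK_STATES for s in states):
            return states
        if time.monotonic() > deadline:
            raise RuntimeError(f"task stuck in {states} for over {stuck_timeout}s")
        time.sleep(poll_interval)

# usage sketch:
# client = docker.from_env()
# svc = client.services.create("ubuntu:20.04", command=["sleep", "30"])
# wait_for_task_start(svc)
```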

mlin · Apr 13 '20 21:04

Some issues specific to worker node shutdown:

https://github.com/moby/moby/issues/34280
https://github.com/moby/moby/issues/34122

Also notable: docker moves a task to the "orphaned" state only after a node has been down for 24h (!): https://github.com/docker/swarmkit/blob/ebe39a32e3ed4c3a3783a02c11cccf388818694c/manager/dispatcher/dispatcher.go#L50-L53
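
Since waiting 24h for Swarm to mark tasks "orphaned" isn't an option, one workaround would be to cross-reference each task's node against the node list ourselves. A rough sketch with the docker Python SDK; the function name and the down-node handling are illustrative assumptions, not existing code:

```python
import docker

def tasks_on_down_nodes(service):
    """Return tasks of a Swarm service whose assigned node is reported 'down'
    (illustrative sketch; not existing miniwdl code)."""
    client = docker.from_env()
    # Map node ID -> node status ("ready", "down", ...)
    node_state = {n.id: n.attrs["Status"]["State"] for n in client.nodes.list()}
    suspect = []
    for task in service.tasks():
        node_id = task.get("NodeID")
        # Tasks not yet assigned to a node have no NodeID
        if node_id and node_state.get(node_id) == "down":
            suspect.append(task)
    return suspect
```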

mlin · Apr 13 '20 21:04

Possible fix in #375. Per the above-linked moby issues, we're hoping the zombie tasks can be recognized by their terminal "desired state" even while their current state is still non-terminal.
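
In other words, the check would look roughly like the following (a sketch against the docker Python SDK's task dicts, assuming "shutdown"/"remove" as the terminal desired states; not the exact logic in #375):

```python
# Desired states in which Swarm no longer wants the task to run
TERMINAL_DESIRED_STATES = {"shutdown", "remove"}
# Observed states meaning the task has actually finished
TERMINAL_CURRENT_STATES = {"complete", "failed", "shutdown", "rejected", "orphaned", "remove"}

def is_zombie(task):
    """True if Swarm wants this task gone but its observed state never terminated,
    e.g. because its worker node died mid-pull (sketch only)."""
    desired = task.get("DesiredState")
    current = task["Status"]["State"]
    return desired in TERMINAL_DESIRED_STATES and current not in TERMINAL_CURRENT_STATES
```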

mlin · Apr 14 '20 00:04

The stress tests pass with #375 now merged; however, bespoke timeout logic around the docker pull equivalent step may still be a good idea in the future. It's possible our stress-test failures were also associated with transient Docker Hub performance problems, which made the pull stage take much longer and thus made it more likely to coincide with the induced worker shutdowns.
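
If we do add that, it could be as simple as running the pull in a worker thread and giving up after a deadline. A minimal sketch with the docker Python SDK (the helper name and default timeout are made up for illustration; note the hung pull is merely abandoned, since docker-py provides no way to cancel it):

```python
import concurrent.futures
import docker

def pull_with_timeout(image, timeout=900.0):
    """Run docker pull in a worker thread, raising if it exceeds `timeout` seconds.
    Sketch only: the underlying pull is not cancelled, just left behind."""
    client = docker.from_env()
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(client.images.pull, image)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        raise RuntimeError(f"docker pull of {image} exceeded {timeout}s")
    finally:
        # don't block on the (possibly hung) pull thread
        pool.shutdown(wait=False)
```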

mlin · Apr 14 '20 06:04