[Feature Request] Allow live-restore in swarm mode cluster.

Open BSWANG opened this issue 7 years ago • 16 comments

The docker daemon's live-restore option is not compatible with a swarm mode cluster; the error is "--live-restore daemon configuration is incompatible with swarm mode". I guess live-restore conflicts with how swarmkit orchestration judges node health and migrates a service's tasks: when a node's dockerd exits, the node's tasks are migrated. But live-restore is useful when upgrading the docker daemon in a production environment. So will live-restore be supported in a swarm mode cluster? Or can we specify the heartbeat timeout used to detect node failure, to prevent tasks from being migrated while a node is being upgraded?

BSWANG avatar Sep 20 '17 09:09 BSWANG
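
For readers unfamiliar with the option discussed above: live-restore is normally enabled with the "live-restore" key in the daemon configuration file (commonly /etc/docker/daemon.json on Linux). The following is a minimal sketch, not part of the original report, that checks whether a given configuration text enables it. The function name is hypothetical and the example string stands in for the file contents.

```python
import json


def live_restore_enabled(daemon_json_text: str) -> bool:
    """Return True if the daemon config text sets "live-restore" to true.

    Hypothetical helper for illustration; an empty file counts as "not set".
    """
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    return config.get("live-restore", False) is True


# Example config text; in practice this would be read from daemon.json.
example_config = '{"live-restore": true}'
print(live_restore_enabled(example_config))
```

A check like this could be run before joining a node to a swarm, since the daemon rejects the combination with the error quoted above.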

@nishanttotla Pls take a look

xianlubird avatar Oct 27 '17 02:10 xianlubird

Would require updating the agent and dispatcher to be able to live-restore containers. Not sure how difficult it would be. Adding some labels to track this, but not providing any answers on when it might get done. Y'all are certainly welcome to provide a pull request though.

dperny avatar Jan 16 '18 23:01 dperny

any updates on this?

TheAliAbbasi avatar Aug 19 '18 06:08 TheAliAbbasi

Any updates in 2019?

It would be an amazing feature in a production environment...

matze18888 avatar Jul 26 '19 08:07 matze18888

Now in 2020. Doesn't seem to be much activity here. Is there somewhere else where progress on this is tracked?

DrPyser avatar Jan 02 '20 17:01 DrPyser

This is likely a low priority. There's little motivation to add it, since most Swarm clusters automatically move containers off a node when you drain it for updates and reschedule them on other nodes.

braunsonm avatar Jan 19 '20 17:01 braunsonm
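
The drain workflow mentioned above uses the docker node update command with its --availability flag, run against a manager. As a rough sketch (the helper name and the node name "worker-1" are made up for illustration), the commands for draining a node before an upgrade and reactivating it afterwards could be built like this:

```python
import subprocess

# Availability values accepted by `docker node update --availability`.
VALID_AVAILABILITY = {"active", "pause", "drain"}


def node_availability_cmd(node: str, availability: str) -> list:
    """Build the argv for changing a swarm node's availability (hypothetical helper)."""
    if availability not in VALID_AVAILABILITY:
        raise ValueError(f"unknown availability: {availability!r}")
    return ["docker", "node", "update", "--availability", availability, node]


def run(cmd: list) -> None:
    """Execute the command; must be run on a swarm manager with docker installed."""
    subprocess.run(cmd, check=True)


# "worker-1" is a placeholder node name.
drain_cmd = node_availability_cmd("worker-1", "drain")
activate_cmd = node_availability_cmd("worker-1", "active")
print(" ".join(drain_cmd))
print(" ".join(activate_cmd))
```

Note that draining only helps services whose tasks are allowed to run elsewhere; it does nothing for standalone containers started with docker run.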

Swarm automatically moves tasks to other nodes. If you run containers outside of Swarm, using docker run, they will be terminated.

mhemrg avatar Jan 20 '20 06:01 mhemrg

Yes, and what if the docker daemon crashes? A few times, the docker daemon got stuck on some corrupted state from a container, and the only solution that worked, short of rebooting, was to restart the daemon. Without live-restore we can't do this without killing all our containers, right? Also consider that we are using placement constraints, so our services cannot just be moved to any node.

DrPyser avatar Jan 22 '20 19:01 DrPyser

And yeah, not all our containers are part of swarm services.

DrPyser avatar Jan 22 '20 19:01 DrPyser

The live-restore feature seems to me like a basic reliability requirement. Containers don't need the daemon to run, do they? So why does taking it down kill all containers? I don't fully grok the docker stack, but isn't dockerd responsible for the API server, delegating container management to containerd/runc/etc?

DrPyser avatar Jan 22 '20 20:01 DrPyser

@dperny @thaJeztah Do you guys have any plans for this feature? We really need this in our production setup. I'm not sure what the difficulties are, or why this feature was marked as incompatible with swarm in the first place, so it's hard for me to help with this issue.

mhemrg avatar Jan 23 '20 07:01 mhemrg

I think this would be difficult to support/add. Even though the containers could be kept running, other parts would still shut down when upgrading the daemon, which means that (e.g.) swarm managers won't be able to communicate with the worker during that time. As a result, the reconciliation loop would kick in, and managers would reschedule tasks to be deployed on other nodes.

Once the daemon comes back up, the containers that were kept running would still be shut down (because they've been rescheduled).

live-restore and swarm services both are addressing the same (kind of) problem (but in different ways); live-restore for a single, non-orchestrated system, and swarm services for orchestrated systems (providing high availability through "redundancy" / reconciliation)

thaJeztah avatar Jan 23 '20 09:01 thaJeztah
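
The sequence described in the comment above can be illustrated with a toy model. This is only a sketch under loud assumptions: it is not swarmkit's actual algorithm, and every name here (ToyManager, heartbeat_timeout, the node and task names) is hypothetical. It shows a manager that moves tasks off any node that has been silent for longer than a timeout, and then lists the tasks whose still-running containers would have to be shut down when the node returns.

```python
from dataclasses import dataclass, field


@dataclass
class ToyManager:
    # Seconds of silence before a node is treated as down (hypothetical parameter).
    heartbeat_timeout: float
    # Mapping of task name -> node name.
    placements: dict = field(default_factory=dict)

    def reconcile(self, silent_for: dict, spare_node: str) -> list:
        """Reassign tasks from nodes silent longer than the timeout; return moved tasks."""
        moved = []
        for task, node in list(self.placements.items()):
            if silent_for.get(node, 0.0) > self.heartbeat_timeout:
                self.placements[task] = spare_node
                moved.append(task)
        return moved


def orphaned_on_return(before: dict, after: dict, node: str) -> list:
    """Tasks that were on `node` before the outage but are now assigned elsewhere."""
    return [t for t, n in before.items() if n == node and after[t] != node]


manager = ToyManager(heartbeat_timeout=5.0,
                     placements={"web.1": "node-a", "db.1": "node-b"})
before = dict(manager.placements)

# node-a's daemon is down for 30 seconds while its containers keep running.
moved = manager.reconcile({"node-a": 30.0}, spare_node="node-c")
print(moved)
print(orphaned_on_return(before, manager.placements, "node-a"))
```

In this model, an outage shorter than heartbeat_timeout moves nothing, which is why a configurable timeout was suggested in the original request. Whether swarmkit could safely expose such a setting is a separate question that this sketch does not answer.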

@thaJeztah the scenario you're describing is still somewhat better than the current one. At least standalone containers will keep working as usual. And if the docker daemon restart happens fast enough, the swarm manager(s) could wait a bit before rescheduling. At worst, there is a short downtime for the swarm services being rescheduled (and no downtime for other containers). Undesirable, but acceptable in many situations.

DrPyser avatar Jan 24 '20 16:01 DrPyser

+1 from me. I don't need any of the orchestration and am basically using swarm just so containers on different hosts can be part of the same overlay network. I don't want these containers to die if my docker daemon crashes.

richiereynolds avatar Apr 28 '22 10:04 richiereynolds

@richiereynolds same here. Is there a way to have live restore only apply to non-swarm containers? This is a pretty big issue when running a system upgrade that upgrades the docker daemon and thus restarts all the containers.

reformit avatar Aug 05 '22 20:08 reformit

bruh i just need this to run Coolify

SaadBazaz avatar Dec 08 '22 05:12 SaadBazaz