swarmkit
[Feature Request] Allow live-restore in swarm mode cluster.
The docker daemon live-restore option is not compatible with a swarm mode cluster; enabling it fails with "--live-restore daemon configuration is incompatible with swarm mode". I guess live-restore conflicts with swarmkit's orchestration, which judges node health and migrates a service's tasks: when a node's dockerd exits, that node's tasks are migrated elsewhere. But live-restore is useful when upgrading the docker daemon in a production environment. So will live-restore be supported in a swarm mode cluster? Can we specify the node's heartbeat/failure timeout so that tasks are not migrated while a node is being upgraded?
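For reference, a minimal sketch of the conflict (the config path is the standard one; the refusal is the error quoted above, though the exact point where it surfaces may depend on the engine version):

```console
# /etc/docker/daemon.json on a node that is (or will be) part of a swarm
$ cat /etc/docker/daemon.json
{
  "live-restore": true
}

# With this set, the daemon (or a swarm init/join attempt) rejects the combination:
#   --live-restore daemon configuration is incompatible with swarm mode
```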
@nishanttotla Pls take a look
Would require updating the agent and dispatcher to be able to live-restore containers. Not sure how difficult it would be. Adding some labels to track this, but not providing any answers on when it might get done. Y'all are certainly welcome to provide a pull request though.
any updates on this?
Any updates in 2019?
It would be an amazing feature in a production environment...
Now in 2020. Doesn't seem to be much activity here. Is there somewhere else where progress on this is tracked?
This is likely low priority; there's little motivation to add it, since most Swarm clusters automatically move containers off a node when you drain it for updates and reschedule them on other nodes (sketched below).
Swarm automatically moves tasks to other nodes. If you run containers outside of Swarm, using docker run, they will be terminated.
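For completeness, the drain-based upgrade workflow referred to above looks roughly like this (node name is hypothetical):

```console
# Cordon the node so Swarm reschedules its tasks onto other nodes
$ docker node update --availability drain worker-1

# ...upgrade the Docker engine on worker-1, restart the daemon...

# Put the node back into the scheduling pool
$ docker node update --availability active worker-1
```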
Yes, and what if the docker daemon crashes? It has happened a few times for me that the docker daemon got stuck in some corrupted state from a container. The only solution that worked, short of rebooting, was to restart the daemon. Without live-restore we can't do this without killing all our containers, right? Consider that we are using placement constraints, so our services cannot just be moved to any node.
And yeah, not all our containers are part of swarm services.
The live-restore feature seems to me like a basic reliability requirement. Containers don't need the daemon to run, do they? So why does taking it down kill all containers? I don't fully grok the Docker stack, but isn't dockerd responsible for the API server, delegating container management to containerd/runc/etc.?
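To illustrate the dockerd/containerd split being asked about here: on a typical install the container processes are parented to a containerd shim, not to dockerd, which is what makes live-restore feasible on standalone engines in the first place. A rough, illustrative look (process names, PIDs, and parentage vary by Docker/containerd version; the nginx container is just an example):

```console
$ ps -eo pid,ppid,comm | grep -E 'dockerd|containerd|nginx'
  812     1 containerd
  980     1 dockerd                   # API server / orchestration layer
 1204     1 containerd-shim-runc-v2   # shim supervising the container
 1223  1204 nginx                     # the container's process; dockerd is not an ancestor
```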
@dperny @thaJeztah Do you guys have any plans for this feature? We really need this in our production setup. I'm not sure what the difficulties are, or why this feature was marked as incompatible with swarm in the first place; I'd be happy to help work on this issue.
I think this would be difficult to support/add; even though the containers could be kept running, other parts would still shut down when upgrading the daemon, which means that (e.g.) swarm managers won't be able to communicate with the worker during that time; as a result, the reconciliation loop would kick in, and managers would reschedule tasks to be deployed on other nodes.
Once the daemon comes back up, the containers that were kept running would still be shut down (because they've been rescheduled).
live-restore and swarm services address the same kind of problem, but in different ways: live-restore for a single, non-orchestrated system, and swarm services for orchestrated systems (providing high availability through redundancy/reconciliation).
@thaJeztah the scenario you're describing is still somewhat better than the current one. At least, standalone containers will keep working as usual. And if the docker daemon restart happens fast enough, the swarm manager(s) could wait a bit before rescheduling. At worst, a short downtime for the swarm services being rescheduled (and no downtime for other containers). Undesirable, but acceptable in many situations.
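On the "wait a bit before rescheduling" point: the dispatcher heartbeat period is already tunable, which stretches how long a silent worker is tolerated before its tasks are rescheduled. As I understand it this is not a pause button for reconciliation, so treat it as a partial mitigation at best:

```console
# Raise the agent heartbeat period cluster-wide (the default is 5s);
# the manager's "node down" grace period is derived from this value.
$ docker swarm update --dispatcher-heartbeat 30s
```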
+1 from me. I don't need any of the orchestration and am basically using swarm just so containers on different hosts can be part of the same overlay network. I don't want these containers to die if my docker daemon crashes.
@richiereynolds same here. Is there a way to have live-restore only apply to non-swarm containers? This is a pretty big issue when running a system upgrade that upgrades the docker daemon and thus restarts all the containers.
bruh i just need this to run Coolify