Checkpoint/Restart or Live Motion

Open jonathan-3play opened this issue 1 year ago • 3 comments

What I'd like:

Live container migration. The ability to checkpoint containers and restart them on a different node, ideally with an ease and confidence rivaling the workload migration seen in enterprise IT using virtual machines (e.g. VMware vMotion or Hyper-V Live Migration).

This will perhaps be viewed as outside the remit of a minimized, security-first OS such as Bottlerocket. OTOH Bottlerocket aspires to be the OS, foundation, and infrastructure for containerized workloads and world-leading k8s environments (e.g. EKS). Enterprise computing has long enjoyed workload migration (vMotion was released in 2003 and was known to be used in production at scale by 2006). We'd love to see that in the container/k8s world.

In fact, we need workload migration in the container/k8s world. While autoscalers (e.g. Karpenter) can eagerly provision more resources when needed, if a workload contains a mixture of short-, medium-, and long-duration jobs (ours most certainly do!), autoscalers are almost guaranteed to "strand" some nodes awaiting completion of the longest-running jobs. Without workload migration, there is no way to effectively consolidate the long-running jobs and "compact" the cluster's resources.
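
To make the stranding argument concrete, here is a minimal sketch in Python. The cluster shape, the job sizes, and the helper names `nodes_in_use` and `nodes_after_compaction` are all hypothetical and chosen for illustration only. It compares how many nodes stay occupied when jobs cannot move with how many would be needed if the remaining jobs could be repacked (first-fit decreasing is used as a simple stand-in for whatever consolidation strategy a scheduler might apply).

```python
def nodes_in_use(placement):
    """Count nodes that still host at least one running job."""
    return sum(1 for jobs in placement if jobs)


def nodes_after_compaction(placement, node_capacity):
    """Repack all remaining jobs with first-fit decreasing.

    Returns the number of nodes needed if every job could be migrated.
    Assumes one resource dimension and that no job exceeds node_capacity.
    """
    sizes = sorted((size for jobs in placement for size in jobs), reverse=True)
    free_per_node = []  # remaining capacity of each node opened so far
    for size in sizes:
        for i, free in enumerate(free_per_node):
            if size <= free:
                free_per_node[i] = free - size
                break
        else:
            # No open node has room, so open a new one.
            free_per_node.append(node_capacity - size)
    return len(free_per_node)


# Hypothetical state: four 16-vCPU nodes, short jobs have finished,
# and one long-running 4-vCPU job is left on each node.
placement = [[4], [4], [4], [4]]
print(nodes_in_use(placement))                # nodes held without migration
print(nodes_after_compaction(placement, 16))  # nodes needed with migration
```

In this made-up case all four nodes stay up for the long-running jobs, while a single node would suffice if the jobs could be moved.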

Any alternatives you've considered:

  1. Segmenting possibly long-running jobs onto a separate node pool in the hopes of stranding fewer resources. Effortful and home-grown. Difficult to accurately determine every job's likely run duration a priori. Somewhat challenging to link app-based duration signals with infrastructure-level (Karpenter/k8s) scheduling controls. Not clear node segmentation would be efficient/efficacious.
  2. CRIU. Unclear if supported on Bottlerocket, or how well.
  3. DIY checkpoint/restart. Effortful and home-grown. Feels like it should be system-supported, as in the VM world.
  4. Lighting a candle that Karpenter over time becomes smarter about recognizing node stranding and using that understanding to better bin-pack jobs and revisit previous do-not-schedule and deprovisioning decisions.
  5. Probably others. None feels compelling.

jonathan-3play avatar Feb 28 '24 19:02 jonathan-3play

Hello @jonathan-3play, thanks for cutting this well-written issue! There has been some work on the ability to checkpoint and restore containers in CRI-O and k8s, though I don’t believe there is anything off the shelf for what you are describing. This is a pretty interesting feature request, and I think there are some compelling use cases for being able to checkpoint/restore a long-running container. This issue will first require a deep dive into the current state of the various tools and what might be needed to deliver this type of functionality. I’d like to use this issue to track any findings that might be of interest around checkpoint/restore and CRIU in Bottlerocket.

yeazelm avatar Feb 29 '24 19:02 yeazelm

https://github.com/kubernetes/enhancements/issues/2008

kannon92 avatar Mar 29 '24 13:03 kannon92

We recently implemented Karpenter, and node disruption can play havoc with long-running processing tasks. Node checkpoint-restore via CRIU/CRIK or other mechanisms would be great. I anticipate waiting until k8s 1.35 or so before the existing checkpoint function matures out of beta.

whatnick avatar May 22 '25 06:05 whatnick
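
For readers following the enhancement linked above, the existing checkpoint function is exposed by the kubelet as `POST /checkpoint/{namespace}/{pod}/{container}` on the kubelet's secure port (10250 by default). The sketch below only builds that URL; the helper name `checkpoint_url` and the node, pod, and container names are hypothetical. It assumes the `ContainerCheckpoint` feature gate is enabled and that the container runtime on the node supports CRIU, neither of which this thread establishes for Bottlerocket.

```python
from urllib.parse import quote, urlencode


def checkpoint_url(node_host, namespace, pod, container, port=10250, timeout=None):
    """Build the kubelet checkpoint endpoint URL for one container.

    The request itself must be an authenticated POST to the kubelet;
    sending it is deliberately left out of this sketch.
    """
    parts = (quote(p, safe="") for p in (namespace, pod, container))
    path = "/checkpoint/{}/{}/{}".format(*parts)
    # 'timeout' (seconds) is an optional query parameter of the kubelet API.
    query = "?" + urlencode({"timeout": timeout}) if timeout is not None else ""
    return f"https://{node_host}:{port}{path}{query}"


# Hypothetical node, pod, and container names.
print(checkpoint_url("node-1", "default", "encoder-0", "worker"))
```

This covers checkpointing only. Restoring the resulting archive on another node is a separate step that the kubelet endpoint does not perform, which is part of why it falls short of the live migration requested in this issue.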