bottlerocket Checkpoint/Restart or Live Motion

Checkpoint/Restart or Live Motion

Open jonathan-3play opened this issue 11 months ago • 2 comments

What I'd like:

Live container migration. The ability to checkpoint containers and restart them on a different node, ideally with an ease and confidence rivaling workload migration seen in enterprise IT using virtual machines (e.g. VMware vMotion or Hyper-V Live Migration.

This will perhaps be viewed as outside the remit of a minimized, security-first OS such as Bottlerocket. OTOH Bottlerocket aspires to be the OS, foundation, and infrastructure for containerized workloads and world-leading k8s environments (e.g. EKS). Enterprise computing has long enjoyed workload migration (vMotion released 2003 and known to be used in production at scale by 2006). We'd love to see that in the container/k8s world.

In fact, we need workload migration it in the container/k8s world. While autoscalers (e.g. Karpenter) can eagerly provision more resources when needed, if a workload contains a mixture of short-, medium-, and long-duration jobs (ours most certainly do!), autoscalers are almost guaranteed to "strand" some nodes awaiting completion of the longest running jobs. Without workload migration, there is no way to effectively consolidate the long-running jobs and "compact" the cluster's resources.

Any alternatives you've considered:

Segmenting possibly long-running jobs onto a separate node pool in the hopes of stranding fewer resources. Effortful and home-grown. Difficult to accurate determine every job's likely run duration a priori. Somewhat challenging to link app-based duration signals with infrastructure-level (Karpenter/k8s) scheduling controls. Not clear node segmentation would be efficient/efficacious.
CRIU. Unclear if supported on Bottlerocket, or how well.
DIY checkpoint/restart. Effortful and home-grown. Feels like should be system-supported, as in the VM world.
Lighting a candle that Karpenter over time becomes smarter about recognizing node stranding and using that understanding to better bin-pack jobs, revisit previous do-not-schedule and deprovisioning decisions.
Probably others. None feels compelling.

Feb 28 '24 19:02 jonathan-3play

bottlerocket bottlerocket copied to clipboard

Checkpoint/Restart or Live Motion

bottlerocket
bottlerocket copied to clipboard