`cuda-checkpoint` hangs during container checkpoint with k3s/containerd
Description
I'm attempting to checkpoint a PyTorch container that uses an NVIDIA GPU with `k3s ctr c checkpoint`, but the process fails.
This issue seems specific to checkpointing within the container runtime, as using CRIU on a host process works correctly.
I ran the following command:
sudo k3s ctr c checkpoint --task --rw <CONTAINER_ID> ckpt
The checkpoint command hangs and eventually fails with a timeout error. The CRIU log shows:
(00.005028) Preparing image inventory (version 1)
(00.005051) Add pid ns 1 pid 2701895
(00.005059) Add net ns 2 pid 2701895
(00.005067) Add ipc ns 3 pid 2701895
(00.005074) Add uts ns 4 pid 2701895
(00.005082) Add time ns 5 pid 2701895
(00.005096) Add mnt ns 6 pid 2701895
(00.005105) Add user ns 7 pid 2701895
(00.005112) Add cgroup ns 8 pid 2701895
(00.005115) cg: Dumping cgroups for thread 2701895
(00.005132) cg: `- New css ID 1
(00.005136) cg: `- [] -> [/system.slice/k3s.service] [0]
(00.005138) cg: Set 1 is criu one
(00.005145) plugin: `cuda_plugin' hook 10 -> 0x71200b6dd6f3
(800.00523) Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
(800.00533) Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
(800.34507) Error (cuda_plugin.c:253): cuda_plugin: Failed to launch cuda-checkpoint to retrieve state:
(800.34511) Error (cuda_plugin.c:428): cuda_plugin: Failed to get CUDA state for PID 2680881
(800.34517) net: Unlock network
(800.34519) cuda_plugin: finished cuda_plugin stage 0 err -1
(800.34533) Unfreezing tasks into 1
(800.34534) Unseizing 2679665 into 1
(800.34535) Error (compel/src/lib/infect.c:418): Unable to detach from 2679665: No such process
(800.34538) Error (criu/cr-dump.c:2111): Dumping FAILED.
While the checkpoint command is running, ps shows the cuda-checkpoint process is stalled:
$ ps aux | grep cuda-checkpoint
root 2701902 0.0 0.0 33955200 8704 ? Sl 01:22 0:00 cuda-checkpoint --get-state --pid 2680881
Running the same command manually also stalls.
I found a similar issue, https://github.com/NVIDIA/cuda-checkpoint/issues/26, and set timeout 800 in /etc/criu/runc.conf, but I still get the same errors.
When I use CRIU directly on the host instead of through the container runtime, the workload can be checkpointed and restored normally.
Env
NVIDIA Driver Version: 570.86.10
CUDA Version: 12.8
CRIU Version: 4.1.1
I found that if a task's PID appears in the output of nvidia-smi, checkpointing that task hangs.
For example, the program below cannot be checkpointed using k3s ctr:
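#include <stdio.h>

#define PORT 10000

// Device-side counter, incremented by the kernel below.
__device__ int counter = 100;

__global__ void increment() { counter++; }

int main(void) {
    cudaFree(0);  // Force CUDA context creation before entering the loop.
    while (true) {
        int hCounter = 0;
        increment<<<1, 1>>>();
        // Copy the device counter back to the host (implicitly synchronizes).
        cudaMemcpyFromSymbol(&hCounter, counter, sizeof counter);
        if (hCounter % 10000 == 0)
            printf("Counter: %d\n", hCounter);
    }
    return 0;
}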
I suspect the main process is frozen. When I trace the cuda-checkpoint process with strace, I find that it stalls while waiting on two pipes (one read, one write) connected to the main process. Meanwhile, the main process that initiated runc checkpoint is simply waiting for cuda-checkpoint to exit.
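One way to test the "main process is frozen" suspicion while the checkpoint is hanging is to look at the target's entry in /proc. The sketch below is not part of CRIU or cuda-checkpoint; the helper names are hypothetical, and it only parses the `State` and `TracerPid` fields of `/proc/<pid>/status`. A non-zero `TracerPid` means another process is ptrace-attached to the target, and a state of `t (tracing stop)` means the target is currently stopped by its tracer.

```python
from pathlib import Path


def parse_status(text):
    """Parse the 'Key: value' lines of a /proc/<pid>/status file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def describe(fields):
    """Summarise the scheduler state and whether a tracer is attached."""
    state = fields.get("State", "unknown")
    tracer = int(fields.get("TracerPid", "0"))
    traced = f"traced by PID {tracer}" if tracer else "not traced"
    return f"state={state}, {traced}"


def report(pid):
    """Read /proc/<pid>/status for a live process and summarise it."""
    return describe(parse_status(Path(f"/proc/{pid}/status").read_text()))
```

Calling `report(2680881)` on the host during the hang (using the PID that cuda-checkpoint was asked about in the log above) would show whether the target is being traced at that moment. PIDs must be host PIDs, not the PIDs seen inside the container.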
@Cusox The cuda-checkpoint tool hangs when attempting to checkpoint processes that are in a "frozen" (cgroup frozen) or "seized" (PTRACE_SEIZE) state. This is not a bug, just a known limitation of how the tool works. We handle this in CRIU and the CUDA plugin.
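To check whether the "frozen" condition applies to a given PID, the cgroup v2 freezer state can be read from the host. The following is a minimal sketch under stated assumptions: the host uses the unified cgroup v2 hierarchy mounted at /sys/fs/cgroup, the script runs in the host's cgroup namespace, and the function names are hypothetical. It relies only on the `0::<path>` line in `/proc/<pid>/cgroup` and the `frozen` key in the cgroup's `cgroup.events` file.

```python
from pathlib import Path


def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path from /proc/<pid>/cgroup text, or None."""
    for line in proc_cgroup_text.splitlines():
        hierarchy_id, _, rest = line.partition(":")
        controllers, _, path = rest.partition(":")
        # The v2 entry has hierarchy ID 0 and an empty controller list.
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def is_frozen(cgroup_events_text):
    """Return True if the cgroup.events text reports 'frozen 1'."""
    for line in cgroup_events_text.splitlines():
        parts = line.split()
        if parts[:1] == ["frozen"]:
            return parts[1:] == ["1"]
    return False


def pid_is_cgroup_frozen(pid, root="/sys/fs/cgroup"):
    """Return True/False for a live PID, or None if it has no v2 cgroup entry."""
    path = cgroup_v2_path(Path(f"/proc/{pid}/cgroup").read_text())
    if path is None:
        return None  # cgroup v1 freezer is not handled in this sketch
    events = Path(root) / path.lstrip("/") / "cgroup.events"
    return is_frozen(events.read_text())
```

If `pid_is_cgroup_frozen` returns `True` for the PID passed to `cuda-checkpoint --get-state` while the tool is hanging, that matches the limitation described above.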