
CRIU Checkpoint and Restore with CRI CDI and GPU Devices

Open ZeroExistence opened this issue 5 months ago • 6 comments

Hello, I am trying to replicate the GPU checkpoint and restore demo from previous CRIU-related presentations, using a Kubernetes pod and containerd instead of Podman. At the moment, I am troubleshooting some mount issues during the restore process.

I just want to clarify some information regarding the working GPU migration with Podman.

  1. During the demo, did Podman use a pure CDI implementation to manage access to external devices and libraries? That is, was the container runtime set directly to runc or crun, without going through the NVIDIA Container Toolkit?
  2. Were any modifications made to the pod aside from the CDI injection into the OCI configuration when starting the pod/container?
  3. Were any modifications made during snapshot creation?
  4. Were any modifications made during CRIU restore (e.g. via action scripts)?

Right now, I am running into an issue caused by the difference between the list of mounts and devices in the CRI runtime spec and in the runc config.json. I inspected the spec.dump inside the snapshot tar, and it seems to reference the CRI runtime spec instead of the runc config.json.

Based on my analysis, it seems that additional CDI adjustments related to the NVIDIA driver and devices were applied further down the line.

flowchart TD
    Kubelet -->|Process CDI like NVIDIA_VISIBLE_DEVICES via annotation| Containerd
    Containerd --> NVIDIA-CDI
    NVIDIA-CDI -->|Process NVIDIA CDI like libcuda.so and /dev/nvidia0 via /var/run/cdi?| Runc
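
One way I plan to confirm what the NVIDIA CDI spec actually injects (device nodes such as /dev/nvidia0 and library mounts such as libcuda.so) is to look at the generated spec directly. A minimal sketch, assuming the spec was generated at /etc/cdi/nvidia.yaml (adjust if it lives under /var/run/cdi):

# Check whether a CDI spec exists in the default locations.
ls /etc/cdi /var/run/cdi 2>/dev/null

# List the CDI device names the runtime will resolve (e.g. nvidia.com/gpu=0).
nvidia-ctk cdi list

# Inspect the device nodes, mounts and hooks the spec injects into the OCI config.
grep -E 'path:|containerPath:|hookName:' /etc/cdi/nvidia.yaml | head -40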

But during restore, it seems that some of the CDI steps were skipped, as shown below.

flowchart TD
    Kubelet -->|Process CDI like NVIDIA_VISIBLE_DEVICES via annotation| Containerd
    Containerd --> NVIDIA-CDI
    NVIDIA-CDI -->|Send checkpoint restore request, without re-applying the NVIDIA CDI for libcuda.so| Runc
    Runc -.->|Process Restore| CRIU
    CRIU -.->|Failed due to missing NVIDIA CDI devices| CRIU

I assume this is more of an issue between NVIDIA-CDI and runc, but I want to confirm some details at a higher level before digging into the lower-level implementation.

One more thing: I think it is not possible to restore a Kubernetes pod using the previous config.json from runc, right?

Thank you very much!

ZeroExistence avatar May 22 '25 05:05 ZeroExistence

Thanks for reaching out. I am a bit confused about what you are trying to do, because you mix Podman and Kubernetes, so it is not totally clear.

For my demos using CRI-O, I use minimal code changes to CRI-O to make it work:

diff --git a/server/container_restore.go b/server/container_restore.go
index 1aebef56b..b880c5108 100644
--- a/server/container_restore.go
+++ b/server/container_restore.go
@@ -261,6 +261,18 @@ func (s *Server) CRImportCheckpoint(
                if dumpSpec.Linux.ReadonlyPaths != nil {
                        containerConfig.Linux.SecurityContext.ReadonlyPaths = dumpSpec.Linux.ReadonlyPaths
                }
+
+               if dumpSpec.Linux.Devices != nil {
+                       for _, d := range dumpSpec.Linux.Devices {
+                               device := &types.Device{
+                                       ContainerPath: d.Path,
+                                       HostPath:      d.Path,
+                                       Permissions:   "rw",
+                               }
+
+                               containerConfig.Devices = append(containerConfig.Devices, device)
+                       }
+               }
        }
 
        ignoreMounts := map[string]bool{

But that is all. This basically copies the devices from the checkpointed container to the restored container. The reason we have to do this is that the Nvidia Kubernetes tooling does not handle restore at all.

You were able to figure out most things on your own, but note that we do not change the config.json in the checkpoint archive: for a restored container, a completely new config.json is created by the container engine (CRI-O or containerd), and we take and modify only certain parts of the original container's configuration.

I assume this is more of an issue between NVIDIA-CDI and runc, but I want to confirm some details at a higher level before digging into the lower-level implementation.

Yes, kind of. I was able to modify the NVIDIA container runtime that replaces runc to work correctly in the restore case, as it is open source. Unfortunately, we also need changes to the hook handling, and that part is not open source. Currently the hook part from NVIDIA modifies config.json and creates all necessary mounts during create/start. For restore we only need the changes to config.json, and the mounts would be handled by CRIU. So by not being open source, NVIDIA makes this unnecessarily complicated to fix.

One more thing: I think it is not possible to restore a Kubernetes pod using the previous config.json from runc, right?

As mentioned above, we create a new config.json during the restore. We take certain parts of the old config.json and adapt others. This can all be seen in the CRI-O and containerd source code.
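
For reference, a quick way to see which pieces of the old configuration the engine has available during restore is to look inside the checkpoint archive itself. A sketch; the archive name is illustrative and the exact file list depends on the engine version:

# Typical contents: checkpoint/ (CRIU image files), config.dump and spec.dump
# (metadata about the original container), rootfs-diff.tar (file system changes), dump.log, ...
tar -tf checkpoint.tar

# The original runtime spec the engine consults while building the new config.json.
tar -xOf checkpoint.tar spec.dump | jq '{mounts: [.mounts[].destination], devices: .linux.devices}'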

adrianreber avatar May 22 '25 05:05 adrianreber

Thank you very much for the very valuable feedback!

This basically copies the devices from the checkpointed container to the restored container.

Yes, I guess we can work around it by appending the devices to config.json.

Currently the hook part from nvidia modifies config.json and creates all necessary mounts during create/start.

I also suspect that the hook is doing more than just injecting CDI information into config.json. I will also check this in NVIDIA's repo if possible, in https://github.com/NVIDIA/nvidia-container-toolkit/issues/1098.

For restore we only need the changes to config.json and the mounts would be handled by CRIU.

Ahhhhh... CRIU also needs to mount into a temporary directory too. I thought CRIU just used the mounts prepared by runc before the restore process...

RESTORE LOGS
(00.062585) mnt: 		Will mount 3309 from /dev/null (E)
(00.062588) mnt: 		Will mount 3309 @ /tmp/.criu.mntns.3Aw2iG/mnt-0000003309 /proc/latency_stats
(00.062590) mnt: 	Read 3309 mp @ /proc/latency_stats
(00.062600) mnt: 		Will mount 3308 from /dev/null (E)
(00.062603) mnt: 		Will mount 3308 @ /tmp/.criu.mntns.3Aw2iG/mnt-0000003308 /proc/keys
(00.062605) mnt: 	Read 3308 mp @ /proc/keys
(00.062610) mnt: 		Will mount 3307 from /dev/null (E)
(00.062615) mnt: 		Will mount 3307 @ /tmp/.criu.mntns.3Aw2iG/mnt-0000003307 /proc/kcore
------
(00.064277) mnt: 		Will mount 3300 from /bus
(00.064277) mnt: 		Will mount 3300 @ /tmp/.criu.mntns.ZAS4wR/mnt-0000003300 /proc/bus
(00.064278) mnt: 	Read 3300 mp @ /proc/bus
(00.064279) Error (criu/mount.c:3137): mnt: No mapping for 3730:(null) mountpoint
(00.064650) Error (criu/cgroup.c:1998): cg: cgroupd: recv req error: No such file or directory
CHECKPOINT LOGS
(04.317554) mnt: Inspecting sharing on 3309 shared_id 0 master_id 0 (@./proc/latency_stats)
(04.317556) mnt: Inspecting sharing on 3308 shared_id 0 master_id 0 (@./proc/keys)
(04.317557) mnt: Inspecting sharing on 3307 shared_id 0 master_id 0 (@./proc/kcore)
------
(04.317566) mnt: Inspecting sharing on 3300 shared_id 0 master_id 0 (@./proc/bus)
(04.317567) mnt: Inspecting sharing on 3730 shared_id 0 master_id 1 (@./usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so)
(04.317570) mnt: Detected external slavery for 3730 via 3730
(04.317571) mnt: Inspecting sharing on 3729 shared_id 0 master_id 1 (@./usr/lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so.570.133.20)
(04.317573) mnt: Detected external slavery for 3729 via 3729
(04.317575) mnt: Inspecting sharing on 3728 shared_id 0 master_id 0 (@./run/secrets/kubernetes.io/serviceaccount)
(04.317576) mnt: Inspecting sharing on 3727 shared_id 0 master_id 1 (@./usr/share/glvnd/egl_vendor.d/10_nvidia.json)
(04.317578) mnt: Detected external slavery for 3727 via 3727

I guess if CRIU does try to set up some bind mounts outside of runc, then a full CDI implementation might not easily address the mount concerns? Hmm...

In the meantime, I guess I need to focus on inspecting the generated spec of the CRI runtimes before and after the NVIDIA CDI hook modifies config.json.
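
Probably something like the following (a sketch; it assumes containerd's runtime v2 shim and the k8s.io namespace, and the container ID is a placeholder):

# containerd keeps the runc bundle (the config.json that runc actually consumes) here:
BUNDLE=/run/containerd/io.containerd.runtime.v2.task/k8s.io/<container-id>

# Devices and mounts after the NVIDIA CDI/hook modifications have been applied:
jq '.linux.devices' "$BUNDLE/config.json"
jq '.mounts[] | {destination, source}' "$BUNDLE/config.json"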

Thanks!

ZeroExistence avatar May 22 '25 07:05 ZeroExistence

During the demo, did Podman use pure CDI implementation in managing access to external devices and libraries

Yes, checkpoint/restore of GPU workloads with Podman should work out of the box when it is configured as described in https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
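
For completeness, a minimal sketch of that setup (the image name and device selector are illustrative):

# Generate the CDI spec for the installed driver, as described in the linked document.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Run a GPU container via CDI; runc/crun consumes the CDI spec directly,
# no NVIDIA runtime wrapper is involved.
sudo podman run -d --name cuda-test --device nvidia.com/gpu=all \
    nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 sleep infinity

# Checkpoint to an archive and restore from it
# (remove or rename the original container first when restoring on the same host).
sudo podman container checkpoint cuda-test --export=/tmp/cuda-test.tar.gz
sudo podman container restore --import=/tmp/cuda-test.tar.gz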

Are there some modifications made during the snapshot creation? Are there some modifications made during CRIU restore (ex. via action scripts)?

To create a checkpoint of a GPU container with containerd in Kubernetes, you need to add the following CRIU configuration (/etc/criu/runc.conf):

external mnt[]
enable-external-masters

This allows CRIU to auto-detect the external mounts for NVIDIA files that were created by libnvidia-container but not included in the container config.

To restore the container from a checkpoint and resolve the following error, you can specify the external mounts in the configuration file with external mnt[KEY]:VAL:

(00.064279) Error (criu/mount.c:3137): mnt: No mapping for 3730:(null) mountpoint

You can use the following command to obtain a list of the files that are mounted in the container:

$ nvidia-container-cli list --ipcs --libraries --firmwares --binaries

/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.570.86.10
/usr/lib/x86_64-linux-gnu/libcuda.so.570.86.10
/usr/lib/x86_64-linux-gnu/libcudadebugger.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.570.86.10
/usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvcuvid.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvoptix.so.570.86.10
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.570.86.10
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.570.86.10
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.570.86.10
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.570.86.10
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.570.86.10
/lib/firmware/nvidia/570.86.10/gsp_ga10x.bin
/lib/firmware/nvidia/570.86.10/gsp_tu10x.bin

Then add them as follows in the config file:

external mnt[lib/firmware/nvidia/570.86.10/gsp_tu10x.bin]:/lib/firmware/nvidia/570.86.10/gsp_tu10x.bin
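
If you need to cover all of the files listed above, something like this could generate the entries (a sketch; it assumes the host and container paths are identical, as in the example above):

# Turn the nvidia-container-cli output into "external mnt[KEY]:VAL" lines for /etc/criu/runc.conf.
# KEY is the in-container path without the leading slash, VAL is the host path.
nvidia-container-cli list --ipcs --libraries --firmwares --binaries | while read -r p; do
    echo "external mnt[${p#/}]:${p}"
done | sudo tee -a /etc/criu/runc.conf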

rst0git avatar May 22 '25 08:05 rst0git

Ohh, thank you Radostin. I was considering the external mapping config, but it does not look portable at a larger scale from my initial point of view. I guess I should work with the --external mappings now to make sure it works in my case. And since most of the nodes should have an identical setup, using a single config with every mapping should work.

When using external mnt[], there should be no issue with specifying mounts that do not exist during checkpoint and restore, right? I mean, even if I add a mount mapping for NVIDIA and then checkpoint a normal CPU pod, CRIU will just disregard the unused mount mapping?

I guess knowing all the external mounts used by our different workloads will be the trick here.

I will test these and provide feedback if everything is well. Thanks!

PS. Thank you for the nvidia-container-cli command. That sure is helpful!

ZeroExistence avatar May 22 '25 09:05 ZeroExistence

I was considering the external mapping config, but it does not look portable at a larger scale from my initial point of view.

@ZeroExistence We discussed this issue with @avagin and @adrianreber.

In addition, the auto-detect option (external mnt[]) is considered "unreliable", and we are looking for a better solution.

there should be no issue with specifying mounts that do not exist during checkpoint and restore, right? I mean, even if I add a mount mapping for NVIDIA and then checkpoint a normal CPU pod, CRIU will just disregard the unused mount mapping?

Yes, CRIU should ignore the unused mount mapping when performing checkpoint/restore of a CPU-only container.

rst0git avatar May 22 '25 09:05 rst0git

Thank you for this information!

I was able to do a single-node checkpoint and restore by specifying all possible NVIDIA external mounts in the config and letting the auto-detection option handle the rest (e.g. /dev/mqueue).

I will close this issue a bit later once I have no more questions. Cheers!

ZeroExistence avatar May 22 '25 11:05 ZeroExistence

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar Jun 22 '25 00:06 github-actions[bot]

I will close this issue now since I don't have any follow-up concerns. Thank you for the guidance!

ZeroExistence avatar Jul 16 '25 15:07 ZeroExistence