How to run CRIU inside a container to dump/restore a process?
Description
I'm trying to integrate CRIU with apptainer, which is a popular container runtime in high performance computing environments. The initial approach is to exec CRIU inside the container via some built-in commands. I encountered an error while trying to do a dump. My intuition tells me it might have something to do with /dev being external bind mounted into the container.
The process being dumped has its stdin redirected to /dev/null, and /dev is mounted from the host.
The CRIU dump command is: criu dump --unprivileged --tree $PID --images-dir $IMG_DIR --work-dir $WORK_DIR --shell-job -v4 --log-file dump.log
CRIU logs and information:
CRIU full dump/restore logs:
(00.038535) Dumping opened files (pid: 57)
(00.038576) ----------------------------------------
(00.038600) Sent msg to daemon 71 0 0
pie: 57: __fetched msg: 71 0 0
pie: 57: __sent ack msg: 71 71 0
pie: 57: Daemon waits for command
(00.038653) Wait for ack 71 on daemon socket
(00.038660) Fetched ack: 71 71 0
(00.038677) 57 fdinfo 0: pos: 0 flags: 100000/0
(00.038834) Error (criu/files-reg.c:1817): Can't lookup mount=62 for fd=0 path=/dev/null
(00.038841) ----------------------------------------
(00.038882) Error (criu/cr-dump.c:1675): Dump files (pid: 57) failed with -1
(00.038905) Waiting for 57 to trap
(00.038938) Daemon 57 exited trapping
(00.038947) Sent msg to daemon 3 0 0
pie: 57: __fetched msg: 3 0 0
pie: 57: 57: new_sp=0x7f6644491888 ip 0x7f664455e388
(00.106847) 57 was trapped
(00.107052) 57 was trapped
(00.107062) 57 (native) is going to execute the syscall 15, required is 15
(00.107213) 57 was stopped
(00.107540) Unlock network
(00.108080) Unfreezing tasks into 1
(00.108089) Unseizing 57 into 1
(00.108111) Unseizing 58 into 1
(00.108139) Error (criu/cr-dump.c:2099): Dumping FAILED.
Output of `criu --version`:
Version: 3.17
GitID: v3.17-117-g50db2be1a
Additional environment details:
host mountinfo:
$ cat /proc/self/mountinfo
53 60 0:27 / /mnt/wsl rw,relatime shared:1 - tmpfs none rw
54 60 0:29 / /usr/lib/wsl/drivers ro,nosuid,nodev,noatime - 9p drivers ro,dirsync,aname=drivers;fmask=222;dmask=222,mmap,access=client,msize=65536,trans=fd,rfd=7,wfd=7
58 60 0:33 / /usr/lib/wsl/lib rw,relatime - overlay none rw,lowerdir=/gpu_lib_packaged:/gpu_lib_inbox,upperdir=/gpu_lib/rw/upper,workdir=/gpu_lib/rw/work
60 44 8:32 / / rw,relatime - ext4 /dev/sdc rw,discard,errors=remount-ro,data=ordered
61 60 0:2 /init /init rw - rootfs rootfs rw,size=4020640k,nr_inodes=1005160
62 60 0:5 / /dev rw,nosuid,relatime - devtmpfs none rw,size=4020668k,nr_inodes=1005167,mode=755
63 60 0:20 / /sys rw,nosuid,nodev,noexec,noatime - sysfs sysfs rw
64 60 0:38 / /proc rw,nosuid,nodev,noexec,noatime - proc proc rw
65 62 0:39 / /dev/pts rw,nosuid,noexec,noatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
66 60 0:40 / /run rw,nosuid,nodev - tmpfs none rw,mode=755
67 66 0:41 / /run/lock rw,nosuid,nodev,noexec,noatime - tmpfs none rw
68 66 0:42 / /run/shm rw,nosuid,nodev,noatime - tmpfs none rw
69 66 0:43 / /run/user rw,nosuid,nodev,noexec,noatime - tmpfs none rw,mode=755
70 64 0:28 / /proc/sys/fs/binfmt_misc rw,relatime - binfmt_misc binfmt_misc rw
71 63 0:44 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,mode=755
72 71 0:45 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw
73 71 0:46 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset
74 71 0:47 / /sys/fs/cgroup/cpu rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu
75 71 0:48 / /sys/fs/cgroup/cpuacct rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuacct
76 71 0:49 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
77 71 0:26 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
78 71 0:50 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
79 71 0:51 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
80 71 0:52 / /sys/fs/cgroup/net_cls rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls
81 71 0:53 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event
82 71 0:54 / /sys/fs/cgroup/net_prio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_prio
83 71 0:55 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,hugetlb
84 71 0:56 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,pids
85 71 0:57 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,rdma
86 71 0:58 / /sys/fs/cgroup/misc rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,misc
131 60 0:59 / /mnt/c rw,noatime - 9p drvfs rw,dirsync,aname=drvfs;path=C:\;uid=1000;gid=1000;symlinkroot=/mnt/,mmap,access=client,msize=262144,trans=virtio
132 60 0:60 / /mnt/d rw,noatime - 9p drvfs rw,dirsync,aname=drvfs;path=D:\;uid=1000;gid=1000;symlinkroot=/mnt/,mmap,access=client,msize=262144,trans=virtio
133 60 8:32 /var/lib/docker /var/lib/docker rw,relatime shared:2 - ext4 /dev/sdc rw,discard,errors=remount-ro,data=ordered
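The error in the log refers to mount=62, and in the host mountinfo above that ID is the devtmpfs mounted at /dev. A minimal Python sketch of that lookup follows, assuming the mountinfo text is available as a string (for example read from /proc/self/mountinfo). The names `Mount`, `parse_mountinfo` and `find_mount` are illustrative and not part of CRIU; the sample lines are copied from the listing above.

```python
from typing import List, NamedTuple, Optional


class Mount(NamedTuple):
    mount_id: int
    parent_id: int
    root: str
    mount_point: str
    fs_type: str


def parse_mountinfo(text: str) -> List[Mount]:
    """Parse the contents of a /proc/<pid>/mountinfo file into Mount records."""
    mounts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Fields before the " - " separator describe the mount itself;
        # fields after it describe the filesystem (type, source, options).
        left, _, right = line.partition(" - ")
        fields = left.split()
        mounts.append(
            Mount(
                mount_id=int(fields[0]),
                parent_id=int(fields[1]),
                root=fields[3],
                mount_point=fields[4],
                fs_type=right.split()[0],
            )
        )
    return mounts


def find_mount(mounts: List[Mount], mount_id: int) -> Optional[Mount]:
    """Return the mount with the given ID, or None if it is not listed."""
    for mount in mounts:
        if mount.mount_id == mount_id:
            return mount
    return None


# Three lines taken from the host mountinfo shown above.
SAMPLE = """\
60 44 8:32 / / rw,relatime - ext4 /dev/sdc rw,discard,errors=remount-ro,data=ordered
62 60 0:5 / /dev rw,nosuid,relatime - devtmpfs none rw,size=4020668k,nr_inodes=1005167,mode=755
65 62 0:39 / /dev/pts rw,nosuid,noexec,noatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
"""

if __name__ == "__main__":
    print(find_mount(parse_mountinfo(SAMPLE), 62))
```

Running the same lookup against the mountinfo seen from inside the container would show whether mount ID 62 is visible there at all, which is one way to check the suspicion about /dev.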
My intuition tells me it might have something to do with /dev being external bind mounted into the container.
That sounds right. All mounts from outside of the container need to be marked as external. Running CRIU in an OCI container (Docker/Podman) usually works without any additional parameters, as all the mounts are usually set up correctly.
For runc/crun checkpointing, all external mount points into the container need to be part of the container configuration. Usually that is config.json. runc/crun marks all external mounts before calling CRIU, so CRIU knows about them.
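To illustrate what information config.json carries, here is a small Python sketch that lists the destinations of bind mounts from an OCI runtime configuration. It only assumes the standard "mounts" array with "destination", "type", "source" and "options" fields. The function name `bind_mount_destinations` and the sample configuration (including the /data mount) are made up for illustration, and this is not how runc or crun implement it internally.

```python
import json
from typing import List


def bind_mount_destinations(config_text: str) -> List[str]:
    """Return the destinations of bind mounts listed in an OCI config.json."""
    config = json.loads(config_text)
    destinations = []
    for mount in config.get("mounts", []):
        options = mount.get("options", [])
        # Bind mounts are declared either through the type or the options.
        is_bind = (
            mount.get("type") == "bind"
            or "bind" in options
            or "rbind" in options
        )
        if is_bind:
            destinations.append(mount["destination"])
    return destinations


# Hypothetical, heavily trimmed config.json used only for this example.
SAMPLE_CONFIG = """
{
  "mounts": [
    {"destination": "/proc", "type": "proc", "source": "proc"},
    {"destination": "/data", "type": "bind", "source": "/srv/data",
     "options": ["rbind", "rw"]}
  ]
}
"""

if __name__ == "__main__":
    print(bind_mount_destinations(SAMPLE_CONFIG))
```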
First try, based on your information, would probably be to mark /tmp/rootfs-4249390345/root/dev as external. But there are a lot of those mounts from the outside. So if you have a way to ask apptainer for those mounts that would make it easier for you.
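As a rough sketch of what marking mounts as external could look like, the following Python helper builds `--external mnt[MOUNTPOINT]:NAME` options for `criu dump` from a list of mount points. The helper name `external_mount_args` and the `extN` key names are made up for illustration. Which mount point paths CRIU actually expects in the apptainer case is an assumption that would need to be checked against the container's mountinfo.

```python
from typing import Iterable, List


def external_mount_args(mount_points: Iterable[str]) -> List[str]:
    """Build `--external mnt[MOUNTPOINT]:NAME` options for `criu dump`."""
    args: List[str] = []
    for index, mount_point in enumerate(mount_points):
        # The name after the colon is an arbitrary key. The same key is
        # used again at restore time to tell CRIU where the mount comes from.
        args += ["--external", f"mnt[{mount_point}]:ext{index}"]
    return args


if __name__ == "__main__":
    # The path is the one mentioned above; a real list would be much longer.
    print(external_mount_args(["/tmp/rootfs-4249390345/root/dev"]))
```

The resulting list could be appended to the existing `criu dump` argument list. On restore, each key would need a matching `--external mnt[NAME]:PATH` option.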
@adrianreber Would it work to enter the namespace using "nsenter -t container_PID --net bash" and mount from inside the namespace?
This just enters the network namespace, right? Not sure how that would help.
There are container engines (like Docker and Podman) which enable you to run CRIU inside of the container. Please take a look at those and see how they are set up. It would be nice if this also worked in apptainer, although I am not sure if that is possible. It seems like apptainer exposes a lot of host directories into the container because that is important for MPI applications (at least that is what I remember about it, I never used it myself). As already said, with a very long list of directories marked as external, based on the mountinfo, it might work.
If your goal is to integrate CRIU into apptainer, I am not sure there is value in figuring out how to run CRIU in the container. There are multiple container runtimes out there which support checkpointing of containers: crun, lxc, runc, youki (partially). I think taking a look at those and seeing if you can do the same in apptainer would be a good approach.
Thanks for the list of CRIU-compatible container runtimes, will stick to runc then. ;)
Thank you for your suggestion! Apptainer is a container runtime for multi-user scenarios, which follows from the way HPC clusters are used, so the integration needs to take security into account. Inspired by this report, I decided to run CRIU inside the container, limit its scope (with the PID namespace enabled) and use the --unprivileged option to drop root privileges.
It seems like container migration in K8s with CRIU does not call criu restore in runc, but instead calls create and start in runc? Are there patches to runc in this repo that make the create and start path call criu restore?
There is something you misunderstood. On the runc level it is a normal restore.
Thanks so much for the explanation. Are there any hints about patches for this at the containerd or K8s layer? I would like to learn how containerd restores the container (for example, how the fs-diff is applied to the rootfs and how the image is restored into the process context).
It is all part of CRI-O. No external patches necessary.
Got it, thanks!
I listened to your talk last week. Are there any hints on how to try out the POC examples from the talk? Could you perhaps share the demo scripts in a test folder of the criu repo?
Not really relevant here, but there is nothing special. Just start a container in Kubernetes and for checkpointing please see:
https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/
Thanks :)