checkpoint/restore of /dev/shm in containers
I am not sure if this is in the scope of runc or CRIU, but I just want to check: is there a way to include the files within a tmpfs (e.g. /dev/shm) in the snapshot of a container?
I am currently working with applications that actively utilize /dev/shm. During checkpoint everything is fine, but it causes a problem during restore, since tmpfs volumes are mounted empty by default during initialization.
During checkpoint:
(00.112009) vma 7289949cc000 borrows vfi from previous 7289949cb000
(00.112019) Handling VMA with the following smaps entry: 7289949cf000-7289949d0000 rw-s 00000000 00:1a 26 /dev/shm/pym-43-q4izhbog (deleted)
(00.112023) Found regular file mapping, OK
(00.112031) Dumping path for -3 fd via self 15 [/dev/shm/pym-43-q4izhbog (deleted)]
(00.112033) Strip ' (deleted)' tag from './dev/shm/pym-43-q4izhbog (deleted)'
(00.112034) Dumping ghost file for fd 15 id 0x96
(00.112035) mnt: Path `/dev/shm/pym-43-q4izhbog' resolved to `./dev/shm' mountpoint
(00.112037) Dumping ghost file contents (id 0x1)
(00.112080) Only file size could be stored for validation for file /dev/shm/pym-43-q4izhbog
During restore:
(00.407305) 95: Restoring size 0x10000 for 0x7db21
(00.407306) 1: Opening 0x007bb55ffde000-0x007bb55ffe8000 0x0000000002c000 (20000041) vma
(00.407309) 43: 4885: Link dev/shm/pym-43-q4izhbog.cr.1.ghost -> dev/shm/pym-43-q4izhbog
(00.407312) 1: Opening 0x007bb55ffe8000-0x007bb55ffea000 0x00000000036000 (41) vma
(00.407312) 95: Send fd 10 to /crtools-fd-95-571179c1-2c99-4e2e-a2e9-34686deabc01
(00.407316) 1: Opening 0x007bb55ffea000-0x007bb55ffec000 0x00000000038000 (41) vma
(00.407324) 95: Create fd for 45
(00.407327) 95: Creating pipe pipe_id=0x7db21 id=0x138
(00.407332) 95: Further fle=0x7de8b0bd5540, pid=95
(00.407336) 95: Create fd for 46
(00.407339) 95: epoll: Restore eventpoll: id 0x000139 flags 0x02
(00.407342) 43: File dev/shm/pym-43-q4izhbog could only be validated with file size
(00.407346) 95: Create fd for 47
(00.407347) 43: Unlink: 4885:dev/shm/pym-43-q4izhbog
(00.407351) 95: Creating pipe pipe_id=0x7db22 id=0x13a
(00.407359) 95: Restoring size 0x10000 for 0x7db22
(00.407361) 43: Create fd for 18
------
(00.409543) pie: 94: mmap(0x7cab54732000 -> 0x7cab54733000, 0x1 0x12 4)
(00.409545) 43: Opening 0x00728859426000-0x00728859427000 0x00000000015000 (41) vma
(00.409549) 43: Opening 0x00728859427000-0x00728859428000 0x00000000016000 (41) vma
(00.409551) 43: Opening 0x00728859429000-0x00728859437000 0000000000000000 (20000041) vma
(00.409552) pie: 94: mmap(0x7cab54733000 -> 0x7cab54734000, 0x3 0x12 4)
(00.409560) pie: 94: mmap(0x7cab54734000 -> 0x7cab54735000, 0x3 0x12 4)
(00.410385) mnt: Switching to new ns to clean ghosts
(00.410451) Unlink remap /tmp/.criu.mntns.rTqpuG/mnt-0000004885/pym-43-q4izhbog.cr.1.ghost
(00.410888) Error (criu/cr-restore.c:2324): Restoring FAILED.
(00.412187) Error (criu/cgroup.c:1998): cg: cgroupd: recv req error: No such file or directory
The error is related to a ghost file, but I assume it has more to do with how files within /dev/shm are handled, since it is considered an external mount.
I initially tested using --skip-mnt /dev/shm, which resulted in fsnotify- and irmap-related errors, so it looks like that is not the option to use here.
If it is not yet possible to include tmpfs files in the snapshot, is it possible to manage them manually with an action script? I inspected scripts/tmp-files.sh and it looks doable.
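From the CRIU documentation, my understanding is that the script is passed with --action-script and CRIU exports the current hook name in CRTOOLS_SCRIPT_ACTION (a minimal sketch; $PID, $IMG_DIR and the script path are placeholders):

# CRIU runs the action script at every hook point and exports the hook
# name in $CRTOOLS_SCRIPT_ACTION ("pre-dump", "post-dump",
# "post-setup-namespaces", "pre-restore", ...).
criu dump --tree "$PID" --images-dir "$IMG_DIR" \
    --action-script /usr/local/bin/shm-hook.sh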
But my concern with this is: when should we extract the tmpfs files during restore? I assume running it on post-setup-namespaces would be the best place, since most of the mounts are already set up by then, like this:
(00.188311) 1: mnt-v2: Move mount 4879 from /tmp/.criu.mntns.rTqpuG/mnt-0000004879 to /tmp/.criu.mntns.rTqpuG/12-0000000000/dev
(00.188325) 1: mnt-v2: Move mount 4883 from /tmp/.criu.mntns.rTqpuG/mnt-0000004883 to /tmp/.criu.mntns.rTqpuG/12-0000000000/dev/mqueue
(00.188340) 1: mnt-v2: Move mount 4884 from /tmp/.criu.mntns.rTqpuG/mnt-0000004884 to /tmp/.criu.mntns.rTqpuG/12-0000000000/dev/pts
(00.188354) 1: mnt-v2: Move mount 4885 from /tmp/.criu.mntns.rTqpuG/mnt-0000004885 to /tmp/.criu.mntns.rTqpuG/12-0000000000/dev/shm
(00.188369) 1: mnt-v2: Move mount 4886 from /tmp/.criu.mntns.rTqpuG/mnt-0000004886 to /tmp/.criu.mntns.rTqpuG/12-0000000000/dev/termination-log
(00.188385) 1: mnt-v2: Move mount 4881 from /tmp/.criu.mntns.rTqpuG/mnt-0000004881 to /tmp/.criu.mntns.rTqpuG/12-0000000000/proc
(00.188402) 1: mnt-v2: Move mount 4219 from /tmp/.criu.mntns.rTqpuG/mnt-0000004219 to /tmp/.criu.mntns.rTqpuG/12-0000000000/proc/bus
(00.188418) 1: mnt-v2: Move mount 4220 from /tmp/.criu.mntns.rTqpuG/mnt-0000004220 to /tmp/.criu.mntns.rTqpuG/12-0000000000/proc/fs
(00.188433) 1: mnt-v2: Move mount 4392 from /tmp/.criu.mntns.rTqpuG/mnt-0000004392 to /tmp/.criu.mntns.rTqpuG/12-0000000000/proc/irq
Would writing those tmpfs files to /tmp/.criu.mntns.rTqpuG/12-0000000000/dev/shm do the trick?
Thank you very much in advance!
In what environment is your container running? Kubernetes, Podman, runc?
tmpfs is usually part of the checkpoint. CRIU will include all the content of a tmpfs in the checkpoint.
Not sure how your /dev/shm is set up.
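You can usually verify this by looking at the image directory after a dump; when CRIU dumps a tmpfs directly, its contents are stored as a tarball image:

# Tmpfs contents show up as tmpfs-dev-*.tar.gz.img files in the
# checkpoint directory ($IMG_DIR is a placeholder).
ls "$IMG_DIR" | grep '^tmpfs-dev-.*\.tar\.gz\.img$'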
In what environment is your container running? Kubernetes, Podman, runc?
I am using Kubernetes v1.33 with containerd v2.1.1.
tmpfs is usually part of the checkpoint. CRIU will include all the content of a tmpfs in the checkpoint.
Ohhh?? That sounds great if it is already being handled.
I saw this in the /dev/shm log for my other demo container, so I thought it was only handling validation.
(00.136135) Dumping path for 3 fd via self 21 [/dev/shm/date.txt]
(00.136167) Only file size could be stored for validation for file /dev/shm/date.txt
(00.136194) 135912 fdinfo 4: pos: 0 flags: 0/0
(00.136226) fsnotify: wd: wd 0x000001 s_dev 0x00001a i_ino 0x23 mask 0x000002
(00.136230) fsnotify: [fhandle] bytes 0x00000c type 0x000001 __handle 0x000023277b82d3:0000000000000000
(00.136242) fsnotify: Trying via mntid 4902 root / ns_mountpoint @./dev/shm (24)
(00.136260) fsnotify: link as dev/shm/date.txt
(00.136268) fsnotify: openable (inode match) as dev/shm/date.txt
(00.136274) fsnotify: Handle 0x1a:0x23 is openable
(00.136277) fsnotify: Dumping /dev/shm/date.txt as path for handle
(00.136280) fsnotify: id 0x000022 flags 00000000
I will try to restore this demo CPU container for now to validate this too. Thanks!
It really depends on how the tmpfs is set up. It could still be a bind mount into the container, and then it would not work.
Also, we have seen other reports concerning problems around IPC. Unfortunately, it is not possible to have an IPC namespace per container in Kubernetes, only per pod, which can be a problem with CRIU. I think it would be helpful if someone extended Kubernetes to allow IPC namespaces per container.
It really depends on how the tmpfs is set up. It could still be a bind mount into the container, and then it would not work.
Ahhhh, I didn't know that it was actually bind-mounted from the host's /dev/shm because of my hostIPC setting.
crictl inspect <CONTAINER>
-----
{
  "destination": "/dev/shm",
  "options": [
    "rbind",
    "rprivate",
    "rw"
  ],
  "source": "/dev/shm",
  "type": "bind"
},
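In case it helps anyone else, this is roughly the filter I used to pull out just the /dev/shm mount (assuming crictl exposes the OCI spec under .info.runtimeSpec, which matched my containerd setup but may differ elsewhere):

# Extract the /dev/shm mount entry from the container's runtime spec.
crictl inspect "$CONTAINER" \
    | jq '.info.runtimeSpec.mounts[] | select(.destination == "/dev/shm")'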
Hmm.. I will try to tinker with pods/containers without hostIPC.
Well, I inspected a normal container and it was also a bind type, pointing at its sandbox's own shm directory.
"destination": "/dev/shm",
"options": [
"rbind",
"rprivate",
"rw"
],
"source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a4789d978357b6f43a65b5651cbf533872e648a5c084dfb560f7717a41a36103/shm",
"type": "bind"
},
-------
shm on /run/containerd/io.containerd.grpc.v1.cri/sandboxes/a4789d978357b6f43a65b5651cbf533872e648a5c084dfb560f7717a41a36103/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k,inode64)
So if CRIU detects that the external volume is a tmpfs or shm, will it usually include it in the snapshot?
Hmm.. for the /dev/shm files of containers, I guess an action script is the only option. But the hard part is enumerating which files are actually owned by a specific container. It looks like it is not easy to expose those files outside of the criu dump execution so that they can be hooked within action scripts.
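The closest I can think of is walking /proc for a given PID, but the container's init PID would have to come from the runtime first (only a sketch of the enumeration problem):

# List /dev/shm files that a single process has mapped or open
# ($PID is a placeholder for the container's init PID).
ls -l /proc/"$PID"/map_files 2>/dev/null | grep '/dev/shm/'
ls -l /proc/"$PID"/fd 2>/dev/null | grep '/dev/shm/'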
Unfortunately, it is not possible to have an IPC namespace per container in Kubernetes, only per pod.
This only occurs in multi-container pods that actively share /dev/shm, right? Fortunately, I haven't encountered this concern yet.
Thank you again Adrian for your insight! I will try to think of some other options.
@ZeroExistence checkpoint/restore of /dev/shm is supported with Podman: https://github.com/containers/podman/pull/12665
We can implement support in CRI-O and containerd in the same way.
So if CRIU detects that the external volume is a tmpfs or shm, will it usually include it in the snapshot?
No. Only if the tmpfs is mounted directly in the container. A bind mount of a tmpfs is ignored.
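Roughly the difference, with illustrative commands only (not the exact commands the runtimes use):

# Tmpfs mounted directly inside the container's mount namespace:
# CRIU dumps the contents into the checkpoint.
mount -t tmpfs tmpfs /dev/shm

# Bind mount of a tmpfs that lives outside the container (like the
# sandbox shm above): CRIU treats it as external and skips the contents.
mount --bind /run/containerd/io.containerd.grpc.v1.cri/sandboxes/<ID>/shm /dev/shm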
@rst0git is right. I forgot about those changes in Podman. We need to add that to CRI-O.
Thank you for confirming, @adrianreber! I will open a PR for CRI-O.
Thank you for confirming, @adrianreber! I will open a PR for CRI-O.
I had a closer look and it is more complicated than expected. The problem is that the IPC namespace is shared between all containers in a pod: https://github.com/kubernetes/cri-api/blob/master/pkg/apis/runtime/v1/api.proto#L342
If you have two containers in a pod, they both mount /dev/shm from the pod, which means multiple containers are accessing it and you cannot easily add it to the container checkpoint, as it is a pod-level resource.
The right approach would be to teach Kubernetes to allow an IPC namespace per container. This shouldn't be a technical challenge, because there are already multiple namespaces which are either per node, pod, or container. I talked to other people who are also struggling with the IPC namespace being only per pod and node (@assafWeaversoft).
Something you could try is to open a CRI-O PR which only adds /dev/shm to the checkpoint archive if there is only one container in the pod, and only allows restoring that container into an empty pod. It is a rather hacky approach, but maybe a workaround.
checkpoint/restore of /dev/shm is supported with Podman: https://github.com/containers/podman/pull/12665
Ohhhh, thank you for the reference, Mr. Radostin! Based on the flow of the /dev/shm checkpointing in Podman, it looks like it is somehow manageable within an action script, as long as restoring it does not conflict with another pod container's /dev/shm.
The code change looks simple enough too, so I will also check whether this can somehow be inserted into containerd.
I will try to tinker with this when I get a chance.
PS: I will check the Podman CRIU integration next time too, since some of my concerns might already be addressed in Podman but not yet available in other runtimes.
Hi @ZeroExistence, glad to meet you on this interesting topic. I am currently working on a similar solution and would be glad to connect with you on the challenges we have in common. You can reach me on the k8s Slack as well.
Hi @assafWeaversoft, my pleasure to meet you too! I would love to provide you with some feedback/opinions if they can help. I will check whether my Slack is still active and join the k8s Slack.
If there is already an open issue related to this topic in other repositories, it might be better to link it to this issue. I would just try to provide my opinions on the existing thread/discussion.
Thank you for your work!
A friendly reminder that this issue had no activity for 30 days.
Just an update on this concern. For archiving tmpfs volumes, utilizing the action scripts was enough: we just identify the source directory of the container's /dev/shm and archive/extract it.
For the concern of assigning each container its own /dev/shm, creating an emptyDir volume for each container seems to be enough.
For now, our concerns have been addressed without having to modify any components within Kubernetes.
Thank you very much for the guidance!
Could you share your action scripts and how you configure /dev/shm?
Could you share your action scripts and how you configure /dev/shm?
Sure. I will post the script and details tomorrow when I get to my workstation.
For saving /dev/shm in a containerd container, I just modified this script:
https://github.com/checkpoint-restore/criu/blob/criu-dev/scripts/tmp-files.sh
#!/bin/bash
#
# This action script manages additional data to include during
# snapshot and restore. For now it archives and extracts /dev/shm
# for known tmpfs volumes.
#
PREDUMP="pre-dump"
POSTDUMP="post-dump"
PRERESTORE="pre-restore"

case "$CRTOOLS_SCRIPT_ACTION" in
"$PREDUMP")
	mkdir -p "${PWD}/ctrd-checkpoint"
	chmod 700 "${PWD}/ctrd-checkpoint"
	exit $?
	;;
"$POSTDUMP")
	# The working directory is the per-container image directory,
	# so its last path component is the container ID.
	CONTAINER_ID=$(basename "${PWD}")
	CONTAINER_TMPFS=$(ctr -n k8s.io container info "${CONTAINER_ID}" \
		| jq -r '.Spec.mounts[] | select(.destination | contains("/dev/shm")) | .source')
	tar --directory "${CONTAINER_TMPFS}" --verbose --create --gzip \
		--no-unquote --no-wildcards \
		--file "${PWD}/ctrd-checkpoint/dev-shm.tar" -- ./
	exit $?
	;;
"$PRERESTORE")
	CONTAINER_ID=$(basename "${PWD}")
	CONTAINER_TMPFS=$(ctr -n k8s.io container info "${CONTAINER_ID}" \
		| jq -r '.Spec.mounts[] | select(.destination | contains("/dev/shm")) | .source')
	tar --directory "${CONTAINER_TMPFS}" --verbose --extract --gzip \
		--no-unquote --no-wildcards \
		--file "/var/lib/containerd/io.containerd.grpc.v1.cri/containers/${CONTAINER_ID}/dev-shm.tar"
	exit $?
	;;
esac
exit 0
In this action script, we just inspect the container spec to get the source of the tmpfs, then tar it. I think this can easily be applied to CRI-O too with crictl. The last thing is that we need to take care when interacting with containers that have the hostIPC flag, since it might cause problems for other processes actively utilizing the host's /dev/shm.
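For wiring the script in, one option should be runc's CRIU configuration file: as far as I understand, runc reads extra CRIU options from /etc/criu/runc.conf (verify this for your runc version; the script path below is wherever you installed the script above):

# /etc/criu/runc.conf - extra CRIU options picked up by runc's
# checkpoint/restore integration (file location assumed from runc docs)
action-script /usr/local/bin/ctrd-shm-hook.sh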
For providing a per-container /dev/shm, I just do it like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: demo-gpu
  name: demo-gpu
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-gpu
  template:
    metadata:
      labels:
        app: demo-gpu
    spec:
      containers:
      - image: ubuntu:24.04
        command: ['/bin/bash', '-c']
        args:
        - sleep inf;
        securityContext:
          capabilities:
            add: ["IPC_LOCK"]
        name: demo-cpu
        volumeMounts:
        - mountPath: /dev/shm
          name: devshm-demo-cpu
      - image: ubuntu:24.04
        command: ['/bin/bash', '-c']
        args:
        - sleep inf;
        securityContext:
          capabilities:
            add: ["IPC_LOCK"]
        name: demo-gpu
        volumeMounts:
        - mountPath: /dev/shm
          name: devshm-demo-gpu
      volumes:
      - name: devshm-demo-gpu
        emptyDir:
          medium: Memory
      - name: devshm-demo-cpu
        emptyDir:
          medium: Memory
These implementations address our concerns with tmpfs.
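To double-check the isolation, something like this should show that one container's /dev/shm is not visible from the other (container names taken from the manifest above; a sketch, not verified output):

# Write a probe file in one container's /dev/shm...
kubectl exec deploy/demo-gpu -c demo-cpu -- sh -c 'echo probe > /dev/shm/probe'
# ...and confirm it is absent from the other container's /dev/shm.
kubectl exec deploy/demo-gpu -c demo-gpu -- ls /dev/shm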