criu
docker checkpoint creation fails on AWS EBS but works when EFS is mounted
When I try to create Docker checkpoints on AWS EC2 (Ubuntu 20.04 LTS) with standard gp2 EBS storage, I get the error below. After moving the default Docker directory to AWS EFS (NFS), checkpointing suddenly starts to work.
~# docker run -d --name c1 alpine:latest sh -c "while true; do date; sleep 2; done"
7f869efa90477db1a90c3a44474d1613fe27c6a6ea4442d88d957d38e500d49b
~# docker checkpoint create c1 asd
Error response from daemon: Cannot checkpoint container c1: runc did not terminate successfully: exit status 1: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v2.task/moby/7f869efa90477db1a90c3a44474d1613fe27c6a6ea4442d88d957d38e500d49b/criu-dump.log: unknown
~# tail -17 /run/containerd/io.containerd.runtime.v2.task/moby/7f869efa90477db1a90c3a44474d1613fe27c6a6ea4442d88d957d38e500d49b/criu-dump.log
(00.005299) Dumping task (pid: 14170)
(00.005302) ========================================
(00.005304) Obtaining task stat ...
(00.005327)
(00.005329) Collecting mappings (pid: 14170)
(00.005331) ----------------------------------------
(00.005390) Found regular file mapping, OK
(00.005435) Error (criu/files-reg.c:1710): Can't lookup mount=679 for fd=-3 path=/bin/busybox
(00.005445) Error (criu/cr-dump.c:1524): Collect mappings (pid: 14170) failed with -1
(00.005473) Unlock network
(00.005475) Running network-unlock scripts
(00.005477) RPC
(00.007890) Unfreezing tasks into 1
(00.007900) Unseizing 14170 into 1
(00.007910) Unseizing 14219 into 1
(00.007930) Error (criu/cr-dump.c:2053): Dumping FAILED.
~# mkdir /var/lib/docker_shared
~# mount -t efs -o _netdev,wsize=1048576000,tls,accesspoint=${EFS_ACCESS_POINT|} ${EFS_DNS}:/ /var/lib/docker_shared
~# echo "${EFS_DNS}:/ /var/lib/docker_shared _netdev,wsize=1048576000,tls,accesspoint=${EFS_ACCESS_POINT|} 0 0" | cat >> /etc/fstab
~# mkdir /var/lib/docker_shared/$EC2_INSTANCE_ID
~# cp /lib/systemd/system/docker.service /etc/systemd/system/
~# sed -i "s/\ -H\ fd:\/\// -g \/var\/lib\/docker_shared\/$EC2_INSTANCE_ID/g" /etc/systemd/system/docker.service
~# systemctl daemon-reload
~# systemctl restart docker.service
~# docker run -d --name c1 alpine:latest sh -c "while true; do date; sleep 2; done"
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
df9b9388f04a: Pull complete
Digest: sha256:4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454
Status: Downloaded newer image for alpine:latest
6367509770ff6503a6d9d9856a58eed78570ce34747d70ef1a67bf2418cebf5c
~# docker checkpoint create c1 asd
asd
~# df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/root ext4 59G 3.5G 55G 7% /
devtmpfs devtmpfs 3.8G 0 3.8G 0% /dev
tmpfs tmpfs 3.8G 0 3.8G 0% /dev/shm
tmpfs tmpfs 774M 980K 774M 1% /run
tmpfs tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs tmpfs 3.8G 0 3.8G 0% /sys/fs/cgroup
/dev/loop0 squashfs 26M 26M 0 100% /snap/amazon-ssm-agent/5656
/dev/loop1 squashfs 56M 56M 0 100% /snap/core18/2344
/dev/loop2 squashfs 62M 62M 0 100% /snap/core20/1434
/dev/loop3 squashfs 45M 45M 0 100% /snap/snapd/15534
/dev/loop4 squashfs 68M 68M 0 100% /snap/lxd/22753
tmpfs tmpfs 774M 0 774M 0% /run/user/1000
127.0.0.1:/ nfs4 8.0E 9.0M 8.0E 1% /var/lib/checkpoints
127.0.0.1:/ nfs4 8.0E 9.0M 8.0E 1% /var/lib/docker_shared
I also tried a different EBS storage type (io1), and it did not fix the problem.
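The `tail -17` output above mixes progress messages with the actual failures. As a minimal sketch, the error lines can be pulled out of a CRIU dump log with `grep`; this assumes the log keeps the `(timestamp) Error (file:line): ...` shape seen in this report, and `criu_errors` is a hypothetical helper name, not part of CRIU or Docker.

```shell
# Hypothetical helper: print only the error lines of a CRIU dump log.
# Assumes each line starts with "(seconds.micros) " as in the log above.
criu_errors() {
    grep -E '^\([0-9.]+\) Error ' "$1"
}
```

It would be called with the log path from the daemon error message, for example `criu_errors /run/containerd/io.containerd.runtime.v2.task/moby/<container-id>/criu-dump.log`. The first error printed is usually the one worth quoting in a report.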
You seem to be hitting a bug in Ubuntu's overlayfs. Upgrading to the latest kernel may fix it, but I am not sure. The bug is Ubuntu-only, so you should not hit it on another distribution.
What kernel do you use? Could you show /proc/pid/mountinfo from a container?
5.13.0-1025-aws
5.13.0-1023-aws
cat: /proc/pid: No such file or directory
`pid` was a placeholder for a process ID. From the host you can run, for example:
docker exec NAME cat /proc/1/mountinfo
I think @adrianreber is right: this is a known issue in the Ubuntu kernel: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257
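One way to relate the `Can't lookup mount=679` error to the requested mountinfo is to check whether that mount ID appears in the container's mount table at all. The sketch below relies only on the documented `/proc/<pid>/mountinfo` layout, where the first field of each line is the mount ID. `has_mount_id` is a hypothetical helper name, and treating a missing ID as a sign of this kernel issue is an assumption, not something the thread confirms.

```shell
# Hypothetical helper: succeed if mount ID $1 is listed in mountinfo file $2.
# The first whitespace-separated field of each mountinfo line is the mount ID.
has_mount_id() {
    awk -v id="$1" '$1 == id { found = 1 } END { exit !found }' "$2"
}
```

For example, save the table with `docker exec c1 cat /proc/1/mountinfo > mountinfo.txt` and then run `has_mount_id 679 mountinfo.txt && echo present || echo missing`, using the mount ID from your own `criu-dump.log`.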
A friendly reminder that this issue had no activity for 30 days.