criu icon indicating copy to clipboard operation
criu copied to clipboard

Lazy pages fails when used with podman

Open ZeyadYasser opened this issue 3 years ago • 24 comments
trafficstars

Description I am trying to do post-copy memory restore using podman to optimize response time of my container. Restore fails specifically the lazy-pages process when receiving the first pid and userfaultfd pair here uffd: recv_fd error, then subsequent pairs fail to be sent here uffd: PID sending error: Broken pipe

Steps to reproduce the issue:

  1. Container: docker.io/lloesche/valheim-server
  2. OCI Runtime: runc
  3. Run command:
sudo podman run --runtime=runc -d \
    --tmpfs /dev/shm \
    --security-opt label=disable \
    --name valheim-server \
    --cap-add=sys_nice \
    --stop-timeout 120 \
    -p 2456-2457:2456-2457/udp \
    -v $HOME/valheim-server/config:/config \
    -v $HOME/valheim-server/data:/opt/valheim \
    -e SERVER_NAME="server" \
    -e WORLD_NAME="world" \
    -e SERVER_PASS="secret" \
    lloesche/valheim-server
  1. Configuration file: /etc/criu/runc.conf
tcp-established
file-locks
lazy-pages
  1. Dump command:
sudo podman container checkpoint valheim-server
  1. Lazy pages:
sudo criu lazy-pages \
    --work-dir /var/lib/containers/storage/overlay-containers/[CONTAINER_ID]/userdata/ \
    --images-dir /var/lib/containers/storage/overlay-containers/[CONTAINER_ID]/userdata/checkpoint
  1. Restore command:
sudo podman container restore valheim-server

Describe the results you received: lazy-pages server crashes when receiving userfaultfd of first thread here:

Error (criu/uffd.c:762): uffd: recv_fd error

Describe the results you expected: Post-copy restore to work with podman.

Additional information you deem important (e.g. issue happens only occasionally):

CRIU logs and information:

CRIU full restore log: restore.log

Output of `criu --version`:

Version: 3.17
Output of `criu check --all`:

Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.

Additional environment details: Fedora server 35

ZeyadYasser avatar Jun 13 '22 16:06 ZeyadYasser

I wasn't sure whether to post this here or at podman, I thought I should be able to do post-copy memory restore using config files without help from podman.

Thanks in advance.

ZeyadYasser avatar Jun 13 '22 17:06 ZeyadYasser

Asking it here is a better idea. I never thought about if this can work. I can look at it next week. My first thought is that this cannot work, but maybe... Usually at some point you need to specify an IP and port...

If nobody is able to help you, ping me again next week.

adrianreber avatar Jun 13 '22 17:06 adrianreber

Thanks for the quick response. I was not looking into live migrating the memory pages over TCP, just utilizing lazy pages for better restore response time until the pages are pushed from disk to the restored process. Similare to: sudo criu lazy-pages -D . sudo criu restore --lazy-pages -D . From what I understood, I think criu restore just connects to lazy-pages.socket?

ZeyadYasser avatar Jun 13 '22 17:06 ZeyadYasser

What value is in /proc/sys/vm/unprivileged_userfaultfd? CRIU requires to set it to 1.

avagin avatar Jun 13 '22 17:06 avagin

lazy-pages server crashes when receiving userfaultfd of first thread here:

could you collect a core dump and show a stack trace?

avagin avatar Jun 13 '22 18:06 avagin

judging by the restore.log userfaultfd() syscall succeeded @ZeyadYasser, is there lazy-pages.log?

rppt avatar Jun 13 '22 18:06 rppt

lazy-pages server crashes when receiving userfaultfd of first thread here:

could you collect a core dump and show a stack trace?

and you can execute criu lazy-pages under strace and attach its output here (sudo strace -fo /tmp/criu-lazy-strace.log -s 1024 criu lazy-pages -D . -o lazy-pages.log -vvvv).

avagin avatar Jun 13 '22 18:06 avagin

sudo strace -fo /tmp/criu-lazy-strace.log -s 1024 criu lazy-pages -D . -o lazy-pages.log -vvvv criu-lazy-strace.log @rppt lazy-pages.log core-dump

@avagin /proc/sys/vm/unprivileged_userfaultfd was set to 0, I changed it to 1 but still getting the same error. Does it require a restart to take effect?

ZeyadYasser avatar Jun 13 '22 18:06 ZeyadYasser

The change to /proc/sys/vm/unprivileged_userfaultfd is immediate, no need to restart

rppt avatar Jun 13 '22 18:06 rppt

@ZeyadYasser can you change pr_err("recv_fd error\n") after recv_fd() to pr_perror("recv_fd error")? Maybe errno will help...

rppt avatar Jun 13 '22 19:06 rppt

16169 write(4, "(00.002357) Running status-ready scripts\n", 41) = 41
16169 epoll_create(12)                  = 6
16169 accept(5, {sa_family=AF_UNIX}, [110 => 2]) = 7
16169 recvfrom(7, "#\0\0\0", 4, 0, NULL, NULL) = 4
16169 recvmsg(7, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CTRUNC}, 0) = 1
16169 write(4, "(05.355306) Error (criu/uffd.c:762): uffd: recv_fd error\n", 57) = 57

It looks like we give a small buffer for a control message. I tried to reproduce the issue in my environment, but everything works as expected... Here is the strace output from my host:

73493 write(4, "(00.005363) Running status-ready scripts\n", 41) = 41
73493 epoll_create(3)                   = 6
73493 accept(5, {sa_family=AF_UNIX}, [110->2]) = 7
73493 recvfrom(7, "/\27\1\0", 4, 0, NULL, NULL) = 4
73493 recvmsg(7, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[8]}], msg_controllen=20, msg_flags=0}, 0) = 1
73493 write(4, "(05.505902) uffd: Received PID: 71471, uffd: 8\n", 47) = 47

avagin avatar Jun 13 '22 19:06 avagin

@rppt errno is in the strace output. The problem is that MSG_CTRUNC is set and we don't receive the SCM message.

@ZeyadYasser could you apply the next patch and reproduce the issue with strace:

diff --git a/include/common/scm-code.c b/include/common/scm-code.c
index de581846b..93ea1138d 100644
--- a/include/common/scm-code.c
+++ b/include/common/scm-code.c
@@ -35,7 +35,7 @@ static int *scm_fdset_init(struct scm_fdset *fdset, struct sockaddr_un *saddr, i
        fdset->hdr.msg_namelen = saddr_len;
 
        fdset->hdr.msg_control = &fdset->msg_buf;
-       fdset->hdr.msg_controllen = CMSG_LEN(sizeof(int) * CR_SCM_MAX_FD);
+       fdset->hdr.msg_controllen = CR_SCM_MSG_SIZE; // CMSG_LEN(sizeof(int) * CR_SCM_MAX_FD);
 
        cmsg = CMSG_FIRSTHDR(&fdset->hdr);
        cmsg->cmsg_len = fdset->hdr.msg_controllen;
@@ -83,7 +83,9 @@ int __recv_fds(int sock, int *fds, int nr_fds, void *data, unsigned ch_size, int
                min_fd = min(CR_SCM_MAX_FD, nr_fds - i);
                scm_fdset_init_chunk(&fdset, min_fd, data, ch_size);
 
+               pr_err("> msg_controllen %ld\n", fdset.hdr.msg_controllen);
                ret = __sys(recvmsg)(sock, &fdset.hdr, flags);
+               pr_err("< msg_controllen %ld\n", fdset.hdr.msg_controllen);
                if (ret <= 0)
                        return ret ? __sys_err(ret) : -ENOMSG;

avagin avatar Jun 13 '22 19:06 avagin

criu-lazy-strace-2.log lazy-pages-2.log

17314 write(4, "(00.002180) uffd: Waiting for incoming connections on lazy-pages.socket\n", 72) = 72
17314 socket(AF_UNIX, SOCK_STREAM, 0)   = 5
17314 unlink("lazy-pages.socket")       = 0
17314 bind(5, {sa_family=AF_UNIX, sun_path="lazy-pages.socket"}, 19) = 0
17314 listen(5, 10)                     = 0
17314 write(4, "(00.002392) Running status-ready scripts\n", 41) = 41
17314 epoll_create(12)                  = 6
17314 accept(5, {sa_family=AF_UNIX}, [110 => 2]) = 7
17314 recvfrom(7, "\30\0\0\0", 4, 0, NULL, NULL) = 4
17314 write(4, "(11.006320) Error (include/common/scm-code.c:86): > msg_controllen 20\n", 70) = 70
17314 recvmsg(7, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0", iov_len=1}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CTRUNC}, 0) = 1
17314 write(4, "(11.006786) Error (include/common/scm-code.c:88): < msg_controllen 0\n", 69) = 69
17314 write(4, "(11.006815) Error (criu/uffd.c:762): uffd: recv_fd error\n", 57) = 57

ZeyadYasser avatar Jun 13 '22 19:06 ZeyadYasser

It doesn't seem to have anything related to the container itself. Tested the flow on a generic container gave the same results. sudo podman run --runtime=runc -dt -p 8080:80/tcp docker.io/library/httpd

ZeyadYasser avatar Jun 13 '22 20:06 ZeyadYasser

@ZeyadYasser I think it can be a kernel problem. You can try to reproduce this issue with our tests: $ python test/zdtm.py run -t zdtm/static/env00 --pre 2 --lazy-pages

What kernel do you use?

I have found one minor issue in our code. Could you try the next patch?

diff --git a/include/common/scm-code.c b/include/common/scm-code.c
index de581846b..c1a3a7bd4 100644
--- a/include/common/scm-code.c
+++ b/include/common/scm-code.c
@@ -7,10 +7,10 @@ static void scm_fdset_init_chunk(struct scm_fdset *fdset, int nr_fds, void *data
        struct cmsghdr *cmsg;
        static char dummy;
 
-       fdset->hdr.msg_controllen = CMSG_LEN(sizeof(int) * nr_fds);
+       fdset->hdr.msg_controllen = CMSG_SPACE(sizeof(int) * nr_fds);
 
        cmsg = CMSG_FIRSTHDR(&fdset->hdr);
-       cmsg->cmsg_len = fdset->hdr.msg_controllen;
+       cmsg->cmsg_len = CMSG_LEN(sizeof(int) * nr_fds);
 
        if (data) {
                fdset->iov.iov_base = data;
@@ -35,7 +35,7 @@ static int *scm_fdset_init(struct scm_fdset *fdset, struct sockaddr_un *saddr, i
        fdset->hdr.msg_namelen = saddr_len;
 
        fdset->hdr.msg_control = &fdset->msg_buf;
-       fdset->hdr.msg_controllen = CMSG_LEN(sizeof(int) * CR_SCM_MAX_FD);
+       fdset->hdr.msg_controllen = CMSG_SPACE(sizeof(int) * CR_SCM_MAX_FD);
 
        cmsg = CMSG_FIRSTHDR(&fdset->hdr);
        cmsg->cmsg_len = fdset->hdr.msg_controllen;

If it doesn't fix the issue, could you apply the next patch and reproduce the issue with strace. criu restore can be executed under strace too.

diff --git a/include/common/scm-code.c b/include/common/scm-code.c
index de581846b..1d7f5be3e 100644
--- a/include/common/scm-code.c
+++ b/include/common/scm-code.c
@@ -35,7 +35,7 @@ static int *scm_fdset_init(struct scm_fdset *fdset, struct sockaddr_un *saddr, i
        fdset->hdr.msg_namelen = saddr_len;
 
        fdset->hdr.msg_control = &fdset->msg_buf;
-       fdset->hdr.msg_controllen = CMSG_LEN(sizeof(int) * CR_SCM_MAX_FD);
+       fdset->hdr.msg_controllen = CR_SCM_MSG_SIZE; // CMSG_LEN(sizeof(int) * CR_SCM_MAX_FD);
 
        cmsg = CMSG_FIRSTHDR(&fdset->hdr);
        cmsg->cmsg_len = fdset->hdr.msg_controllen;
@@ -80,10 +80,22 @@ int __recv_fds(int sock, int *fds, int nr_fds, void *data, unsigned ch_size, int
 
        cmsg_data = scm_fdset_init(&fdset, NULL, 0);
        for (i = 0; i < nr_fds; i += min_fd) {
-               min_fd = min(CR_SCM_MAX_FD, nr_fds - i);
-               scm_fdset_init_chunk(&fdset, min_fd, data, ch_size);
+               char dummy;
 
+               min_fd = min(CR_SCM_MAX_FD, nr_fds - i);
+               fdset.hdr.msg_controllen = CR_SCM_MSG_SIZE;//CMSG_LEN(sizeof(int) * nr_fds);
+               if (data) {
+                       fdset.iov.iov_base = data;
+                       fdset.iov.iov_len = ch_size;
+               } else {
+                       fdset.iov.iov_base = &dummy;
+                       fdset.iov.iov_len = 1;
+               }
+               //scm_fdset_init_chunk(&fdset, min_fd, data, ch_size);
+
+               pr_err("> msg_controllen %ld\n", fdset.hdr.msg_controllen);
                ret = __sys(recvmsg)(sock, &fdset.hdr, flags);
+               pr_err("< msg_controllen %ld\n", fdset.hdr.msg_controllen);
                if (ret <= 0)
                        return ret ? __sys_err(ret) : -ENOMSG;

avagin avatar Jun 14 '22 00:06 avagin

zdtm/static/env00 test passes. I applied the first patch, the problem unfortunately still persists. Linux: Linux pc 5.17.13-200.fc35.x86_64 #1 SMP PREEMPT Mon Jun 6 14:38:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux criu-restore-strace.log criu-lazy-strace.log

ZeyadYasser avatar Jun 15 '22 21:06 ZeyadYasser

I noticed that if I changed sendfd to an fd other than the thread's uffd, the fd is sent successfully. I just tried sending fd 1. https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/uffd.c#L224

ZeyadYasser avatar Jun 15 '22 21:06 ZeyadYasser

I applied the first patch, the problem unfortunately still persists.

Could you check that you attached the right logs?

avagin avatar Jun 15 '22 23:06 avagin

I was able to reproduce this issue on my host. Something is wrong with selinux rules. In ftrace output, I see this call trace: scm_detach_fds -> __receive_fd -> security_file_receive -> selinux_file_receive->...

bash-5.1# ausearch -c criu --raw
...
type=AVC msg=audit(1655353538.827:25840): avc:  denied  { read write } for  pid=565155 comm="criu" path="anon_inode:[userfaultfd]" dev="anon_inodefs" ino=1031810 scontext=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:container_runtime_t:s0 tclass=anon_inode permissive=1

@adrianreber we need your help here.

avagin avatar Jun 16 '22 04:06 avagin

I was able to reproduce this issue on my host. Something is wrong with selinux rules. In ftrace output, I see this call trace: scm_detach_fds -> __receive_fd -> security_file_receive -> selinux_file_receive->...

bash-5.1# ausearch -c criu --raw
...
type=AVC msg=audit(1655353538.827:25840): avc:  denied  { read write } for  pid=565155 comm="criu" path="anon_inode:[userfaultfd]" dev="anon_inodefs" ino=1031810 scontext=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:container_runtime_t:s0 tclass=anon_inode permissive=1

@adrianreber we need your help here.

You are absolutely right, I tried temporarily disabling selinux and it worked on my host. @avagin Thank you so much!

ZeyadYasser avatar Jun 16 '22 05:06 ZeyadYasser

At the moment we have a blanket rule in selinux-policy to allow all processes to use userfaultfd (and other anon_inodes), but we didn't account for the case where such file descriptor is sent from one domain to another... Short term, I think we should change allow domain self:anon_inode common_anon_inode_perms; in policy/modules/kernel/domain.te to allow domain domain:anon_inode common_anon_inode_perms; so that inter-process uffd passing is not blocked. (@zpytela will likely want you to file a Fedora BZ ticket for that change, or maybe he'll open a PR right away.)

Long term, we are working on a better support of the "anon_inode" class in selinux-policy, where individual domains will need rules based on what operations they do over the uffd - I took a note to make sure container_runtime_t gets the necessary rules when we do that.

WOnder93 avatar Jun 17 '22 14:06 WOnder93

@WOnder93 you mean like this? https://github.com/fedora-selinux/selinux-policy/pull/1252 I am sure the commit message would use some improvement.

zpytela avatar Jun 22 '22 19:06 zpytela

@zpytela Sorry, I forgot to reply here... Yes, that's pretty much what I had in mind.

WOnder93 avatar Jul 08 '22 13:07 WOnder93

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar Aug 08 '22 00:08 github-actions[bot]