criu Allow CRIU to be used as non-root (Take 2)

We at the OpenJ9 and Open Liberty projects have been experimenting with CRIU to improve JVM and Java web application start-up times, primarily in container deployments.[^1] To that end we've been testing #1155 for a while now and having success and we want to help get it across the finish line.

With @adrianreber's agreement I've rebased the patches in #1155 and added 4 additional patches as follows:

Patches to add the --unprivileged option to libcriu and the RPC interface, and a fix to do the required check_caps() on the service worker path, equivalent to what's done on the tool path:

3807cc9cd non-root: Call check_caps() in service worker mode as well
8df76f09b non-root: Allow unprivileged flag to be set via libcriu

Patches to move some new code (that I believe requires root) recently landed in criu-dev to the root-only init path introduced by in #1155:

b32f887ad non-root: Do kerndat_has_move_mount_set_group() in root-only init
f42e25fe7 non-root: Do kerndat_has_nftables_concat() in root-only init

As suggested by @adrianreber I'm currently working on where to store criu.kdat, if/how to use XDG_RUNTIME_PATH for that purpose, and what to do when it's not set (e.g. when running via sudo).

I'm also looking over #1155 to see if there are any other unresolved questions.

In the meantime I wanted to open this PR now in case anyone has any fresh thoughts on the subject, comments on how to proceed, and so on.

[^1]: This use-case is a bit different from using CRIU to facilitate process migration and has some unique challenges, including the fact that we're using CRIU "under the covers" in a scenario that to end-users shouldn't appear too different from simply starting a process from scratch. Allowing non-root users to restore processes, particularly inside unprivileged containers, makes this use-case much more accessible to end-users because it reduces the privileges required to "start" a process in this way, relative to starting the process from scratch.

Jun 28 '22 13:06 ymanton

@ymanton Thanks for picking up #1155

I had following proposal in a chat how to deal with kdat as non-root. Currently the nice thing about the kdat file is that will disappear after a reboot because it is on a tmpfs. Unfortunately as non-root there is no easy way to allow all users to write to /run. My proposal was to use XDG_RUNTIME_DIR (like I have seen it by container engines) if it is defined. It usually points to a tmpfs. This way we automatically have a location for each user for kdat. Unfortunately XDG_RUNTIME_DIR is not set when doing su/sudo. Only a real login session set up by systemd has XDG_RUNTIME_DIR pointing to the right location. If XDG_RUNTIME_DIR is not set or not writeable, non-root CRIU should fall back to just ignore kdat and run all the checks all the time. A correctly set up session will correctly use kdat but a session through sudo/su will still work, although CRIU might be slower as it has to redo all checks everytime.

Jun 28 '22 14:06 adrianreber

Went back and fixed the kdat file XDG_RUNTIME_DIR patch to allow the tmpfs check be done in non-root mode as well (was previously skipped since the kdat file was being stored in HOME).
Fixed some linking issues affecting unittest (it now uses some of the new caps util funcs @adrianreber added but the function defs were not visible at link time).
Addressed some review comments.

Jul 25 '22 19:07 ymanton

Thanks for working on this. I want to ask one more thing. We are trying to support the commit history in the main tree as clean as possible. The main idea is very similar to the Linux kernel process. In this PR, I see that a few patches fix problems introduced by other patches. I think we need to merge all fixes in proper changes. I understand that the origin patches are authored by @adrianreber. I think we can use tags like Co-authored-by, Co-developed-by, Originally-by to mention all developers involved in the process.

Aug 10 '22 16:08 avagin

Thanks for working on this. I want to ask one more thing. We are trying to support the commit history in the main tree as clean as possible. The main idea is very similar to the Linux kernel process. In this PR, I see that a few patches fix problems introduced by other patches. I think we need to merge all fixes in proper changes. I understand that the origin patches are authored by @adrianreber. I think we can use tags like Co-authored-by, Co-developed-by, Originally-by to mention all developers involved in the process.

No problem, thanks for letting me know. I've merged all my changes together with @adrianreber's and added Co-authored-bys to the patches that were significantly altered. Going forward I'll maintain them this way as things progress.

Aug 15 '22 22:08 ymanton

A friendly reminder that this PR had no activity for 30 days.

Sep 26 '22 01:09 github-actions[bot]

Hi, apologies if this is the wrong way to address this (i am fairly new to using open-source gh stuff, please redirect me to the correct place if this is the wrong one). I've been doing some testing of this branch to see if it might be useful for my use-case when merged, and in order to dump a process, i need to give criu the cap_sys_ptrace capability in addition to cap_checkpoint_restore or cap_sys_admin or else it fails with this error message:

Warn  (compel/src/lib/infect.c:133): Unable to interrupt task: 3891500 (Operation not permitted)
Error (compel/src/lib/infect.c:418): Unable to detach from 3891500: No such process
Error (criu/cr-dump.c:2059): Dumping FAILED.

but when given cap_sys_ptrace it works as expected for processes not requiring the --shell-job / -j option.

Furthermore, processes that require the -j option are dumpable with cap_checkpoint_restore and cap_sys_ptrace but must be restored with cap_sys_admin due to failing on an ioctl(fd , TIOCSLCKTRMIOS) call which requires cap_sys_admin according to the manpage. If there is anything I can do about this, or if I should be able to do this, but am doing something wrong please let me know.

I hope this information is at least somewhat useful to you.

Oct 06 '22 15:10 anna-singleton

@anna-singleton Everything you said is correct as far as I remember. non-root checkpoint/restore is limited in what it can do at this point and to avoid CAP_SYS_ADMIN for restore you currently cannot not use -j. Using a process without stdin/stdout/stderr should work.

Oct 06 '22 15:10 adrianreber

alright thanks for the confirmation. Would it be helpful to add the information about cap_sys_ptrace in the NON-ROOT section of the manpage? I am happy to do this (i presume i can PR the branch), since it would help others trying to use --unprivileged since I could not find this documented previously.

Oct 06 '22 15:10 anna-singleton

@anna-singleton @ymanton The author of this PR talked about this PR during Linux Plumbers Conference a couple of weeks ago and I am pretty sure he mentioned CAP_SYS_PTRACE. @ymanton you did, right?

I think CAP_SYS_PTRACE can be avoid if /proc/sys/kernel/yama/ptrace_scope is set to 0. I think. Not sure. It is 0 on my systems.

@anna-singleton maybe you could provide a follow-up PR once this is merged. Not sure. Maybe @ymanton has an idea.

Oct 06 '22 15:10 adrianreber

Yeah I ran into these kinds of issues as well and talked about them a bit, but forgot about the manpage patch in this PR.

Adding CAP_SYS_PTRACE and/or ptrace_scope and other such limitations is a good idea. @anna-singleton feel free to suggest changes via PR or the GitHub suggestions mechanism or whatever else works. I'll update the manpage text as well with any relevant info before this gets merged.

Oct 06 '22 18:10 ymanton

@ymanton I think I am ready to merge it to the criu-dev branch, but you need to fix CI failures.

Oct 18 '22 06:10 avagin

Why CI checks have not been executed?..

Oct 19 '22 14:10 avagin

Why CI checks have not been executed?..

There was an GitHub Actions outage yesterday. @ymanton please push once more to trigger CI.

Oct 19 '22 14:10 adrianreber

The cross-compile test results fail sometimes because of broken repositories like in this case and can be ignored.

The CentOS 7 errors are because of a package installation problem. @rst0git I think the epel package installation fix is from you. Can you maybe add a ||: after the epel activation line. The problem is that on CentOS 7 epel might already be installed, if you test it in RHEL the package is not installed. CentOS 7 is anyway not relevant for the non-root CRIU operation as the kernel is too old.

The Vagrant errors seem to be something new for shared memory already tracked somewhere else: #1982

The non-root Vagrant test setup is running successfully.

I think from the CI side this is ready.

Oct 20 '22 06:10 adrianreber

@adrianreber Thanks for double checking, I was unsure about Vagrant Fedora based test (no VDSO) and started looking at it.

Oct 20 '22 13:10 ymanton

@ymanton thank you for moving this pr to the finish line. @adrianreber thank you for implementing the kernel part and the initial version of the userspace changes.

Oct 25 '22 14:10 avagin

Perhaps this should better be asked in a separate issue, but I'm not sure if what I'm asking about is reasonable, so here goes.

One thing this PR has not handled at all is map_files. Problem is, proc(5) says that

Capabilities are required to read the contents of the symbolic links in this directory: before Linux 5.9, the reading process requires CAP_SYS_ADMIN in the initial user namespace; since Linux 5.9, the reading process must have either CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE in the user namespace where it resides.

...but that does not match what the kernel does, e.g. on HEAD we have

https://github.com/torvalds/linux/blob/27bc50fc90647bbf7b734c3fc306a5e61350da53/fs/proc/base.c#L2249

...which checks whether CAP_CHECKPOINT_RESTORE is present in the init user namespace. This means that reading map_files always fails in containers. Now one would think this is seldom needed, but the kernel uses a deleted /dev/zero device for every shared anonymous mapping:

https://github.com/torvalds/linux/blob/aae703b02f92bde9264366c545e87cec451de471/mm/shmem.c#L4243-L4264

I am not sure if deleted files can be gracefully handled in the general case, but the comment suggests so. Would this be an incentive to implement that sooner than later?

UPD: actually, the current logic seems fishy as is:

if (is_memfd(vfi_dev)) {
    ...
}
if (is_hugetlb_dev(vfi_dev, &hugetlb_flag) || is_anon_shmem_map(vfi_dev)) {
    ...
}

because the implementation of is_memfd matches that of is_anon_shmem_map.

Oct 27 '22 23:10 purplesyringa

You certainly know what you are doing more than me, but perhaps it might be of interest to you to know what I needed to patch to get CRIU of httpd in podman working:

Patch

diff --git a/criu/proc_parse.c b/criu/proc_parse.c
index 946b0fc40..a24783da7 100644
--- a/criu/proc_parse.c
+++ b/criu/proc_parse.c
@@ -313,25 +313,8 @@ static int vma_get_mapfile_user(const char *fname, struct vma_area *vma, struct
 
 	vfi_dev = makedev(vfi->dev_maj, vfi->dev_min);
 
-	if (is_memfd(vfi_dev)) {
-		char tmp[PATH_MAX];
-		strlcpy(tmp, fname, PATH_MAX);
-		strip_deleted(tmp, strlen(tmp));
-
-		/*
-		 * The error EPERM will be shown in the following pr_perror().
-		 * It comes from the previous open() call.
-		 */
-		pr_perror("Can't open mapped [%s]", tmp);
-
-		/*
-		 * TODO Perhaps we could do better than failing and dump the
-		 * memory like what is being done in shmem.c
-		 */
-		return -1;
-	}
-
 	if (is_hugetlb_dev(vfi_dev, &hugetlb_flag) || is_anon_shmem_map(vfi_dev)) {
+		vma->e->status |= VMA_AREA_REGULAR;
 		if (!(vma->e->flags & MAP_SHARED))
 			vma->e->status |= VMA_ANON_PRIVATE;
 		else
diff --git a/criu/shmem.c b/criu/shmem.c
index 81e701586..2de2ea9af 100644
--- a/criu/shmem.c
+++ b/criu/shmem.c
@@ -724,7 +724,7 @@ static int next_data_segment(int fd, unsigned long pfn, unsigned long *next_data
 	return 0;
 }
 
-static int do_dump_one_shmem(int fd, void *addr, struct shmem_info *si)
+static int do_dump_one_shmem(int fd, void *addr, struct shmem_info *si, bool seek_data_supported)
 {
 	struct page_pipe *pp;
 	struct page_xfer xfer;
@@ -750,7 +750,7 @@ static int do_dump_one_shmem(int fd, void *addr, struct shmem_info *si)
 		unsigned long pgaddr;
 		int st = -1;
 
-		if (pfn >= next_hole_pfn && next_data_segment(fd, pfn, &next_data_pnf, &next_hole_pfn))
+		if (seek_data_supported && pfn >= next_hole_pfn && next_data_segment(fd, pfn, &next_data_pnf, &next_hole_pfn))
 			goto err_xfer;
 
 		if (si->pstate_map && is_shmem_tracking_en()) {
@@ -808,20 +808,50 @@ static int dump_one_shmem(struct shmem_info *si)
 {
 	int fd, ret = -1;
 	void *addr;
+	bool seek_data_supported;
 
 	pr_info("Dumping shared memory %ld\n", si->shmid);
 
-	fd = open_proc(si->pid, "map_files/%lx-%lx", si->start, si->end);
-	if (fd < 0)
-		goto err;
 
-	addr = mmap(NULL, si->size, PROT_READ, MAP_SHARED, fd, 0);
-	if (addr == MAP_FAILED) {
-		pr_err("Can't map shmem 0x%lx (0x%lx-0x%lx)\n", si->shmid, si->start, si->end);
-		goto errc;
+	fd = __open_proc(si->pid, EPERM, O_RDONLY, "map_files/%lx-%lx", si->start, si->end);
+	if (fd >= 0) {
+		addr = mmap(NULL, si->size, PROT_READ, MAP_SHARED, fd, 0);
+		if (addr == MAP_FAILED) {
+			pr_err("Can't map shmem 0x%lx (0x%lx-0x%lx)\n", si->shmid, si->start, si->end);
+			goto errc;
+		}
+
+		seek_data_supported = true;
+	} else {
+		if(errno != EPERM) {
+			goto err;
+		}
+
+		fd = open_proc(si->pid, "mem");
+		if(fd < 0) {
+			goto err;
+		}
+
+		addr = mmap(NULL, si->size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+		if (addr == MAP_FAILED) {
+			pr_err("Can't map empty space for shmem 0x%lx (0x%lx-0x%lx)\n", si->shmid, si->start, si->end);
+			goto errc;
+		}
+
+		if(lseek(fd, si->start, SEEK_SET) < 0) {
+			pr_perror("Can't seek virtual memory");
+			return -1;
+		}
+
+		if(read(fd, addr, si->size) < si->size) {
+			pr_perror("Can't read virtual memory");
+			return -1;
+		}
+
+		seek_data_supported = false;
 	}
 
-	ret = do_dump_one_shmem(fd, addr, si);
+	ret = do_dump_one_shmem(fd, addr, si, seek_data_supported);
 
 	munmap(addr, si->size);
 errc:
@@ -849,7 +879,7 @@ int dump_one_memfd_shmem(int fd, unsigned long shmid, unsigned long size)
 		goto err;
 	}
 
-	ret = do_dump_one_shmem(fd, addr, &si);
+	ret = do_dump_one_shmem(fd, addr, &si, true);
 
 	munmap(addr, size);
 err:
@@ -875,7 +905,7 @@ int dump_one_sysv_shmem(void *addr, unsigned long size, unsigned long shmid)
 	if (fd < 0)
 		return -1;
 
-	ret = do_dump_one_shmem(fd, addr, si);
+	ret = do_dump_one_shmem(fd, addr, si, true);
 	close(fd);
 	return ret;
 }
diff --git a/criu/sockets.c b/criu/sockets.c
index db772707b..d50866c65 100644
--- a/criu/sockets.c
+++ b/criu/sockets.c
@@ -12,6 +12,7 @@
 
 #include "int.h"
 #include "bitops.h"
+#include "cr_options.h"
 #include "libnetlink.h"
 #include "sockets.h"
 #include "unix_diag.h"
@@ -457,6 +458,27 @@ int sk_collect_one(unsigned ino, int family, struct socket_desc *d, struct ns_id
 
 int do_restore_opt(int sk, int level, int name, void *val, int len)
 {
+	void *buf;
+	socklen_t cur_len;
+	bool match;
+
+	if(opts.unprivileged) {
+		buf = xmalloc(len);
+		if (!buf) {
+			return -1;
+		}
+		cur_len = len;
+		if (getsockopt(sk, level, name, buf, &cur_len) < 0) {
+			pr_perror("Can't get %d:%d (len %d)", level, name, len);
+			return -1;
+		}
+		match = cur_len == len && memcmp(buf, val, len) == 0;
+		xfree(buf);
+		if(match) {
+			return 0;
+		}
+	}
+
 	if (setsockopt(sk, level, name, val, len) < 0) {
 		pr_perror("Can't set %d:%d (len %d)", level, name, len);
 		return -1;
@@ -469,9 +491,20 @@ static int sk_setbufs(void *arg, int fd, pid_t pid)
 {
 	u32 *buf = (u32 *)arg;
 
-	if (restore_opt(fd, SOL_SOCKET, SO_SNDBUFFORCE, &buf[0]))
+	int snd_opt_name;
+	int rcv_opt_name;
+
+	if (!opts.unprivileged) {
+		snd_opt_name = SO_SNDBUFFORCE;
+		rcv_opt_name = SO_RCVBUFFORCE;
+	} else {
+		snd_opt_name = SO_SNDBUF;
+		rcv_opt_name = SO_RCVBUF;
+	}
+
+	if (restore_opt(fd, SOL_SOCKET, snd_opt_name, &buf[0]))
 		return -1;
-	if (restore_opt(fd, SOL_SOCKET, SO_RCVBUFFORCE, &buf[1]))
+	if (restore_opt(fd, SOL_SOCKET, rcv_opt_name, &buf[1]))
 		return -1;
 
 	return 0;

The omission of vma->e->status |= VMA_AREA_REGULAR; seems to be a somewhat unrelated bug that should probably have been noticed earlier. Using lseek + read + vmsplice is inefficient, I guess process_vm_readv should be used instead, but I just wanted to get it working first. I also updated do_restore_opt to not set options if they are already correct because we might not have enough permissions for that, and SO_RCVBUF/SO_SNDBUF are used instead of their *FORCE counterparts in unprivileged mode. Maybe some of this is technically wrong, please take a better look (e.g. perhaps inaccessible map_files should be handled similarly in dump_one_sysv_shmem too).

I tested this inside

podman run -it --cap-add=cap_checkpoint_restore --cap-add=cap_sys_ptrace --cap-add=cap_sys_resource --security-opt seccomp=unconfined --security-opt unmask=/proc/sys php:8.1-apache bash

CAP_CHECKPOINT_RESTORE is needed for obvious reasons, CAP_SYS_PTRACE is for process_vm_readv IIRC, CAP_SYS_RESOURCE is for rlimit, seccomp is so that vmsplice does not yield EPERM (I don't quite understand the reason for this one tbh), /proc/sys is unmasked for ns_next_pid. CAP_NET_ADMIN is not used due to a bug described in one of my review comments on this PR.

Hope this helps someone.

Oct 28 '22 02:10 purplesyringa

@imachug Thanks for testing, I'll get back to you shortly on your comments.

Oct 28 '22 15:10 ymanton

Perhaps this should better be asked in a separate issue, but I'm not sure if what I'm asking about is reasonable, so here goes.

One thing this PR has not handled at all is map_files. Problem is, proc(5) says that

Capabilities are required to read the contents of the symbolic links in this directory: before Linux 5.9, the reading process requires CAP_SYS_ADMIN in the initial user namespace; since Linux 5.9, the reading process must have either CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE in the user namespace where it resides.

...but that does not match what the kernel does, e.g. on HEAD we have

https://github.com/torvalds/linux/blob/27bc50fc90647bbf7b734c3fc306a5e61350da53/fs/proc/base.c#L2249

...which checks whether CAP_CHECKPOINT_RESTORE is present in the init user namespace. This means that reading map_files always fails in containers.

You're right about the man-page, it appears to be incorrect. I've sent an email to the maintainers and list and CC'd you. Thanks for catching that. As for what to do about it or how to improve it, I'm not exactly sure. Wouldn't the kernel behaviour have to change or is there something you think CRIU can do better if we're not in the init user ns but need access to map_files contents?

Nov 01 '22 17:11 ymanton

Can we avoid using map_files in this case and read the memory of the process directly?

Nov 01 '22 17:11 purplesyringa

Ah, I guess I see what the problem is: we want to dump the whole file, and mmap may only map the file partially. Moreover, for deleted files, we might not even know if two mappings are of the same file or not. Given that mremap does not really work for shared memory, I guess the best that can be done is a) attempting to modify one mapping and checking how it affects others, to figure out which mappings are parts of the same file, and then b) dumping different parts of each file from memory of different processes, if no single process contains the whole file. This leaves the case when a part of a file is not mapped anywhere, but then we can just fill it with zeroes -- that should be indistinguishable from the "real" file if map_files is unavailable.

Nov 01 '22 17:11 purplesyringa

criu criu copied to clipboard

Allow CRIU to be used as non-root (Take 2)

criu
criu copied to clipboard