criu icon indicating copy to clipboard operation
criu copied to clipboard

Allow CRIU to be used as non-root with CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN

Open adrianreber opened this issue 5 years ago • 42 comments

With the almost finished discussion about CAP_CHECKPOINT_RESTORE I reworked my test code to be ready for review.

This contains the changes to run CRIU as non-root with either CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE. As Travis does not yet support CAP_CHECKPOINT_RESTORE I am using CAP_SYS_ADMIN during Travis testing, but locally it works just the same with CAP_CHECKPOINT_RESTORE.

This PR checks if the euid is non-zero and if it is non-zero it checks if CRIU has either CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE. If that is true CRIU remembers the euid and decides at different places in the code to not do certain things which do not work as non-root.

This is a minimal start of non-root CRIU and it can be used with zdtm to test with env00. As seen in the patch.

This is based on @nviennot and my test branches for the lkml CAP_CHECKPOINT_RESTORE discussions.

adrianreber avatar Jul 25 '20 13:07 adrianreber

1. We need solid testing without CAP_SYS_ADMIN, where we run with just with CAP_CHECKPOINT_RESTORE and CAP_PTRACE, as advertised. Otherwise, we can't develop properly. This may require us to use a custom kernel (could be compiled and put as a binary blob in some repo we own) that we run with qemu.

Thanks to @brauner the CAP_CHECKPOINT_RESTORE changes have been merged and should be available with RC1 (https://github.com/torvalds/linux/commit/74858abbb1032222f922487fd1a24513bbed80f9). As we are already using Vagrant to test the latest Fedora release we can easily add another Vagrant based test and install the latest RC1 kernel via https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories which provides all RC releases:

https://repos.fedorapeople.org/repos/thl/kernel-vanilla-mainline/fedora-32/x86_64/

2. CLI options: should we require users to manually disable features (like network), or automatically disable features when we see we don't have the right capability. Another way to look at this is to be tolerant as much as we can to EPERM errors.

Any reason to not automatically disable features if we already detect that it will not work?

3. CRIU should not look at uid. Rather, it should look at its effective capabilities (and perhaps whether it is running in a user namespace, as it makes all the `capable(CAP_*)` tests fail in the kernel, which matters for things like `SO_SNDBUFFORCE`).

Sounds good. Like you mentioned above looking at /proc/self/status sounds like a good approach.

4. At some point, we should motivate the docker/kubernetes people to add CAP_CHECKPOINT_RESTORE in the set of bounded capabilities (so users don't have to add it manually).

I can help

Thanks. In your review you mentioned a lot of different capabilities we should test and this sounds like a good approach. I would start with reading /proc/self/status and only focus on CAP_CHECKPOINT_RESTORE for now before looking at what other capabilities are needed. This could be done in follow up pull requests.

adrianreber avatar Aug 05 '20 10:08 adrianreber

  1. Wonderful

Any reason to not automatically disable features if we already detect that it will not work?

Sometimes, failing can be preferable:

  1. Let say that one is using CRIU for some time with some settings that they like in a managed kubernetes environment.
  2. The hosting provider changes the kubernetes environment, changing the default capabilities that a container has
  3. CRIU will then skip certain things to checkpoint (or restore) as a result. However, the user will not know about the change of behavior. This can potentially lead to an silent application failure.

So I'd argue that we want a deterministic behavior given a set of CLI options.

I see two approaches: a) We can skip individual features: So we can throw errors like "CAP_SYS_ADMIN is not detected. Please retry with --skip-lsm --skip-X -skip-Y" where X and Y are features that need CAP_SYS_ADMIN to work. b) We can supply the list of capabilities that should not be used by CRIU via the CLI, e.g. "--no-cap-sys-admin". With the list of caps that we have and don't have, we can disables all non-compatible features on boot.

b) is a superset of a) and would be easy to use for users. Most importantly, the behavior of CRIU is deterministic given a list of CLI options.

nviennot avatar Aug 05 '20 15:08 nviennot

I created a vagrant based test using a 5.8.0 kernel for now. Once the Fedora vanilla kernel mainline repository picks up 5.9.0-rc1 the test should work. Right now I added the test more as a proof of concept. To see how it could look like.

https://travis-ci.org/github/checkpoint-restore/criu/jobs/715943265

adrianreber avatar Aug 08 '20 08:08 adrianreber

Test is now running on Fedora with 5.9.0-rc0 kernel and CAP_CHECKPOINT_RESTORE

adrianreber avatar Aug 14 '20 20:08 adrianreber

I rebased this and pushed a new version in which I tried to switch to using /proc/pid/status.

adrianreber avatar Sep 02 '20 13:09 adrianreber

Pushed again to fix a CI error with usernsd.

adrianreber avatar Sep 02 '20 14:09 adrianreber

@nviennot Any comments from your side to the current patches?

adrianreber avatar Sep 29 '20 12:09 adrianreber

@adrianreber running pre-dump as non-root user fails with:

(00.023457) Error (criu/mem.c:44): BUG at criu/mem.c:44

rst0git avatar Sep 29 '20 21:09 rst0git

@rst0git What does happen if you run pre-dump as non-root before my patches? I am aware that my patches are only touching a small part of what needs to be done to have the complete code base ready for CAP_CHECKPOINT_RESTORE. But it at least provides a working checkpoint and restore implementation as non-root.

I would prefer an incremental approach on the criu-dev branch. If the patches in this PR are correct I would rather see this merged and then look at the pre-dump use case.

adrianreber avatar Sep 30 '20 13:09 adrianreber

@rst0git What does happen if you run pre-dump as non-root before my patches?

@adrianreber I can confirm that the same error occurs before applying the patches in this PR and I agree with you that incremental approach on the criu-dev branch would be better.

rst0git avatar Sep 30 '20 16:09 rst0git

A friendly reminder that this PR had no activity for 30 days.

github-actions[bot] avatar Jan 07 '21 00:01 github-actions[bot]

I'll take a look next week

nviennot avatar Jan 07 '21 01:01 nviennot

@adrianreber @nviennot l What is going on with this PR. Let's make it ready for the next round of reviews.

avagin avatar Jan 24 '21 18:01 avagin

Sorry, I'm swamped right now. I didn't find time to look at this. Hopefully this month I'll make some progress. Sorry.

nviennot avatar Jan 28 '21 05:01 nviennot

A friendly reminder that this PR had no activity for 30 days.

github-actions[bot] avatar Feb 28 '21 00:02 github-actions[bot]

A friendly reminder that this PR had no activity for 30 days.

github-actions[bot] avatar Apr 01 '21 00:04 github-actions[bot]

People are waiting for this feature to arrive as soon as possible. It is very important to be able to run the CRIU as a non-root user. @nviennot please take into considaration.

PavloMykhailyshyn avatar Apr 06 '21 16:04 PavloMykhailyshyn

Yes, I'm sorry I've been quite unresponsive on this issue. I'm working on clearing some space for tackling this.

nviennot avatar Apr 06 '21 16:04 nviennot

@adrianreber could you rebase this pr and close comments that have been resolved. Thanks!

avagin avatar Apr 06 '21 17:04 avagin

So, the non-root test works, but is seems I broke a couple of other things. Need to investigate.

adrianreber avatar Apr 21 '21 07:04 adrianreber

I updated the PR and now all tests are green. I also added two zdtm test cases (env00 and pthread00) to be tested as non-root.

This is still a limited implementation of non-root CRIU but it works (for the two tests at least). It is a proof of concept.

I tried to use capabilities were possible. Unfortunately there are a couple of places where it does not make sense to check for capabilities. Right now the kerndat cache file is not written at all. We would need to probably just write it to the user's home-directory if uid != 0.

adrianreber avatar Apr 21 '21 18:04 adrianreber

Right now the kerndat cache file is not written at all. We would need to probably just write it to the user's home-directory if uid != 0.

Well, ~/.criu.kdat does not sound bad to me.

rppt avatar Apr 23 '21 09:04 rppt

Right now the kerndat cache file is not written at all. We would need to probably just write it to the user's home-directory if uid != 0.

Well, ~/.criu.kdat does not sound bad to me.

I also thought about it and the current design, having it on a tmpfs, makes sure the file is gone after a reboot. Maybe we could track CLOCK_MONOTONIC to see if we should recreate .criu.kdat. Hopefully no one is running CRIU in a time namespace :wink: - that would break CLOCK_MONOTONIC as an indicator if the system has been rebooted.

adrianreber avatar Apr 23 '21 10:04 adrianreber

Well, ~/.criu.kdat does not sound bad to me.

I also thought about it and the current design, having it on a tmpfs, makes sure the file is gone after a reboot. Maybe we could track CLOCK_MONOTONIC to see if we should recreate .criu.kdat. Hopefully no one is running CRIU in a time namespace wink - that would break CLOCK_MONOTONIC as an indicator if the system has been rebooted.

There is a related issue #1441 about detecting system reboots. I didn't check the details, maybe it does not help here.

rppt avatar Apr 23 '21 12:04 rppt

Well, ~/.criu.kdat does not sound bad to me.

I also thought about it and the current design, having it on a tmpfs, makes sure the file is gone after a reboot. Maybe we could track CLOCK_MONOTONIC to see if we should recreate .criu.kdat. Hopefully no one is running CRIU in a time namespace wink - that would break CLOCK_MONOTONIC as an indicator if the system has been rebooted.

There is a related issue #1441 about detecting system reboots. I didn't check the details, maybe it does not help here.

I do not think that #1441 can help here. It is more for processes being migrated than for CRIU.

adrianreber avatar Apr 26 '21 13:04 adrianreber

I tried to implement most of the changes from @rppt's review. I did not move the uid 0 test to a function because it felt weird to replace one line of code with a function.

The biggest changes in this update is that now criu check should work as non-root. I also added a run of criu check to the CI test case. I also changed the CI test case to completely run as non-root. Only the setcap call runs as root. Building and testing is all done without root.

Let's see if CI likes my latest changes.

adrianreber avatar Apr 26 '21 15:04 adrianreber

I did some manual testing with this PR applied and took some notes. In case this could be of any help:

  • rootless required cap_sys_ptrace additionally to cap_checkpoint_restore on my system
  • rootless dump using criu with --shell-job works but restore doesn't, programs have to be completely detached from shells right now ((criu/tty.c:846): tty: Can't set tty params on 0x8: Operation not permitted). Capabilities were set to criu.
  • the libcriu C RPC API doesn't work rootless and fails very early ((criu/image.c:550): Can't open dir /proc/4160/fd/3: Permission denied). Capabilities were set to the c program.
  • all experiments worked with root, even a self checkpointing JVM using libcriu with tcp established etc (although i found other bugs regarding libcriu, but this would be off topic)

mbien avatar Apr 27 '21 14:04 mbien

I did some manual testing with this PR applied and took some notes. In case this could be of any help:

Thanks a lot for testing.

* rootless required cap_sys_ptrace additionally to cap_checkpoint_restore on my system

Either that or echo 0 > /proc/sys/kernel/yama/ptrace_scope. The ptrace_scope value seems to be different depending on the used distribution.

* rootless dump using criu with --shell-job works but restore doesn't, programs have to be completely detached from shells right now ((criu/tty.c:846): tty: Can't set tty params on 0x8: Operation not permitted). Capabilities were set to criu.

I have seen that. Not sure if that can be fixed. non-root checkpoint/restore will have limitations. Maybe that is one of them.

* the libcriu C RPC API doesn't work rootless and fails very early ((criu/image.c:550): Can't open dir /proc/4160/fd/3: Permission denied). Capabilities were set to the c program.

Hmm. Never tried it and I do not expect it to work, but capabilities need to be set on the CRIU binary as libcriu is just wrapper around the CRIU binary.

* all experiments worked with root, even a self checkpointing JVM using libcriu with tcp established etc (although i found other bugs regarding libcriu, but this would be off topic)

adrianreber avatar Apr 27 '21 15:04 adrianreber

  • the libcriu C RPC API doesn't work rootless and fails very early ((criu/image.c:550): Can't open dir /proc/4160/fd/3: Permission denied). Capabilities were set to the c program.

Hmm. Never tried it and I do not expect it to work, but capabilities need to be set on the CRIU binary as libcriu is just wrapper around the CRIU binary.

i tried that too actually, just wanted to keep the post short. If i set caps to criu without giving the c program any i get this during checkpoint:

(00.000023) Version: 3.15 (gitid v1.5-5425-gb19af3c6)
(00.000021) Running on longbow Linux 5.10.30-1-MANJARO #1 SMP Wed Apr 14 08:07:27 UTC 2021 x86_64
(00.000028) Warn  (criu/kerndat.c:869): Can't load /run/criu.kdat
(00.000038) sockets: Probing sock diag modules
(00.000074) sockets: Done probing
(00.001256) Error (criu/util.c:643): exited, status=4
(00.002420) Error (criu/util.c:643): exited, status=4
(00.002464) Error (criu/kerndat.c:54): Can't open 0/pagemap on procfs: Permission denied
(00.002471) Error (criu/kerndat.c:1192): check_pagemap failed when initializing kerndat.
(00.002494) Found mmap_min_addr 0x10000
(00.002504) files stat: fs/nr_open 1073741816

mbien avatar Apr 27 '21 15:04 mbien

If that is true CRIU remembers the euid and decides at different places in the code to not do certain things which do not work as non-root.

In such cases, we need to get approval from a user to work in this mode. I mean we need to add a new criu option and describe in the documentation what certain things will not be made in this case.

And in the code, we will check the option flag rather than opts.uid. IMHO this will look more cleaner. Right now, this series checks uid when we need to check capabilities (CAP_SYS_RESOURCE, CAP_NET_ADMIN, CAP_SETGID, CAP_SETUID, etc). The uid check doesn't take into account a case when a user want to dump processes in a separate user namespace where criu has all set of capabilities and its uid is 0.

avagin avatar Apr 29 '21 17:04 avagin