runtime-spec [RFC] allow to skip `setgroups(2)`

Open giuseppe opened this issue 4 years ago • 13 comments

There are cases where it would be necessary to skip the setgroups(2) syscall so that the original additional groups can be maintained.

It can be used, for example, by rootless containers to keep access to a storage directory that is accessible only by a secondary group.

runc already skips setgroups in some cases: if the user has euid != 0, or if /proc/self/setgroups is set to deny. I'd like to add a third condition where setgroups is also skipped when explicitly requested.
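A minimal sketch of the skip decision described above, not runc's actual code. The function names and the `keep_original_groups` parameter are hypothetical, and treating an unreadable `/proc/self/setgroups` as `allow` is an assumption made for illustration.

```python
from pathlib import Path


def read_setgroups_policy(path: str = "/proc/self/setgroups") -> str:
    """Return the policy string from the proc file ('allow' or 'deny').

    Assumption: if the file cannot be read (for example on a kernel that
    predates it), behave as if the policy were 'allow'.
    """
    try:
        return Path(path).read_text().strip()
    except OSError:
        return "allow"


def should_skip_setgroups(euid: int, policy: str,
                          keep_original_groups: bool = False) -> bool:
    """Decide whether the runtime should skip the setgroups(2) call."""
    if euid != 0:
        # Existing condition 1: the runtime is not running as root.
        return True
    if policy == "deny":
        # Existing condition 2: the kernel would refuse setgroups anyway.
        return True
    # Proposed condition 3: the caller explicitly asked to keep its groups.
    return keep_original_groups
```

A caller would pass `os.geteuid()` and `read_setgroups_policy()` and only invoke `os.setgroups(...)` when the function returns `False`.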

Do we need a new field under process/user, e.g. keepOriginalGroups? Or would it be enough to reuse additionalGids with a special value (e.g. -1 to keep the current groups)?
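To make the two alternatives concrete, here is a rough sketch of how a runtime might read either encoding from the `process.user` object. Neither form exists in the runtime-spec; `keepOriginalGroups` and the `-1` sentinel are taken from the question above, and the helper name is hypothetical.

```python
import json

# Hypothetical sentinel from the proposal: -1 means "keep the current groups".
KEEP_CURRENT_GROUPS = -1


def keeps_original_groups(user: dict) -> bool:
    """Return True if either proposed encoding asks to skip setgroups."""
    # Option A: a dedicated boolean field under process/user.
    if user.get("keepOriginalGroups") is True:
        return True
    # Option B: a special value inside the existing additionalGids list.
    return KEEP_CURRENT_GROUPS in (user.get("additionalGids") or [])


option_a = json.loads('{"uid": 1000, "gid": 1000, "keepOriginalGroups": true}')
option_b = json.loads('{"uid": 1000, "gid": 1000, "additionalGids": [-1]}')
plain = json.loads('{"uid": 1000, "gid": 1000, "additionalGids": [10, 20]}')
```

Option A keeps `additionalGids` a plain list of group IDs; option B avoids a new field but overloads the meaning of an existing one.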

Oct 29 '19 13:10 giuseppe

If we do add an option, it needs to have a really scary name (disableSetgroupSecurity or something). Not dropping supplementary groups weakens the userns security boundary, and really is something that very few people should actually want to do (not least of all because it will confuse all sorts of programs to be touching unmapped files).

In my view, the best solution to the problem of such volumes is to do exactly what LXD does -- "punch out" the GID that the storage volume is owned by (by adding a single 1:1 mapping for that GID). The most ideal solution would be the next-gen "shiftfs" work that was discussed recently, but obviously we'll have to wait for that to actually land.
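The "punch out" idea can be sketched as computing GID map entries in which one GID maps to itself and everything else is shifted. This is only an illustration of the arithmetic, not LXD's implementation; the function name and the example range are assumptions, and installing such a map still requires the privileges to map that host GID.

```python
from typing import List, Tuple

# Each entry is (gid inside the namespace, gid on the host, length),
# the same three columns used by /proc/<pid>/gid_map.
GidMapEntry = Tuple[int, int, int]


def punch_out_gid(host_base: int, size: int, gid: int) -> List[GidMapEntry]:
    """Split a shifted range so that `gid` is mapped 1:1 to the host."""
    if not 0 <= gid < size:
        raise ValueError("gid must fall inside the container range")
    if host_base <= gid < host_base + size:
        raise ValueError("gid already lies inside the shifted host range")
    entries: List[GidMapEntry] = []
    if gid > 0:
        # GIDs below the punched one keep the normal shift.
        entries.append((0, host_base, gid))
    # The storage GID itself is mapped to the same value on the host.
    entries.append((gid, gid, 1))
    if gid + 1 < size:
        # GIDs above the punched one keep the normal shift.
        entries.append((gid + 1, host_base + gid + 1, size - gid - 1))
    return entries


def format_gid_map(entries: List[GidMapEntry]) -> str:
    """Render entries in the line format expected by gid_map."""
    return "\n".join(f"{inside} {outside} {count}" for inside, outside, count in entries)
```

For example, `punch_out_gid(100000, 65536, 1001)` yields three entries, with the middle one mapping GID 1001 to itself.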

Oct 29 '19 13:10 cyphar

I am skeptical, and think it could be a long wait, especially to get it upstream.

Oct 29 '19 13:10 rhatdan

(Also I would seriously suggest that this is functionality that should be exposed through a runtime-specific annotation and not a first-class field in config.json -- the runtime-spec already has lots of really odd features we probably shouldn't have added, and this one just rubs me the wrong way.)
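As a rough illustration of the annotation approach, a runtime could look for an opt-in key in the `annotations` map of config.json. The key `run.oci.keep_original_groups` and the value `1` are the ones used in a crun invocation later in this thread; the helper name and the strict comparison against `"1"` are assumptions.

```python
# Annotation key as it appears in the crun invocation quoted in this thread.
KEEP_GROUPS_ANNOTATION = "run.oci.keep_original_groups"


def annotation_requests_keep_groups(config: dict) -> bool:
    """Return True if config.json opts in to skipping setgroups."""
    # Annotations are a string-to-string map; a missing map counts as empty.
    annotations = config.get("annotations") or {}
    # Assumption: only the literal string "1" enables the behaviour.
    return annotations.get(KEEP_GROUPS_ANNOTATION) == "1"
```

Because the key lives in `annotations`, runtimes that do not know it can ignore it and config.json needs no schema change.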

Oct 29 '19 13:10 cyphar

> In my view, the best solution to the problem of such volumes is to do exactly what LXD does -- "punch out" the GID that the storage volume is owned by (by adding a single 1:1 mapping for that GID). The most ideal solution would be the next-gen "shiftfs" work that was discussed recently, but obviously we'll have to wait for that to actually land.

How would it work with rootless containers, or in general with IDs that are not mapped in the current namespace? I guess rootless containers still won't be able to map arbitrary IDs from the host.

Oct 29 '19 13:10 giuseppe

Hello @cyphar, I agree with Giuseppe, since I have a customer with exactly the same problem: he wants to use rootless podman but is currently blocked because the secondary groups needed to access mounted directories are missing. I'm also not sure why you think this would weaken the security boundary: all groups the user has configured are expected to be inherited with such an option, so why limit it to the main group only? As for what kind of option runc should use, I'm of course open to any naming, since that is less relevant. What is relevant for my customer is that his decision to go with rootless podman depends on whether this feature works; otherwise his only option is to run it as root, which from a security point of view is clearly worse.

Thanks for letting me know, Cisco.

Dec 18 '19 12:12 tentator

This feature has become very popular in rootless Podman. We are seeing lots of users who need access to files and devices via supplemental groups. We have recently made this a first-class feature of podman.

podman run --group-add keep-groups ...

Currently this is only supported in crun, and we would love to get it to work in runc. I would hope that in the future we have better support, where we could keep access to the groups as well as add groups within the user namespace, but for now this fixes a key issue rootless users are hitting. I think we see this a lot more with enterprise customers than we do in the wild.

May 03 '21 15:05 rhatdan

@kolyshkin @mrunalp @AkihiroSuda @vrothberg FYI.

May 03 '21 15:05 rhatdan

As an unprivileged user on a host, I have read/write access to various files, some via ownership and some via group membership. I can mount any files I want into my container as volumes, but I can only read/write to the ones I own. The ones I access via my group memberships can only be read/written via podman with crun, thanks to the --group-add option. I don't really understand why this is only possible via the special crun flag; if I can bring files into my rootless container by mounting them as volumes, shouldn't I be able to access them in the same way as outside the container?

I know there are technical reasons, but I think the security model should be considered differently in different contexts. Sometimes a container is used to isolate and contain an external application (i.e. something pulled from a repository) in a controlled environment and you don't want it to see or touch anything outside. But in other cases (Singularity, rootless podman), you as the user are already "outside" and you're choosing to contain yourself, so you should have full control of how that happens and how to invoke the containment tool; the same security considerations do not apply, since you can already do whatever you want on the host.

Aug 31 '21 22:08 rptaylor

@rptaylor

> I know there are technical reasons, but I think the security model should be considered differently in different contexts. Sometimes a container is used to isolate and contain an external application (i.e. something pulled from a repository) in a controlled environment and you don't want it to see or touch anything outside. But in other cases (Singularity, rootless podman), you as the user are already "outside" and you're choosing to contain yourself, so you should have full control of how that happens and how to invoke the containment tool; the same security considerations do not apply since you can already do whatever you want on the host.

I want to add that, as the sysadmin of an HPC batch cluster at a major biomed academic center, this secondary group issue is the primary reason we are using Singularity rather than rootless podman. We use secondary groups extensively so that various users and groups can work together on sensitive data sets. Containers are used to run analysis programs like TensorFlow from NVIDIA NGC, or distributed docker images of apps built on (for example) Ubuntu 20 that otherwise cannot run on the RHEL7 nodes.

May 18 '22 16:05 paulraines68

> @rptaylor
>
> > I know there are technical reasons, but I think the security model should be considered differently in different contexts. Sometimes a container is used to isolate and contain an external application (i.e. something pulled from a repository) in a controlled environment and you don't want it to see or touch anything outside. But in other cases (Singularity, rootless podman), you as the user are already "outside" and you're choosing to contain yourself, so you should have full control of how that happens and how to invoke the containment tool; the same security considerations do not apply since you can already do whatever you want on the host.
>
> I want to add that, as the sysadmin of an HPC batch cluster at a major biomed academic center, this secondary group issue is the primary reason we are using Singularity rather than rootless podman. We use secondary groups extensively so that various users and groups can work together on sensitive data sets. Containers are used to run analysis programs like TensorFlow from NVIDIA NGC, or distributed docker images of apps built on (for example) Ubuntu 20 that otherwise cannot run on the RHEL7 nodes.

In case it is useful for you: Podman, when used together with crun, supports the --group-add keep-groups extension to skip setgroups in the container.

May 18 '22 19:05 giuseppe

> If it can be useful for you: Podman when used together with crun supports the --group-add keep-groups extension to skip setgroups in the container

crun is not available on RHEL7, as far as I can find.

There is an oddity on a CentOS 8 Stream box:

$ rpm -q podman
podman-4.0.2-1.module_el8.7.0+1106+45480ee0.x86_64
$ rpm -q runc
runc-1.0.3-3.module_el8.7.0+1106+45480ee0.x86_64
$ ls -ald /tmp/gptest
drwxrws---. 2 root sysadm 4096 May 18 15:44 /tmp/gptest
$ groups
raines httpd fsdev sysadm coregp webdev hcpdata
$ podman run -it --runtime=/usr/bin/crun --userns=keep-id --group-add=keep-groups -v /tmp/gptest:/gptest b1b6387124d9 /bin/bash
raines@806c89baacd3:/$ groups
raines nogroup
raines@806c89baacd3:/$ id
uid=5829(raines) gid=5829(raines) groups=5829(raines),65534(nogroup)
raines@806c89baacd3:/$ cd /tmp/gptest
bash: cd: /tmp/gptest: No such file or directory
raines@806c89baacd3:/$ cd /gptest
raines@806c89baacd3:/gptest$ uname -a > foobar.txt
bash: foobar.txt: Permission denied
raines@806c89baacd3:/gptest$ ls -ald .
drwxrws---. 2 nobody nogroup 4096 May 18 19:44 .

The 'nogroup' thing is weird (singularity reports the groups normally), and I don't understand why I can cd to /gptest (read access) but not write.
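One likely reading of the `nogroup` entries: inside a user namespace, any host GID without an entry in the GID map is reported as the kernel's overflow GID (65534 by default, which many images name `nogroup`). The sketch below shows that lookup; the map and the host GID 2001 standing in for `sysadm` are hypothetical values, not taken from the session above.

```python
from typing import List, Tuple

# Default value of /proc/sys/kernel/overflowgid.
OVERFLOW_GID = 65534

# Hypothetical map in gid_map column order: (inside, host, length).
# The single-GID entry mirrors what --userns=keep-id does for the user's own GID.
EXAMPLE_GID_MAP: List[Tuple[int, int, int]] = [
    (0, 100000, 5829),
    (5829, 5829, 1),
    (5830, 105829, 59707),
]


def gid_as_seen_in_namespace(host_gid: int,
                             gid_map: List[Tuple[int, int, int]],
                             overflow: int = OVERFLOW_GID) -> int:
    """Translate a host GID to the value reported inside the namespace."""
    for inside, outside, count in gid_map:
        if outside <= host_gid < outside + count:
            return inside + (host_gid - outside)
    # Unmapped GIDs are not dropped; they are displayed as the overflow GID.
    return overflow
```

Under these assumptions the user's own GID 5829 is reported unchanged, while every unmapped supplementary group collapses to 65534, which would match the single `nogroup` entry in the `id` output.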

May 18 '22 20:05 paulraines68

It is not available on RHEL7. I think you need to specify the --runtime option to podman before the run subcommand, like podman --runtime=... run ...

May 18 '22 20:05 giuseppe

Unfortunately still not quite right:

$ echo here > /tmp/gptest/iamhere.txt
$ mkdir /tmp/gptest/subdir
$ ls -ald /tmp/gptest/subdir
drwxrwsr-x. 2 raines sysadm 4096 May 18 16:46 /tmp/gptest/subdir
$ podman --runtime=/usr/bin/crun run -it --rm --annotation=run.oci.keep_original_groups=1 --userns=keep-id --group-add=keep-groups -v /tmp/gptest:/gptest b1b6387124d9 /bin/bash
raines@6bcd8fc2304e:/$ groups
raines nogroup
raines@6bcd8fc2304e:/$ ls -ld /gptest
drwxrws---. 2 nobody nogroup 4096 May 18 19:44 /gptest
raines@6bcd8fc2304e:/$ cd /gptest
raines@6bcd8fc2304e:/gptest$ ls
ls: cannot open directory '.': Permission denied
raines@6bcd8fc2304e:/gptest$ echo foobar > foobar.txt
bash: foobar.txt: Permission denied
raines@6bcd8fc2304e:/gptest$ cat iamhere.txt
here
raines@6bcd8fc2304e:/gptest$ echo too >> iamhere.txt
raines@6bcd8fc2304e:/gptest$ cat iamhere.txt
here
too
raines@6bcd8fc2304e:/gptest$ cd subdir
raines@6bcd8fc2304e:/gptest/subdir$ ls
ls: cannot open directory '.': Permission denied
raines@6bcd8fc2304e:/gptest/subdir$ ls -ald /gptest/subdir
drwxrwsr-x. 2 raines nogroup 4096 May 18 20:46 /gptest/subdir

So actually it is the 'x' bit that works for the cd, but 'r' and 'w' do not. Yet one can read and write to existing files in the dir. Really weird.

May 18 '22 20:05 paulraines68
