bubblewrap
Add an option to disable nested user namespaces by setting limit to 1
Some use-cases of bubblewrap want to ensure that the subprocess can't
further re-arrange the filesystem namespace, or do other more complex
namespace modification. For example, Flatpak wants to prevent sandboxed
processes from altering their /proc/$pid/root/.flatpak-info, so that
/.flatpak-info can safely be used as an indicator that a process is part
of a Flatpak app.
This approach was suggested by lukts30 on containers/bubblewrap#452. The sysctl-controlled maximum numbers of namespaces are themselves namespaced, so we can disable nested user namespaces by setting the limit to 1 and then entering a new, nested user namespace. The resulting process loses its privileges in the namespace where the limit was set to 1, so it is unable to move the limit back up.
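The reasoning above can be illustrated with a small toy model. This is only a sketch, written in Python for brevity rather than in bubblewrap's C. The `UserNS` class, its fields, and the single-user accounting are hypothetical names and simplifications introduced here; they are not kernel or bubblewrap APIs. The model assumes that creating a user namespace is charged against the limit of the namespace it is created in and of every ancestor, and that a new namespace starts with an effectively unlimited limit of its own.

```python
import errno
from typing import Iterator, Optional

INT_MAX = 2**31 - 1  # assumption: a fresh namespace starts with no practical limit


class UserNS:
    """Toy stand-in for a user namespace (illustrative model only)."""

    def __init__(self, parent: Optional["UserNS"] = None) -> None:
        self.parent = parent
        # Models this namespace's own /proc/sys/user/max_user_namespaces value.
        self.max_user_namespaces = INT_MAX
        # Number of user namespaces charged to this namespace so far.
        self.count = 0

    def _self_and_ancestors(self) -> Iterator["UserNS"]:
        ns: Optional["UserNS"] = self
        while ns is not None:
            yield ns
            ns = ns.parent

    def unshare_user(self) -> "UserNS":
        """Model unshare(CLONE_NEWUSER) called by a process living in `self`."""
        chain = list(self._self_and_ancestors())
        # Every level must still have room, otherwise the whole call fails.
        for ns in chain:
            if ns.count >= ns.max_user_namespaces:
                raise OSError(errno.ENOSPC, "user namespace limit reached")
        for ns in chain:
            ns.count += 1
        return UserNS(parent=self)


if __name__ == "__main__":
    host = UserNS()
    outer = host.unshare_user()    # namespace in which the limit is set
    outer.max_user_namespaces = 1  # done while still privileged in `outer`
    inner = outer.unshare_user()   # uses up the single allowed slot
    try:
        inner.unshare_user()       # what a sandboxed payload might attempt
    except OSError as exc:
        print("nested unshare failed:", errno.errorcode[exc.errno])
```

In this model the payload runs in `inner`, where it can only change `inner`'s own limit. The limit that blocks it lives in `outer`, which is why dropping privileges in `outer` makes the restriction stick.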
This still needs an automated test, and needs testing on a system where setuid bwrap is required (e.g. Debian 10).
Note that this would also block exploiting vulnerabilities like CVE-2022-34918 without relying on some form of syscall filtering. This prevents processes inside the "sandbox" from using exploitable kernel code that is guarded only by ns_capable checks, without disabling unprivileged user namespaces globally.
@lukts30 There is at least one such vulnerability almost every month; I have a non-exhaustive list at https://github.com/netblue30/firejail/issues/4939#issuecomment-1072662932.
Note that this would also block exploiting vulnerabilities like https://github.com/advisories/GHSA-9v26-h3ph-p8v7 without relying on some form of syscall filtering
That's almost exactly my motivation for this: at the moment, Flatpak has seccomp-based syscall filtering as an essential part of its security model (in order to stop container payloads from entering new user namespaces, which it has to prevent because that would subvert its idea of identity), and I want to be able to stop doing that, at least for some apps.
I primarily want this because we can get stronger guarantees in user-space if entering a new userns is prevented, and you primarily want this in order to reduce kernel attack surface, but the same mechanism helps both.
This still needs an automated test
I added one.
needs testing on a system where setuid bwrap is required (e.g. Debian 10)
On such systems, this feature doesn't work, because the unprivileged user doesn't have permission to set the maximum number of namespaces. If Flatpak (or another bubblewrap user) uses this feature, it will have to do it conditionally, similar to how it already handles various other things that don't work while setuid. This class of systems is increasingly obsolete, so I think this is reasonable.
However, on such systems, the unprivileged user can't create new user namespaces anyway - they can't do it themselves, because the kernel configuration won't allow it, and they also can't do it via the setuid bwrap, because PR_SET_NO_NEW_PRIVS won't allow the setuid bit to take effect.
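A caller that wants to use the feature conditionally, as described above, might do something like the following. This is a hedged sketch, not Flatpak's actual logic: the helper names `bwrap_is_setuid` and `disable_userns_args` are hypothetical, detecting setuid mode by looking at the file's mode bits is an assumption, and pairing `--disable-userns` with `--unshare-user` reflects my understanding that bwrap needs to create a user namespace of its own for this to work.

```python
import os
import stat
from typing import List


def bwrap_is_setuid(bwrap_path: str) -> bool:
    """Return True if the bwrap executable has the setuid bit set."""
    return bool(os.stat(bwrap_path).st_mode & stat.S_ISUID)


def disable_userns_args(is_setuid: bool) -> List[str]:
    """Extra bwrap arguments that block nested user namespaces, if usable."""
    if is_setuid:
        # The limit cannot be set in this mode, and nested user namespaces
        # are already unavailable to the payload on such systems.
        return []
    return ["--unshare-user", "--disable-userns"]
```

The resulting list would be appended to the rest of the bwrap command line, so the same calling code works on both kinds of system.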
I really like this approach, although I feel that there are possible issues with the way flatpak relies on the 2 nested user namespaces, with the --userns2 argument in the parent_expose_pids || parent_share_pids support. In this case we would now have three user namespaces, and we'd enter the 3rd one. Would we need a --userns3?
I'm not sure if that is required, but also I don't really see a reason for having 3 user namespaces in this case. Can we not reuse the already existing unshare (CLONE_NEWUSER), possibly entering that codepath when we did not before by adding a || opt_disable_userns to the if?
there are possible issues with the way flatpak relies on the 2 nested user namespaces, with the --userns2 argument in the parent_expose_pids || parent_share_pids support. In this case we would now have three user namespaces, and we'd enter the 3rd one
That's a good point, I'm not sure how this change would interact with that.
I updated this to set the limits inside the code that already created the second userns. I believe that will be enough to avoid the issues with flatpak I mentioned. I'll do some testing with that.
I also added an "incompat error" for disable-userns and --userns-block-fd, because that doesn't seem to work together.
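The incompatibility check mentioned above amounts to rejecting one combination of options up front. A minimal sketch of that kind of validation, in Python rather than bubblewrap's C; the function name, parameter names, and the error wording are made up for illustration and are not taken from the patch:

```python
from typing import Optional


def check_userns_options(disable_userns: bool,
                         userns_block_fd: Optional[int]) -> None:
    """Reject option combinations that are known not to work together."""
    if disable_userns and userns_block_fd is not None:
        # Illustrative wording only; the real message may differ.
        raise ValueError(
            "--disable-userns is not compatible with --userns-block-fd")
```

Failing early like this gives the caller a clear error instead of a sandbox that silently lacks the restriction.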
Playing with this in flatpak, there is a general issue with the "expose-pid" and "share-pid" sandbox options: they pass --userns {parent-pid} (as well as --userns2), which is incompatible with this form of --disable-userns. However, I guess that isn't really a problem, as we will be putting the sandboxed thing in a userns that has the right limits anyway. I'll verify this works by temporarily disabling the seccomp-based namespace limits.
I tested this, and it seems to work. However, when testing writes to /proc/sys/user/max_user_namespaces I always get permission errors unless I also add --cap-add all. So I guess in practice, in many cases we don't need the recursive namespaces. It doesn't hurt to keep it though.
I made a WIP flatpak PR to use this.
@smcv Can you review this to make sure it looks ok to you? Maybe test with e.g. the steam sandbox stuff.
Maybe test with e.g. the steam sandbox stuff.
Yes, it works. With https://github.com/flatpak/flatpak/pull/5084 at commit e5c7e25, which includes this branch at 11d5339, I can run Steam Flatpak, and also run a game in the sub-sandbox.
This lgtm.