[Question] bwrap in LXC

Open smtalk opened this issue 5 years ago • 23 comments

Hello,

Is bwrap suitable for sandboxing apps/users in LXC environment? If yes - any special flag for it?

# su - test -s /usr/bin/bwshell
Last login: Mon Mar 30 20:30:16 EEST 2020 on pts/2
bwrap: pivot_root: Permission denied
[pid 14230] pivot_root("/tmp", "oldroot") = -1 EACCES (Permission denied)
[pid 14230] write(2, "bwrap: ", 7bwrap: )      = 7
[pid 14230] write(2, "pivot_root", 10pivot_root)  = 10
[pid 14230] write(2, ": Permission denied\n", 20: Permission denied

smtalk avatar Mar 30 '20 17:03 smtalk

LXC probably doesn't leave it with enough privileges to run successfully. In general, "nesting" containers is harder to do successfully or securely than creating a single level of container on a "bare metal" machine or VM.

smcv avatar Mar 30 '20 17:03 smcv

@smcv yes, I was just wondering whether bwrap could be used universally for a jailed shell :) Even if it's not as "secure" as jailing in KVM/Xen or on bare metal.

smtalk avatar Mar 30 '20 18:03 smtalk

Related issue: https://gitlab.steamos.cloud/steamrt/steam-runtime-tools/-/issues/35 (Recent versions of Steam Proton are failing in LXC, despite earlier versions working fine, due to the addition of Pressure Vessel, which uses bubblewrap.)

foresto avatar Aug 05 '21 08:08 foresto

LXC probably doesn't leave it with enough privileges to run successfully. In general, "nesting" containers is harder to do successfully or securely than creating a single level of container

LXC explicitly supports nested containers. Here is the relevant LXC container config option:

lxc.apparmor.profile = lxc-container-default-with-nesting

The bubblewrap errors that I am encountering are due to the above apparmor profile being permissive enough for nested LXC containers, but not permissive enough for whatever system calls bubblewrap is attempting on Proton's behalf. That apparmor profile isn't static, though; we can configure whatever rules we want.

Given that LXC can already nest containers if the appropriate apparmor profile is used, it seems to me that there should be a way to make bubblewrap and LXC cooperate. (And if that can be done, it also seems there should be a solution for Steam & Proton.)

foresto avatar Aug 05 '21 09:08 foresto

The bubblewrap errors that I am encountering are due to the above apparmor profile being permissive enough for nested LXC containers, but not permissive enough for whatever system calls bubblewrap is attempting on Proton's behalf. That apparmor profile isn't static, though; we can configure whatever rules we want.

As a step towards this, please try to get it to work with the LXC container being completely unconfined. If that can't work, then it definitely won't work with AppArmor restrictions.

I think there are probably two sides to this. One is that the LXC container can apply AppArmor (and maybe seccomp?) restrictions that prevent bubblewrap from doing its job; you can avoid this factor by making the LXC container unconfined. After you have a proof-of-concept with it unconfined, we can either allow more operations in the AppArmor profile, or potentially do things slightly differently in bubblewrap so that it is only doing things that LXC's AppArmor profile would allow.

The other is that running bwrap in a chroot (#135) is known not to work, because the chroot breaks one of the conditions for pivot_root(); LXC might be suffering from something similar. The conditions for a successful pivot_root() are not obvious, and when they are not met the only diagnostic is EINVAL, so it is not straightforward to determine what bubblewrap should be doing differently here (I've tried and failed in the past).

smcv avatar Aug 05 '21 09:08 smcv

@smcv I'd like to investigate further, but running Steam with an unconfined apparmor profile defeats the purpose of my running it in LXC at all. For the sake of testing without having to poke giant holes in my sandbox, can you tell me a bubblewrap command line I could use to test this chroot issue?

Also, what is creating the chroot you're referring to? Steam? If the chroot syscall turns out to be a real blocker for bwrap, couldn't it be replaced with a different approach, like a mount namespace?

foresto avatar Aug 05 '21 09:08 foresto

For the sake of testing without having to poke giant holes in my sandbox, can you tell me a bubblewrap command line I could use to test this chroot issue?

Any bubblewrap command line would do. The simplest is bwrap --dev-bind / / true, or for something more thorough, you could run bubblewrap's own test suite (./autogen.sh && make && make check).
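For anyone who wants to script that check, here is a minimal sketch built around the same bwrap --dev-bind / / true test. The function name probe_bwrap and its optional first argument (which command to test) are made up for illustration; only the bwrap invocation itself comes from this thread.

```shell
# Hypothetical helper: report whether bubblewrap can set up a sandbox here.
# $1 (optional) is the bwrap command to test; defaults to "bwrap" from PATH.
probe_bwrap() {
    probe_bin=${1:-bwrap}
    if ! command -v "$probe_bin" >/dev/null 2>&1; then
        echo "bwrap not found"
        return 2
    fi
    # Simplest possible invocation: bind / onto / and run true(1).
    # stderr is captured so the failing syscall (e.g. pivot_root) is reported.
    if probe_err=$("$probe_bin" --dev-bind / / true 2>&1); then
        echo "bwrap works"
    else
        echo "bwrap failed: $probe_err"
        return 1
    fi
}
```

Calling probe_bwrap inside the container prints either "bwrap works" or the error bwrap wrote to stderr, which is usually enough to tell a pivot_root denial from a mount denial.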

Also, what is creating the chroot you're referring to? Steam?

No, it's the equivalent of your use of LXC. Most Steam users run it on "the real system" (for which bubblewrap is fine), but some people run Steam in a LXC container, or in a Docker container, or in schroot (which uses chroot(2) internally), or some similar environment, and that doesn't currently work in all cases.

Similarly, most people who run bubblewrap for other purposes (Flatpak, WebKitGTK, libgnome-desktop, etc.) are running it on "the real system", but some people try to run it inside a LXC container, inside a Docker container, or in schroot, and that doesn't currently work.

bubblewrap does create a new mount namespace, but it needs to use either pivot_root(2) or chroot(2) to get its root directory to be the new root that it has created. It currently uses pivot_root(2) for that.

When Steam games run under the Steam container runtime or recent Proton versions, that is exactly a mount namespace, using bubblewrap. Steam doesn't use chroot or pivot_root itself.

smcv avatar Aug 05 '21 10:08 smcv

As a step towards this, please try to get it to work with the LXC container being completely unconfined. If that can't work, then it definitely won't work with AppArmor restrictions.

Okay, I made another (unprivileged) lxc container and configured it with lxc.apparmor.profile = unconfined. Then, inside the container:

$ bwrap --dev-bind / / true && echo it works
it works

I think that means there should be a solution here, if we can find the minimal set of apparmor permissions (or bwrap changes) to get it working without the container running unconfined. Right?

What would you suggest next?

foresto avatar Aug 12 '21 02:08 foresto

Progress!

The following steps got bwrap to work in an (unprivileged) lxc container:

  • Set this key in the lxc container profile: lxc.apparmor.profile = lxc-container-bwrap
  • cp /etc/apparmor.d/lxc/lxc-default-with-nesting /etc/apparmor.d/lxc/lxc-bwrap
  • Change the profile name in the new file from lxc-container-default-with-nesting to lxc-container-bwrap
  • Add these rules to the new profile:
    # bwrap support
    pivot_root oldroot=/tmp/oldroot/ /tmp/,
    pivot_root /newroot/,
    mount options=rbind /oldroot/ -> /newroot/,
    mount options=rbind /tmp/newroot/ -> /tmp/newroot/,
    mount options=(remount,bind,nosuid) options in (relatime) -> /newroot/{,**},
    mount options=rprivate -> /oldroot/,
    
  • If you want Steam's pressure-vessel to work, add these rules as well (and consider choosing a profile name like lxc-container-steam):
    # steam pressure-vessel bwrap support
    mount options=rbind -> /newroot/**,
    mount options=(remount,bind,nosuid,nodev) options in (noexec,relatime,ro) -> /newroot/{,**},
    
  • Load the new LXC AppArmor profile into the kernel: apparmor_parser -r /etc/apparmor.d/lxc-containers
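The copy, rename and add-rules steps above could be scripted roughly as follows. This is only a sketch: the function names are invented, it assumes the profile's closing brace sits at the start of a line, and it inserts the rules just before the last such brace. The rules themselves are the ones listed above and have only been tried on Ubuntu 20.04.

```shell
# Hypothetical helper: print the bwrap rules listed in this comment.
bwrap_rules() {
    cat <<'EOF'
  # bwrap support
  pivot_root oldroot=/tmp/oldroot/ /tmp/,
  pivot_root /newroot/,
  mount options=rbind /oldroot/ -> /newroot/,
  mount options=rbind /tmp/newroot/ -> /tmp/newroot/,
  mount options=(remount,bind,nosuid) options in (relatime) -> /newroot/{,**},
  mount options=rprivate -> /oldroot/,
EOF
}

# Hypothetical helper: derive the new profile file from the nesting profile.
# $1 = source profile file, $2 = destination profile file.
make_bwrap_profile() {
    # Rename the profile, then insert the rules before the last line
    # that starts with a closing brace.
    sed 's/lxc-container-default-with-nesting/lxc-container-bwrap/' "$1" |
        RULES=$(bwrap_rules) awk '
            { lines[NR] = $0; if ($0 ~ /^}/) last = NR }
            END {
                for (i = 1; i <= NR; i++) {
                    if (i == last) print ENVIRON["RULES"]
                    print lines[i]
                }
            }' > "$2"
}
```

Run as root, something like make_bwrap_profile /etc/apparmor.d/lxc/lxc-default-with-nesting /etc/apparmor.d/lxc/lxc-bwrap followed by the apparmor_parser -r command from the last step should reproduce the manual procedure. Review the generated file before loading it.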

I have only tested this on Ubuntu 20.04 LTS so far. Different LXC versions and configurations might require something different. This command is handy for watching AppArmor complaints on a distro that uses systemd: journalctl _TRANSPORT=audit _COMM=bwrap --follow
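When hunting for the minimal rule set, it can help to condense the audit output into a list of what was denied. The sketch below is an assumption-laden filter: summarize_denials is an invented name, and it assumes the usual key="value" layout of AppArmor audit messages with operation= and name= fields, which may differ between kernel and distro versions.

```shell
# Hypothetical helper: read audit messages on stdin and count the distinct
# (operation, target path) pairs that AppArmor denied.
summarize_denials() {
    grep 'apparmor="DENIED"' |
        # The leading space before name= keeps srcname= from matching.
        sed -n 's/.*operation="\([^"]*\)".* name="\([^"]*\)".*/\1 \2/p' |
        sort | uniq -c | sort -rn
}
```

For example, journalctl _TRANSPORT=audit _COMM=bwrap --no-pager | summarize_denials would print one line per denied operation and path, most frequent first.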

I haven't thought deeply about the security implications of these rules, but just seeing it work was an exciting step forward. It might be worth a review from the LXC maintainers, and maybe asking them to include a bubblewrap/steam AppArmor profile alongside the ones they already provide for nested containers and other common cases. This would relieve users from duplicating upstream policy files, and avoid falling out of sync with upstream changes.

Both bwrap and pressure-vessel do more {bind,re,}mount juggling than I would have expected. I tried to minimize the number of AppArmor rules required to accommodate it all, by leaning on globbing and the in conditional operator, but the result is still a bit more verbose than I would like and might not be as restrictive as it should be. I wonder if the bubblewrap and pressure-vessel maintainers could do anything differently to help with this.

foresto avatar Sep 16 '21 01:09 foresto

I think if you were using bubblewrap non-trivially (actually changing processes' view of the filesystem), you would find that the rules you found that you needed for pressure-vessel are also necessary for most (all?) non-trivial uses of bubblewrap. bwrap --dev-bind / / true was intentionally the simplest possible test-case, which only has one user-specified bind-mount, / onto / (implementation detail: this ends up as /oldroot onto /newroot because of the way it works internally), but a less trivial bubblewrap invocation (like the ones done by Flatpak, or by libgnome-desktop's support for sandboxed thumbnailers) is just as complicated as the ones done by pressure-vessel.

I'm somewhat surprised you didn't also need to add options to mount a tmpfs below /newroot.

I wonder if the bubblewrap and pressure-vessel maintainers could do anything differently to help with this.

Everything pressure-vessel does, it does because there is some reason why it has to. I wish it could be simpler, but I don't think it can.

If you look into the code and commit history for either bubblewrap, pressure-vessel, or another user of bubblewrap such as Flatpak, I think you'll see that it's all there for a reason. Simpler would be better, but we do have to make things "as simple as possible, but no simpler".

smcv avatar Sep 16 '21 08:09 smcv

I haven't thought deeply about the security implications of these rules

I'm afraid pivot_root is a known bypass for apparmor rules: https://bugs.launchpad.net/apparmor/+bug/1791711, so allowing it may give a false sense of security.

Generally, protecting something like bubblewrap, which lets you arbitrarily manipulate filesystem paths, is not feasible with something like apparmor, which relies on those paths for its protection.

Maryse47 avatar Sep 17 '21 14:09 Maryse47

I'm afraid pivot_root is a known bypass for apparmor rules: https://bugs.launchpad.net/apparmor/+bug/1791711, so allowing it may give a false sense of security.

Thanks for pointing that out. Given that the current recommendation for running bubblewrap in LXC containers is to use lxc.apparmor.profile = unconfined, I don't think having a bwrap-specific AppArmor profile would be any worse. It would presumably be offered with the same cautions that come with unconfined, thus avoiding a false sense of security, and it should still enforce path-agnostic rules even if exploited, thus being better than running unconfined.

Also, as noted in comment #8 of that report, the AppArmor maintainers intend to fix the pivot_root problem.

In the meantime, I wonder if an AppArmor policy could be crafted that would allow pivot_root only for the container's /usr/bin/bwrap, meaning an attacker would first have to become uid 0 within the LXC container in order to exploit it.

foresto avatar Sep 21 '21 21:09 foresto

I'm afraid pivot_root is a known bypass for apparmor rules

It depends on the AppArmor rules and how they are being used.

If you're using AppArmor in a way that runs all programs in the same mount namespace and completely relies on path-based policies, like the traditional use of /etc/apparmor.d in openSUSE/Ubuntu/Debian to apply rules like "Firefox can run Evince" and "Evince can't read ~/.gnupg", then yes, any filesystem manipulation like pivot_root will defeat that.

If you're using AppArmor as part of lxc, as a way to prevent container breakout by blocking things like mount operations and other VFS manipulation, similar to the way Docker and Flatpak use seccomp, then the path-based parts of AppArmor are hopefully a lot less important, because anything the container shouldn't be able to access shouldn't be visible in the container's filesystem namespace at all. You can't read a file if there is no name you can give that will result in it being opened! That's how Docker and Flatpak manage to do access-control for files despite not using a path-based LSM like AppArmor: they build a filesystem namespace where only the allowed files exist.
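To make the "only the allowed files exist" idea concrete, here is a rough sketch that builds a bwrap argument list exposing nothing but the paths it is given. The function name minimal_sandbox_args is hypothetical, the choice of extra mounts (/proc, /dev, a tmpfs on /tmp) is just one plausible baseline, and paths containing newlines are not handled.

```shell
# Hypothetical helper: print, one per line, bwrap arguments for a sandbox
# whose filesystem contains only the given paths (bound read-only).
minimal_sandbox_args() {
    # Fresh namespaces plus the pseudo-filesystems most programs expect.
    printf '%s\n' --unshare-all --die-with-parent \
        --proc /proc --dev /dev --tmpfs /tmp
    # Anything not listed here simply has no name inside the sandbox.
    for sandbox_path in "$@"; do
        printf '%s\n' --ro-bind "$sandbox_path" "$sandbox_path"
    done
}
```

In bash, the output can be collected with mapfile -t args < <(minimal_sandbox_args /usr /etc) and passed on as bwrap "${args[@]}" followed by the command to run. On a merged-/usr system you would probably also need something like --symlink usr/bin /bin and --symlink usr/lib /lib before a shell will start. A file such as ~/.gnupg is then unreachable because it was never bound in, not because a path rule forbids it.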

smcv avatar Sep 22 '21 11:09 smcv

I'm somewhat surprised you didn't also need to add options to mount a tmpfs below /newroot.

Ah, I see this is because /etc/apparmor.d/abstractions/lxc/container-base already allows that.

smcv avatar Sep 22 '21 11:09 smcv

If you're using AppArmor as part of lxc, as a way to prevent container breakout by blocking things like mount operations and other VFS manipulation, similar to the way Docker and Flatpak use seccomp, then the path-based parts of AppArmor are hopefully a lot less important, because anything the container shouldn't be able to access shouldn't be visible in the container's filesystem namespace at all

The thing is, lxc apparmor profiles rely on apparmor blocking access to various sensitive files which are visible in the container namespace. With bubblewrap you can mount those files at a different path in the container, which makes the apparmor protection useless.

Maryse47 avatar Sep 22 '21 12:09 Maryse47

The thing is, lxc apparmor profiles rely on apparmor blocking access to various sensitive files which are visible in the container namespace

If that's the case, then it is not possible to do this securely, and you'll need to either:

  • use a different container technology that does not rely on path-based access control (such as Docker or Podman) for container payloads that need the ability to rearrange the namespace; or
  • use a container technology that forbids namespace rearrangement but does allow limited application control over creation of extra containers in parallel (such as Flatpak "sub-sandboxing", which pressure-vessel has specific code to make use of); or
  • use a weaker AppArmor profile, and live with the fact that if you get root in the container, you get root in real life (sometimes summarized as "containers don't contain")

Flatpak protects a few sensitive files in /proc//sys by mounting an inaccessible read-only file over the top, and I think Docker does the same, but it looks as though that approach wouldn't work for lxc, because /etc/apparmor.d/abstractions/lxc/container-base allows umount.

If you are using lxc to confine Steam, I'd recommend the unofficial Flatpak app as an alternative to that. It has some limitations that make Valve hesitant to recommend it in general or treat it as official, but running Steam inside lxc will already have most of those limitations anyway. Using fast user switching to run Steam as an unprivileged user, either on its own or combined with Flatpak, is another way Steam can be given fewer privileges.

smcv avatar Sep 22 '21 13:09 smcv

Feels like a bunch of the issues here may have been related to the direct use of LXC rather than something like LXD which knows to configure a variety of LXC features to make nesting work nicely.

Here is an example on LXD which seems to behave just fine:

stgraber@castiana:~$ lxc launch images:ubuntu/21.10 u1 -c security.nesting=true
Creating u1
Starting u1
stgraber@castiana:~$ lxc exec u1 bash
root@u1:~# apt install bubblewrap
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  bubblewrap
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 39.9 kB of archives.
After this operation, 116 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu impish/main amd64 bubblewrap amd64 0.4.1-3 [39.9 kB]
Fetched 39.9 kB in 0s (232 kB/s)   
debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend requires a screen at least 13 lines tall and 31 columns wide.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install the Term::ReadLine module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.32.1 /usr/local/share/perl/5.32.1 /usr/lib/x86_64-linux-gnu/perl5/5.32 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl-base /usr/lib/x86_64-linux-gnu/perl/5.32 /usr/share/perl/5.32 /usr/local/lib/site_perl) at /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7, <> line 1.)
debconf: falling back to frontend: Teletype
Selecting previously unselected package bubblewrap.
(Reading database ... 14498 files and directories currently installed.)
Preparing to unpack .../bubblewrap_0.4.1-3_amd64.deb ...
Unpacking bubblewrap (0.4.1-3) ...
Setting up bubblewrap (0.4.1-3) ...
sysctl: permission denied on key "kernel.unprivileged_userns_clone", ignoring
root@u1:~# vi /usr/local/bin/bwshell
root@u1:~# chmod +x /usr/local/bin/bwshell
root@u1:~# su - ubuntu -s /usr/local/bin/bwshell
bwrap-demo$ 

Everything that LXD does can of course be done by hand with LXC, but that may require quite a lot of care, especially when dealing with something as tricky as nesting. The exact config may also be quite dependent on the exact set of kernel features and host OS configuration.

In this case, the main things to get right are:

  • Run the LXC container unprivileged; privileged containers cannot do any of this safely at all and have a bunch of restrictions around nesting
  • /proc, /sys and /dev setup modes
  • Use AppArmor namespacing instead of a simple nesting profile (allows full use of AppArmor in the container)
  • Relax host side apparmor profile to allow nesting (primarily mounting new instances of /proc, /sys and pivot)
  • Pre-mount hidden non-overmounted copies of /proc and /sys (to keep the kernel's overmounting protection from kicking in)

stgraber avatar Sep 22 '21 22:09 stgraber

I can't speak for @smtalk, who opened this issue, but I can respond to these suggestions with respect to my own use case...

@smcv suggested:

If you are using lxc to confine Steam, I'd recommend the unofficial Flatpak app as an alternative to that.

AFAICT, Flatpak allows images to define their own security policy, which is a curious choice that effectively makes externally-built ones not confined at all; I think I would have to either build my own Steam flatpaks (a hassle) or diligently review and override the permissions of every community-built release (more hassle). Also, Flatpak doesn't seem to make modifying a runtime environment or running multiple applications in the same container particularly easy, so operations like that would mean still more hassles compared to lxc.

On the other hand, perhaps those ongoing hassles would be tolerable (at least until AppArmor fixes the pivot_root issue) if accepting them meant my Steam games would work again.

The possible blocker that comes to mind is ALSA. Last time I considered Flatpak for Steam, it didn't support ALSA-only systems without resorting to --device=all, but I see they finally merged a fix for that (flatpak/flatpak#3663). Last time I tried the Freedesktop runtime, on which I think the Steam flatpak depends, some of its ALSA libs were broken, so nontrivial ALSA functionality didn't work even with a patched Flatpak. Maybe that has been fixed by now as well?

I suppose I could give it another try, and at least get an updated view of what problems remain.

Does the Steam flatpak require a particular version of Flatpak, or a particular version of bubblewrap on the host system?

foresto avatar Sep 29 '21 01:09 foresto

@stgraber suggested:

Feels like a bunch of the issues here may have been related to the direct use of LXC rather than something like LXD which knows to configure a variety of LXC features to make nesting work nicely.

As far as I know, LXD didn't exist when I started using Steam in an LXC container. (Inspired by your blog, by the way.) At least, I wasn't aware of its existence back then.

I have considered moving to LXD, but as I recall, I was told it had no way for an unprivileged container to run on (a subtree of) the host's filesystem. I use that functionality in LXC. It makes a number of things convenient and efficient, such as using GUI tools to manage the container's files without having to install GUI tools in the container, and allowing a contained program to communicate with a program on the host via a unix domain socket. It was probably a couple of years ago when I last asked, though; has this filesystem limitation been lifted since then?

If not, are the steps that LXD takes to make LXC nesting work nicely available in a format that could be (relatively easily) copied to an LXC-only system?

foresto avatar Sep 29 '21 01:09 foresto

@foresto suggested:

It makes a number of things convenient and efficient, such as using GUI tools to manage the container's files without having to install GUI tools in the container.

Sorry to barge in on an unrelated matter: would you mind naming the GUI tools you use to manage the container's files?

gitfan2 avatar Nov 15 '21 09:11 gitfan2

Sorry to barge in on an unrelated matter: would you mind naming the GUI tools you use to manage the container's files?

When the guest's filesystem is a subtree of the host's, (which LXC allows), all the software running on the host system can directly access guest files. That includes shells, scripts, desktop file managers, save game editors... everything.

foresto avatar Nov 15 '21 16:11 foresto

A quick note: bubblewrap works perfectly in a "nested LXC" in Proxmox 6.4.
Obviously, it may or may not work in other virtualisation environments. Thanks to @smtalk, the OP, for his invaluable guidance in resolving the issue.

gitfan2 avatar Jul 26 '22 11:07 gitfan2

@gitfan2 For the sake of others who find their way here, what steps were involved in that guidance?

foresto avatar Jul 26 '22 18:07 foresto

@foresto Essentially, Proxmox is creating the structure in LXC that enables the nesting and is generating an apparmor profile to support it. Here are the relevant lines from the profile.

### Configuration: nesting
pivot_root,
ptrace,
signal,

deny /dev/.lxc/proc/** rw,
deny /dev/.lxc/sys/** rw,

mount fstype=proc -> /usr/lib/*/lxc/**,
mount fstype=sysfs -> /usr/lib/*/lxc/**,

# TODO: There doesn't seem to be a way to ask for:
# mount options=(ro,nosuid,nodev,noexec,remount,bind),
# as we always get mount to $cdir/proc/sys with those flags denied
# So allow all mounts until that is straightened out:
mount,

Conclusion: In a nested LXC in Proxmox 6.4, bubblewrap 0.5+ works flawlessly and creates the required sandbox for users.

In the same apparmor profile, there's an additional section for nesting LXDs.

# Allow nested LXD
mount none -> /var/lib/lxd/shmounts/,
mount /var/lib/lxd/shmounts/ -> /var/lib/lxd/shmounts/,
mount options=bind /var/lib/lxd/shmounts/** -> /var/lib/lxd/**,

gitfan2 avatar Nov 15 '22 03:11 gitfan2

I don't think there's an actionable request for a change to be made in bwrap here, so I'm closing this issue.

smcv avatar Nov 15 '22 17:11 smcv

I'd like to know if someone has a solution for this -- running flatpak/bwrap inside an LXC sandbox is something I would like to be able to do (I was able to do it in LXD using its nesting feature, but not in LXC, even with the apparmor changes in this discussion).

rcarmo avatar Jan 31 '24 10:01 rcarmo