sysbox
Mounting host unix sockets to Sysbox containers doesn't always work (e.g. X11 socket forwarding for GUI applications)
Steps to reproduce
In a typical Linux graphical environment, you can open a GUI application to run in a Docker container by forwarding a few files and environment variables from the host to the container, including the X11 unix socket in /tmp/.X11-unix:
$ xhost +local:root
$ docker run -v $XAUTHORITY:/root/.Xauthority -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY -e XAUTHORITY=/root/.Xauthority alpine:3.13 sh -c "apk add xeyes && xeyes"
[...]
[GUI app opens]
However, this doesn't work in Sysbox when it runs in auto userns ID mapping mode, i.e. the following fails:
$ xhost +local:root
$ docker run --runtime=sysbox-runc -v $XAUTHORITY:/root/.Xauthority -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY -e XAUTHORITY=/root/.Xauthority alpine:3.13 sh -c "apk add xeyes && xeyes"
[...]
Error: Can't open display: :0
Explanation
The problem happens because the X11 socket (typically /tmp/.X11-unix/X0) is not connectable within the Sysbox container:
$ docker run --init -v /tmp/.X11-unix:/tmp/.X11-unix alpine:3.13 sh -c "apk add netcat-openbsd && nc -U /tmp/.X11-unix/X0"
[No output - OK]
$ docker run --runtime=sysbox-runc --init -v /tmp/.X11-unix:/tmp/.X11-unix alpine:3.13 sh -c "apk add netcat-openbsd && nc -U /tmp/.X11-unix/X0"
nc: unix connect failed: Connection refused
The problem is due to shiftfs not working with UNIX sockets, i.e. UNIX sockets are never connectable on a shiftfs mount. This explains why the failure only happens in auto userns ID mapping mode. The same behavior can be reproduced on the host without involving Sysbox, by mounting a shiftfs mark and trying to connect to the socket through it:
$ mkdir /tmp/.X11-unix-shiftfs
$ sudo mount -t shiftfs -o mark /tmp/.X11-unix /tmp/.X11-unix-shiftfs/
$ nc -U /tmp/.X11-unix/X0
[No output - OK]
$ nc -U /tmp/.X11-unix-shiftfs/X0
nc: unix connect failed: Connection refused
nc: /tmp/.X11-unix-shiftfs/X0: Connection refused
I haven't been able to find any information or functional reason why shiftfs breaks UNIX sockets (e.g. a security reason), though I'm not well-versed in all the considerations regarding shiftfs. It may just be a limitation.
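For anyone who wants to repeat the check without netcat, the following is a minimal sketch of the same connectivity test in Python. The helper name can_connect is made up for illustration, and the two paths are the ones used in the transcript above; adjust them to your setup.

```python
import socket


def can_connect(path: str, timeout: float = 2.0) -> bool:
    """Return True if a stream connection to the UNIX socket at `path` succeeds.

    Hypothetical helper (not part of Sysbox); roughly equivalent to `nc -U path`.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        try:
            s.connect(path)
            return True
        except OSError:
            # Covers "Connection refused", a missing path, permission errors, etc.
            return False


if __name__ == "__main__":
    # Paths taken from the transcript above: the real socket and the shiftfs view of it.
    for p in ("/tmp/.X11-unix/X0", "/tmp/.X11-unix-shiftfs/X0"):
        print(p, "connectable" if can_connect(p) else "NOT connectable")
```

If the explanation above holds, the first path should report connectable and the shiftfs path should not.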
Workaround
One can work around the issue by chown'ing /tmp/.X11-unix/ to the "sysbox" sub{u,g}id, e.g. as follows assuming it's set to 100000 in /etc/sub{u,g}id:
$ sudo chown 100000:100000 /tmp/.X11-unix/
This works because Sysbox will not mount shiftfs on the directory, but rather do a normal bind-mount, if it detects that the directory is already owned by the Sysbox {U,G}id. Even though this is a dirty hack, it doesn't appear to have any negative effect on either the host or the containers on a typical single-user setup.
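To avoid hard-coding 100000, the ID to chown to can be read from /etc/subuid (and likewise /etc/subgid), where each line has the form name:start:count. A minimal sketch follows; the function name subid_start and the sample file contents (including the 65536 count) are assumptions for illustration.

```python
from typing import Optional


def subid_start(subid_text: str, user: str = "sysbox") -> Optional[int]:
    """Return the first subordinate ID assigned to `user`, or None if there is no entry.

    `subid_text` is the content of /etc/subuid or /etc/subgid (lines of name:start:count).
    Hypothetical helper, not part of Sysbox.
    """
    for line in subid_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) == 3 and fields[0] == user:
            return int(fields[1])
    return None


if __name__ == "__main__":
    # Made-up sample contents; on a real host, read /etc/subuid instead.
    sample = "alice:231072:65536\nsysbox:100000:65536\n"
    print(subid_start(sample))
```

The returned value is what would go into the chown command above, for both the user and the group.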
Possible fix
I was looking into potential explanations or ways to make UNIX sockets work in shiftfs, and while doing so I found that in the past overlayfs mounts also had trouble connecting to UNIX sockets, which has since been solved.
It appears that for the UNIX socket code to work within the Linux kernel, it needs to see the real/underlying inode for the UNIX socket, i.e. a "cloned/mirrored" inode does not work. Nowadays it appears that overlayfs achieves this by having some logic to expose the real/underlying inode instead of a "cloned/mirrored" inode. At first glance I don't think that can even be done for shiftfs since the UID/GID is part of the inode so if we'd expose the real/underlying inode it would not have the shifted UID/GID.
However it appears that it's possible to "recycle" an old overlayfs fix for this by reverting the following commit (there's also some enlightening discussion in the commit message and associated commits):
https://github.com/torvalds/linux/commit/beef5121f3a4d1566c8ab8cd99b4e001862048cf
This is my attempt to reintroduce it for the 5.11.x kernel branch. The permission check is a bit scary though, so it'd be useful if this can be properly reviewed:
---
net/unix/af_unix.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5a31307ceb76..0a94087c0240 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -319,7 +319,7 @@ static struct sock *unix_find_socket_byinode(struct inode *i)
&unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) {
struct dentry *dentry = unix_sk(s)->path.dentry;
- if (dentry && d_backing_inode(dentry) == i) {
+ if (dentry && d_real_inode(dentry) == i) {
sock_hold(s);
goto found;
}
@@ -935,8 +935,8 @@ static struct sock *unix_find_other(struct net *net,
err = kern_path(sunname->sun_path, LOOKUP_FOLLOW, &path);
if (err)
goto fail;
- inode = d_backing_inode(path.dentry);
- err = inode_permission(inode, MAY_WRITE);
+ inode = d_real_inode(path.dentry);
+ err = inode_permission(d_backing_inode(path.dentry), MAY_WRITE);
if (err)
goto put_fail;
@@ -1066,7 +1066,7 @@ static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
if (sun_path[0]) {
addr->hash = UNIX_HASH_SIZE;
- hash = d_backing_inode(path.dentry)->i_ino & (UNIX_HASH_SIZE - 1);
+ hash = d_real_inode(path.dentry)->i_ino & (UNIX_HASH_SIZE - 1);
spin_lock(&unix_table_lock);
u->path = path;
list = &unix_socket_table[hash];
--
2.31.1
Other possible fixes
In case the kernel fix isn't accepted (e.g. allowing UNIX sockets through shiftfs actually introduces some security problem, it's a WONTFIX, etc.), I think this could also be worked around in a cleaner way within Sysbox itself by adding some way to avoid the shiftfs mount for this directory. This could be done either generically, by providing some way to explicitly ask for a directory to be bind-mounted rather than shiftfs-mounted (e.g. by using docker labels), or by just treating a /tmp/.X11-unix mount as a special case and not applying shiftfs to it.
Awesome problem description @joanbm, thank you for that! We will look into this one asap.
Might I suggest changing this issue title to Mounting unix sockets doesn't work with Sysbox in auto userns ID mapping mode? This issue is much broader than just mounting X11 sockets or running GUI applications.
I modified this issue's title to reflect a bit more accurately the problem: mounting host sockets into a Sysbox container doesn't always work.
Specifically, it does not work when the host socket is owned by root on the host, as this forces Sysbox to mount shiftfs on it, which does not work yet (as @joanbm explained above).
On the other hand, if the socket on the host is owned by the same host user-ID mapped to the Sysbox container's root user via the Linux user-namespace, then the mount of the socket does work because Sysbox won't mount shiftfs on the socket and the root user inside the container will have permission to access the socket.
As also said in #383, this issue should be solved. Sharing unix sockets is no problem with upcoming Sysbox 0.5.0.
I have tested the current master version of Sysbox / upcoming version 0.5.0. The issue above should be solved with it. Sharing unix sockets is no longer a problem with the new kernel feature of ID-mapped mounts (kernel version >=5.12). The earlier shiftfs approach failed at this.
I am not entirely sure if idmapped mounts are going to fully solve this issue, because the underlying filesystem needs to support them, and tmpfs - which is where the X11 UNIX sockets on /tmp/.X11-unix are usually on - doesn't seem to support it as of 5.16 - or at least I can't mount them using brauner's mount-idmapped.
To check this I've added a tmpfs entry in my /etc/fstab and rebooted. (Before /tmp wasn't a tmpfs in my system).
$ mount | grep /tmp
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noatime,inode64)
$ uname -r
5.15.0-0.bpo.3-amd64
$ sysbox-runc --version
sysbox-runc
edition: Community Edition (CE)
version: 0.5.0-dev
commit: 0ae26fbb42d7aa5b902c7db007a59ac160f751be
built at: Fri Mar 4 14:55:20 UTC 2022
built by: Cesar Talledo
oci-specs: 1.0.2-dev
I still can mount an X unix socket successfully with Sysbox. Compare #452, I've made several tests for Sysbox with X unix sockets.
I just tried it with the latest Sysbox git build, a recent Linux kernel, and no shiftfs nor userns-remap so Sysbox uses idmapped mounts, and the X11 socket problem no longer reproduces and Sysbox doesn't complain...
...but /tmp/.X11-unix doesn't appear to be remapped, but rather mounted as a regular bind mount, as seen in the lack of the idmapped flag in findmnt and the sockets being owned by nobody inside the container:
host$ sysbox-runc --version
sysbox-runc
edition: Community Edition (CE)
version: 0.4.1
commit: c85420f1d1a426949047eb55e4112a3bc310aec2
built at: Di 22. Mär 23:42:14 CET 2022
built by: Joan Bruguera
oci-specs: 1.0.2-dev
host$ uname -a
Linux solpc 5.18.0-rc1-12001-mainline-git-01661-g519129040766 #1 SMP PREEMPT Tue, 22 Mar 2022 22:28:57 +0000 x86_64 GNU/Linux
host$ sudo systemctl start docker
host$ sudo systemctl start sysbox
host$ xhost +local:root
non-network local connections being added to access control list
host$ mkdir testdir
host$ echo hello > testdir/testfile
host$ sudo docker run --rm -it --runtime sysbox-runc -v $HOME/testdir:/testdir -v $XAUTHORITY:/root/.Xauthority -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY -e XAUTHORITY=/root/.Xauthority alpine:3.13
[...]
/ # apk add xeyes findmnt
[...]
/ # xeyes
[WORKS]
/ # findmnt | egrep "X11|testdir"
├─/testdir /dev/mapper/solzealcharm[/Root/home/sol/testdir] btrfs rw,noatime,idmapped,ssd,discard,space_cache,subvolid=3368,subvol=/Root
├─/tmp/.X11-unix tmpfs[/.X11-unix] tmpfs rw,nosuid,nodev,nr_inodes=1048576,inode64
/ # ls -lah /tmp/.X11-unix/
total 0
drwxrwxrwt 2 nobody nobody 80 Mar 22 22:59 .
drwxrwxrwt 1 root root 18 Mar 22 23:01 ..
srwxrwxrwx 1 nobody nobody 0 Mar 22 22:59 X0
srwxrwxrwx 1 nobody nobody 0 Mar 22 22:59 X1
/ # ls -lah /testdir/
total 4K
drwxr-xr-x 1 1200 1200 16 Mar 22 23:00 .
drwxr-xr-x 1 root root 58 Mar 22 23:01 ..
-rw-r--r-- 1 1200 1200 6 Mar 22 23:00 testfile
I haven't followed Sysbox dev. much recently so I'm not sure if that's expected behaviour, I'll try to take a deeper look later.
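As a small aid for repeating this comparison, here is a sketch that checks a mount's option string for the idmapped flag. The function name is_idmapped is made up, and the two option strings are copied from the findmnt output above.

```python
def is_idmapped(mount_options: str) -> bool:
    """Return True if the comma-separated mount options include the `idmapped` flag."""
    return "idmapped" in mount_options.split(",")


if __name__ == "__main__":
    # Option strings copied from the findmnt output above.
    mounts = {
        "/testdir": "rw,noatime,idmapped,ssd,discard,space_cache,subvolid=3368,subvol=/Root",
        "/tmp/.X11-unix": "rw,nosuid,nodev,nr_inodes=1048576,inode64",
    }
    for target, options in mounts.items():
        print(target, "idmapped" if is_idmapped(options) else "plain bind mount")
```

Splitting on commas rather than doing a substring search avoids false matches on option values that merely contain the word.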
Hi @joanbm, @mviereck, thanks for updating the issue.
ID-mapped mounts don't work on tmpfs (yet), so Sysbox won't ID-map-mount if it detects the underlying filesystem is tmpfs. The code is here and here.
Joan got a bit lucky in that his host did not have shiftfs on it; otherwise Sysbox would have tried to mount shiftfs on the socket (i.e., in a host with ID-mapped mounts and shiftfs, Sysbox first tries the former; if it doesn't work, uses the latter).
Regarding:
Before /tmp wasn't a tmpfs in my system
That's the case on my Ubuntu Focal host too; /tmp is not a tmpfs mount.
I tried replicating the same test on a 5.15 kernel (=supports IDMapped mounts) AND shiftfs and /tmp/.X11-unix was still bind-mounted, so the socket worked.
As far as I can see in the code, idmapped-mounts vs shiftfs is a global option, not a per-mount option: https://github.com/nestybox/sysbox-runc/blob/f70caf3131eef07476bcf47bd3956320ab5ce998/libsysbox/sysbox/sysbox.go#L252
If this is the correct behaviour, then this issue should perhaps be considered solved, as on newer kernels, sysbox will never use shiftfs anymore by default.
Other small notes:
- Just to test it, if I start sysbox-mgr with --disable-idmapped-mount it does use shiftfs, both for btrfs and tmpfs.
- /tmp is a tmpfs on Arch where I'm running the tests, but also in Fedora.
Thanks @joanbm for the update.
I think in summary (for Sysbox v0.5.0):
- If the host supports ID-mapped mounts (e.g., kernel >= 5.12), then mounting sockets into the Sysbox container will work fine. The only scenario where it may not work is if the socket is on tmpfs at host level. In this case Sysbox will mount the socket into the container but won't ID-map it (i.e., because ID-mapped mounts don't work on tmpfs yet), so the socket will show up as nobody:nogroup inside the container. Whether the socket is usable from within the container depends on whether it gives "other" users read/write permissions.
- If the host does NOT support ID-mapped mounts but does support shiftfs (e.g., Ubuntu with kernel < 5.12), then mounting sockets into the Sysbox container will not work well: Sysbox will try to mount shiftfs on the socket, and this does not work as described in the first comment of this issue. The only exception is if the socket is chowned at host level to match the uid:gid assigned to the Sysbox container. In this case, Sysbox won't mount shiftfs on the socket. Alternatively, one can disable shiftfs on Sysbox (e.g., via the sysbox-mgr's --disable-shiftfs option), but this may cause other issues (e.g., files showing up as nobody:nogroup inside the container) as shiftfs is in general beneficial.
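The two cases above can be written down as a small decision function. This is only a restatement of the summary for illustration, not Sysbox's actual code; the function name, parameter names, and outcome strings are all made up.

```python
def socket_mount_outcome(
    idmapped_mounts: bool,
    shiftfs: bool,
    socket_on_tmpfs: bool,
    owner_matches_container: bool,
) -> str:
    """Restate the Sysbox v0.5.0 summary above as a decision function (illustrative only)."""
    if idmapped_mounts:
        if socket_on_tmpfs:
            # ID-mapped mounts don't work on tmpfs yet, so the mount is not ID-mapped.
            return "mounted without ID-mapping: nobody:nogroup, usable only if 'other' has rw"
        return "ID-mapped mount: works"
    if shiftfs:
        if owner_matches_container:
            # Socket chowned on the host to the container's uid:gid, so no shiftfs is applied.
            return "no shiftfs on the socket: works"
        return "shiftfs on the socket: connect fails"
    return "not covered by the summary"


if __name__ == "__main__":
    # /tmp/.X11-unix on a tmpfs /tmp, host kernel with ID-mapped mounts.
    print(socket_mount_outcome(True, False, True, False))
    # Root-owned socket on a host with shiftfs only.
    print(socket_mount_outcome(False, True, False, False))
```

The first example matches the X11 case discussed in this thread, where the socket still works because it is world-writable.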
Let's leave this issue open for the time being, as things don't yet work in all cases as we would wish for.
Once Linux adds support for ID-mapped-mounts over tmpfs, we can then close this issue in my view.