user container cannot receive uevents

Open martinpitt opened this issue 3 years ago • 28 comments

Describe the bug Programs in a standard Fedora toolbox don't receive uevents, i.e. udev events about hardware changes through the AF_NETLINK socket.

This is supposed to work with running the container in the host's network namespace (currently --network=host) and bind-mounting the host's dev. Issue #468 also sounds like this actually did work in the past.

Steps how to reproduce the behaviour

  1. toolbox create -r 37; toolbox enter -r 37
  2. sudo dnf install -y /usr/bin/udevadm
  3. Start udevadm monitor --udev once in the toolbox, and once on the host
  4. Trigger some /dev change. Plug in or remove a Yubikey, mouse, or keyboard; or do something like sudo ip link add name e1 type veth or sudo modprobe scsi_debug.

Expected behaviour The host's udev monitor will see the event, like

UDEV  [3612.352892] add      /devices/pseudo_0/adapter0/host0/target0:0:0/0:0:0:0/block/sda (block)

for loading scsi_debug. Ideally, the toolbox should see this as well. This would make it possible to run things like gvfs, sway, calibre, or anything else that needs to react to hardware changes in toolbox.

Actual behaviour udevadm in toolbox does not see any event.

Output of toolbox --version (v0.0.90+) toolbox version 0.0.99.3

Output of podman version

Client:       Podman Engine
Version:      4.3.1
API Version:  4.3.1
Go Version:   go1.19.2
Built:        Fri Nov 11 16:01:27 2022
OS/Arch:      linux/amd64

Podman package info (rpm -q podman) podman-4.3.1-1.fc37.x86_64

Info about your OS Fedora 37 OSTree (Silverblue-ish, but a custom build).

Additional context I'm experimenting with running my whole desktop from toolbox/podman, see https://github.com/martinpitt/swaypod and https://github.com/martinpitt/swaypod/issues/1 .

Perhaps this is doomed to fail, but this issue is pretty much the only blocker, so I'd like to at least understand why :grin: Thanks!

martinpitt avatar Dec 18 '22 11:12 martinpitt

I'm trying to reduce this to a podman command, as I figure that's easier to investigate or possibly reassign to podman.

This also does not work:

podman run -it --rm --privileged --net=host -v /dev/:/dev -v /run/udev:/run/udev registry.fedoraproject.org/fedora:latest bash
dnf install -y systemd-udev
udevadm monitor --udev

Note that toolbox already uses both options, and these are the usual internet recipes for "fix uevents".

However, it does work when running the container as root, i.e. prepend sudo to the podman run. It even works without --privileged, which is good because privileged system containers are way too dangerous:

sudo podman run -it --rm --net=host -v /dev/:/dev -v /run/udev:/run/udev registry.fedoraproject.org/fedora:latest bash

So somehow this is only broken for user containers, even though reading AF_NETLINK isn't privileged at all: udevadm monitor works fine as user, and a lot of desktop functionality (display changes, gvfs etc.) relies on that.

martinpitt avatar Dec 18 '22 12:12 martinpitt

This is supposed to work with running the container in the host's network namespace (currently --network=host) and bind-mounting the host's dev. Issue https://github.com/containers/toolbox/issues/468 also sounds like this actually did work in the past.

The information on this website is misleading.

If you share the network namespace with the host then you will see all uevents that the host sees. Indeed that's independent of whether you're running in a user namespace or not so long as the uevent socket is opened in a network namespace that is owned by the initial user namespace. Which is ofc the case if you share the network namespace.

However, uevents carry uid parameters in their associated creds. That uid was generated based on the owning user namespace of the network namespace of the uevent socket. Since the uevent socket was opened in the network namespace of the host the uid is generated from the initial user namespace.

Now, if udev in the container listens for uevents, it will discard all uevents that were sent from a non-root uid. Since the uid of the uevent was generated in the initial user namespace, the uid in the creds parameter will be 0, which is the uid that uevents have been tagged with by default since forever. When the kernel delivers that uevent to your container, which runs in a user namespace, it will resolve from(user-namespace-mapping, uevent->creds->uid == 0), which will give you 65534 since that uid isn't mapped in your user namespace. So udev in your container will discard it. You should be able to see this via:

udevadm --debug monitor

sd-device-monitor: Sender uid=65534, message ignored.
sd-device-monitor: Sender uid=65534, message ignored.
sd-device-monitor: Sender uid=65534, message ignored.
sd-device-monitor: Sender uid=65534, message ignored.
sd-device-monitor: Sender uid=65534, message ignored.
sd-device-monitor: Sender uid=65534, message ignored.
sd-device-monitor: Sender uid=65534, message ignored.
sd-device-monitor: Sender uid=65534, message ignored.
sd-device-monitor: Sender uid=65534, message ignored.
sd-device-monitor: Sender uid=65534, message ignored.
sd-device-monitor: Sender uid=65534, message ignored.
sd-device-monitor: Sender uid=65534, message ignored.
sd-device-monitor: Sender uid=65534, message ignored.
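
The mapping step described above can be modelled in a few lines of C. This is a simplified sketch, not kernel or systemd code: the `uid_extent` struct, the `map_into_userns()` helper and the example map (host uid 1000 mapped to container root, plus a subordinate range starting at 100000) are assumptions made for illustration. 65534 is the default overflow uid that unmapped ids are reported as.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* One line of /proc/<pid>/uid_map: "<inside> <outside> <count>". */
struct uid_extent {
    uid_t inside;   /* first uid as seen inside the user namespace */
    uid_t outside;  /* first uid as seen in the parent namespace */
    uid_t count;    /* number of consecutive uids in this extent */
};

#define OVERFLOW_UID 65534  /* default of /proc/sys/kernel/overflowuid */

/* Translate a host uid into the uid a process inside the namespace sees. */
static uid_t map_into_userns(const struct uid_extent *map, size_t n,
                             uid_t host_uid)
{
    assert(map != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (host_uid >= map[i].outside &&
            host_uid - map[i].outside < map[i].count)
            return map[i].inside + (host_uid - map[i].outside);
    }
    return OVERFLOW_UID;  /* unmapped uids show up as "nobody" */
}

/* Hypothetical map of a rootless container started by host uid 1000. */
static const struct uid_extent rootless_map[] = {
    { 0, 1000,   1     },
    { 1, 100000, 65536 },
};
```

With this example map, host uid 0 falls into no extent, so `map_into_userns(rootless_map, 2, 0)` yields 65534, which matches the sender uid in the debug output above. Host uid 1000 maps to 0 inside the container.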

brauner avatar Dec 18 '22 13:12 brauner

Thanks @brauner for the explanation! That's indeed what I see in the podman container, even with --userns=keep-id --user root:root. So I suppose getting system uevents in user namespaces would need that "container manager needs to listen to and inject into user ns" trick that you mentioned on mastodon.

martinpitt avatar Dec 18 '22 14:12 martinpitt

Interesting.

Does this mean that Toolbx/Podman should set up some sort of bridge to forward the uevents? @brauner , you mentioned cutting off the seqnum field of the uevent. Is it possible to replace the UID to be the user's non-zero UID? I have to admit that this is way beyond my understanding of udev. :)

Note that Toolbx doesn't bind mount the entire /run/udev from the host, but only /run/udev/data. So, the /run/udev/control socket is absent in the container, but since it's only accessible by UID=0 on the host, it's probably useless inside a rootless container.

debarshiray avatar Feb 01 '23 01:02 debarshiray

Interesting.

Does this mean that Toolbx/Podman should set up some sort of bridge to forward the uevents? @brauner , you mentioned cutting off the seqnum field of the uevent. Is it possible to replace the UID to be the user's non-zero UID? I have to admit that this is way beyond my understanding of udev. :)

Not currently. The uevent is generated by the kernel and the kernel's uid/gid is used because of that.

brauner avatar Feb 01 '23 09:02 brauner

Even if you inject it is my point.

brauner avatar Feb 01 '23 09:02 brauner

FTR, I was playing around with that a few weeks ago, and so far I couldn't make it work -- writing back the received uevents without SEQNUM into the container's uevent socket does absolutely nothing.

martinpitt avatar Feb 01 '23 13:02 martinpitt

FTR, I was playing around with that a few weeks ago, and so far I couldn't make it work -- writing back the received uevents without SEQNUM into the container's uevent socket does absolutely nothing.

Is there code anywhere that I can look at?

brauner avatar Feb 01 '23 14:02 brauner

Does this mean that Toolbx/Podman should set up some sort of bridge to forward the uevents? @brauner , you mentioned cutting off the seqnum field of the uevent. Is it possible to replace the UID to be the user's non-zero UID? I have to admit that this is way beyond my understanding of udev. :)

Not currently. The uevent is generated by the kernel and the kernel's uid/gid is used because of that.

I see.

Should I interpret the not currently as there is no way to make this work or as an indication of future possibilities? :)

debarshiray avatar Feb 01 '23 14:02 debarshiray

FTR, I was playing around with that a few weeks ago, and so far I couldn't make it work -- writing back the received uevents without SEQNUM into the container's uevent socket does absolutely nothing.

Is there code anywhere that I can look at?

Fwiw:

static int inject_uevent(const char *uevent, size_t len)
{
        __do_close int sock_fd = -EBADF;
        __do_free struct nlmsg *nlmsg = NULL;
        int ret;
        char *umsg = NULL;

        sock_fd = netlink_open(NETLINK_KOBJECT_UEVENT);
        if (sock_fd < 0)
                return -1;

        nlmsg = nlmsg_alloc(len);
        if (!nlmsg)
                return -1;

        nlmsg->nlmsghdr->nlmsg_flags = NLM_F_ACK | NLM_F_REQUEST;
        nlmsg->nlmsghdr->nlmsg_type = UEVENT_SEND;
        nlmsg->nlmsghdr->nlmsg_pid = 0;

        umsg = nlmsg_reserve_unaligned(nlmsg, len);
        if (!umsg)
                return -1;

        memcpy(umsg, uevent, len);

        ret = netlink_transaction(sock_fd, nlmsg->nlmsghdr, nlmsg->nlmsghdr);
        if (ret < 0)
                return -1;

        return 0;
}

        attach_userns_fd(ns_fd);

        if (!change_namespaces(pidfd, ns_fd, CLONE_NEWNET)) {
                fprintf(stderr, "Failed to setns to container network namespace: %s\n", strerror(errno));
                _exit(1);
        }

        if (inject_uevent(uevent, len) < 0) {
                fprintf(stderr, "Failed to inject uevent\n");
                _exit(1);
        }

is a rough draft of how to actually inject it.

brauner avatar Feb 01 '23 14:02 brauner

And you need to keep in mind that uevents have a particular format where each property is '\0' separated.
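
To illustrate the '\0'-separated layout together with the "cut off the seqnum" step mentioned earlier in the thread, here is a minimal sketch. It assumes the kernel-style message layout of an `action@devpath` header followed by `KEY=value` fields, each terminated by '\0'. The function name `uevent_strip_seqnum()` is hypothetical and not taken from any project in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a uevent ("add@/devpath\0KEY=value\0KEY=value\0...") from `in`
 * to `out`, dropping the SEQNUM property. Returns the number of bytes
 * written, or 0 if `out` is too small.
 */
static size_t uevent_strip_seqnum(const char *in, size_t in_len,
                                  char *out, size_t out_cap)
{
    size_t pos = 0, written = 0;

    assert(in != NULL && out != NULL);

    while (pos < in_len) {
        const char *field = in + pos;
        const char *nul = memchr(field, '\0', in_len - pos);
        /* Length of the field's text, and of the field including its '\0'
         * (an unterminated last field is copied as-is). */
        size_t text_len = nul ? (size_t)(nul - field) : in_len - pos;
        size_t with_nul = nul ? text_len + 1 : text_len;
        int is_seqnum = text_len >= 7 && memcmp(field, "SEQNUM=", 7) == 0;

        if (!is_seqnum) {
            if (written + with_nul > out_cap)
                return 0;
            memcpy(out + written, field, with_nul);
            written += with_nul;
        }
        pos += with_nul;
    }
    return written;
}
```

Because the fields are separated by '\0', the buffer cannot be treated as one C string: the length has to be carried alongside the pointer, as `inject_uevent(uevent, len)` above does.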

brauner avatar Feb 01 '23 14:02 brauner

I pushed my initial experiment to https://github.com/martinpitt/uevent-container-forwarder . It does not actually work. I tried to run this all inside the container (as container user root) -- after all, it receives the host's uevents just fine, and it has to inject it inside the container namespace anyway, right?

martinpitt avatar Feb 01 '23 14:02 martinpitt

In your code is UEVENT_SEND actually raised when writing into the socket?

#ifndef UEVENT_SEND
#define UEVENT_SEND 16
#endif

in nlmsg_type?

brauner avatar Feb 01 '23 14:02 brauner

I pushed my initial experiment to https://github.com/martinpitt/uevent-container-forwarder . It does not actually work. I tried to run this all inside the container (as container user root) -- after all, it receives the host's uevents just fine, and it has to inject it inside the container namespace anyway, right?

That depends: If the container runs in the network namespace of the host but in a different user namespace uevents are received. If it does run in a netns that is owned by another userns (i.e., the container uses a netns+userns such that the netns is owned by the container userns) then it doesn't.

So what we for example use this for is to listen for device events on the host, then use the uevents that we care about, cut off the seqnum, attach to the container userns+netns and inject the uevent.

brauner avatar Feb 01 '23 14:02 brauner

(a) network namespace owned by initial user namespace -> receive all uevents
(b) network namespace owned by a non-initial user namespace (unshare(CLONE_NEWUSER | CLONE_NEWNET)) -> no uevents apart from network devices that are properly namespaced per netns

To inject uevent:
(1) receive it in (a)
(2) attach to (b) and inject it via UEVENT_SEND
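
The `inject_uevent()` draft earlier in the thread relies on helpers that are not shown (`netlink_open`, `nlmsg_alloc`, `nlmsg_reserve_unaligned`, `netlink_transaction`). As a rough, self-contained sketch of what ends up on the wire for step (2), the following builds the same message using only `<linux/netlink.h>`. The function name `build_uevent_send()` is hypothetical; the header fields mirror the ones set in that draft.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <linux/netlink.h>

#ifndef UEVENT_SEND
#define UEVENT_SEND 16
#endif

/*
 * Fill `buf` with a netlink header followed by the raw uevent payload.
 * Returns the total message length, or 0 if `buf` is too small.
 */
static size_t build_uevent_send(const char *uevent, size_t len,
                                void *buf, size_t cap)
{
    struct nlmsghdr hdr;
    size_t total = NLMSG_LENGTH(len);  /* aligned header + payload */

    assert(uevent != NULL && buf != NULL);
    if (total > cap)
        return 0;

    memset(&hdr, 0, sizeof(hdr));
    hdr.nlmsg_len = (unsigned int)total;
    hdr.nlmsg_type = UEVENT_SEND;
    hdr.nlmsg_flags = NLM_F_ACK | NLM_F_REQUEST;
    hdr.nlmsg_pid = 0;

    memcpy(buf, &hdr, sizeof(hdr));
    memcpy((char *)buf + NLMSG_HDRLEN, uevent, len);
    return total;
}
```

The resulting buffer would then be sent on an `AF_NETLINK` socket of protocol `NETLINK_KOBJECT_UEVENT`, opened only after the process has attached to the target user and network namespaces, since the socket belongs to the network namespace it was created in.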

brauner avatar Feb 01 '23 14:02 brauner

I ran into this issue before and ended up figuring out that messages are only sent to netns owned by the root userns. What I didn't really understand is why. From what you said I guess it's about network devices that are not "properly namespaced per netns". What would proper namespacing for e.g. a drm/kms node look like?

swick avatar Feb 01 '23 16:02 swick

I ran into this issue before and ended up figuring out that messages are only sent to netns owned by the root userns. What I

See above.

didn't really understand is why. From what you said I guess it's about network devices that are not "properly namespaced per

Network devices are properly namespaced is what I said. They generate an "add" uevent in the target network namespace that isn't seen in any other network namespace and the /sys/class/net entries are properly namespaced as well. I did that work a few years back. See https://patchwork.kernel.org/project/linux-pm/cover/[email protected]/ for parts of it.

netns". What would proper namespacing for e.g. a drm/kms node look like?

You would need full namespace of device numbers, devtmpfs, sysfs, kernfs, and all of the core device infrastructure. I have a lot of that work done in the context of loopfs but that has been intensely resisted by upstream for years because in their mind devices belong to the host not ever to a container which is debatable.

Uevent injection allows you to get something close to this from userspace provided the correct infrastructure would be built.

brauner avatar Feb 01 '23 16:02 brauner

Network devices are properly namespaced is what I said.

Yeah, I got that. I wondered why events from some devices are restricted to netns belonging to the initial userns and the answer to that seems to be "because they are not properly namespaced". Still doesn't explain why this is necessary. Why can't we send the uevent to all netns? Is it just to avoid the traffic?

You would need full namespace of device numbers, devtmpfs, sysfs, kernfs, and all of the core device infrastructure.

Mh, right.

It is kind of weird. DRM KMS nodes are not network devices. One would expect them to work fine if you receive a fd but e.g. for hotplug detection you need the uevent which is subject to netns rules. We already send around KMS fds and just hope it works.

Makes me wonder if there isn't another mechanism to get events to user space for e.g. hotplug which isn't uevents and works just fine as long as you have the fd.

swick avatar Feb 01 '23 22:02 swick

Network devices are properly namespaced is what I said.

Yeah, I got that. I wondered why events from some devices are restricted to netns belonging to the initial userns and the answer to that seems to be "because they are not properly namespaced". Still doesn't explain why this is necessary. Why can't we send the uevent to all netns? Is it just to avoid the traffic?

Network devices have one network namespace as owner. So if you move a network device from the initial network namespace to another network namespace, then a REMOVE event will be generated in the network namespace the device is moved from and an ADD event in the network namespace it is moved to. Why should an ADD event be generated in a network namespace for a device that doesn't exist in there and isn't even accessible in there, such as a veth device?

If we generate uevents for all namespaces, then we not only flood everyone with useless messages for devices they don't have access to, it also makes it difficult to handle properly namespaced devices correctly. Plus it's a backward compatibility issue as well.

brauner avatar Feb 02 '23 08:02 brauner

Sorry for being this naive about everything. If a device is not properly namespaced then it can be used in all namespaces, right? In that case the message would be relevant in all namespaces.

Plus it's a backward compatibility issue as well.

How is that?

swick avatar Feb 08 '23 20:02 swick

I pushed my initial experiment to https://github.com/martinpitt/uevent-container-forwarder . It does not actually work. I tried to run this all inside the container (as container user root) -- after all, it receives the host's uevents just fine, and it has to inject it inside the container namespace anyway, right?

That depends: If the container runs in the network namespace of the host but in a different user namespace uevents are received. If it does run in a netns that is owned by another userns (i.e., the container uses a netns+userns such that the netns is owned by the container userns) then it doesn't.

Toolbx containers use the same network namespace as the host, but do use a (actually two) separate user namespace(s). Assuming that we want to get this to work with Toolbx, we can ignore the second (ie., the netns owned by separate userns) scenario, no?

debarshiray avatar Feb 10 '23 02:02 debarshiray

Ping.

Do you think we can do something to make the situation better here? Unfortunately, I am a bit clueless about the depths of udev, so I am unable to make out where things stand at the moment.

debarshiray avatar Mar 09 '23 14:03 debarshiray

(a) network namespace owned by initial user namespace -> receive all uevents
(b) network namespace owned by a non-initial user namespace (unshare(CLONE_NEWUSER | CLONE_NEWNET)) -> no uevents apart from network devices that are properly namespaced per netns

@brauner I just checked again and toolboxes use the same netns as the initial user namespace, however I can observe uevents with udevadm monitor on the host but not in the toolbox.

Something here is not adding up.

swick avatar Aug 14 '23 21:08 swick

ugh, completely forgot the part about the uid. all makes sense again...

swick avatar Aug 21 '23 14:08 swick

Hello, have you found any solution @martinpitt ? I am running my desktop environment in a container and Hyprland is not receiving monitor events... Do you have the script that reinjects the events with the virtual UID of the container's root user?

mlophez avatar Oct 01 '24 06:10 mlophez

@mlophez No, I didn't get any further, I'm afraid. I abandoned the "desktop env from a container" idea, and just continue to build my whole OS as an OSTree image in a container format: https://piware.de/post/2020-12-13-ostree-sway/ I'm still very happy with this.

martinpitt avatar Oct 01 '24 07:10 martinpitt