Add support for chowning across upgrades
Splitting this out from so many issues; https://github.com/bootc-dev/bootc/issues/673 and https://gitlab.com/fedora/bootc/tracker/-/issues/50 are big ones, but those have a lot of links to many prior discussions.
In this proposal we would add to bootc first class support for automatically resetting file ownership even if uids drift.
Let's take the case of openvswitch again. It has /etc/openvswitch owned by the openvswitch user/group, and /var/log/openvswitch.
First, let's now assume that /var/log/openvswitch gets hard converted to tmpfiles.d (in the package by default). This is what we want anyways, and scopes the problem down to /etc.
Option A: Forcibly allocating at system instantiation time
If systemd-sysusers is in use, we know whether a uid/gid is floating or not. Here, we could have something like bootc container commit add an xattr user.bootc.owner with a value in <name>:<group> syntax (where either of these could be empty). The idea behind using xattrs is that even though tar (as used by container runtimes) has support for symbolic usernames, container runtimes don't.
When a deployment is being created in a given stateroot, we basically do:
- write new deployment (including copying /etc from current one, with current value of openvswitch user). The new deployment's /etc would have these xattrs (from the tar stream)
- Run systemd-sysusers in the new deployment root to ensure we pick up new users/groups pre-upgrade
- Walk /etc in the new deployment, and chown files using the /etc/passwd from the new deployment's password database
This would work pretty well because of how we inherit ostree's multiple copies of /etc; we wouldn't be mutating the system live at all.
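To make the last step concrete, here is a minimal sketch of the walk-and-chown pass, assuming the xattr name user.bootc.owner proposed above and a deployment root whose etc/passwd and etc/group have already been updated by systemd-sysusers. The function names are hypothetical; this is not existing bootc code.

```python
import os

XATTR = "user.bootc.owner"  # xattr name proposed in this issue, not an existing interface


def parse_owner(value):
    """Split '<name>:<group>' into (user, group); an empty half becomes None."""
    user, _, group = value.partition(":")
    return (user or None, group or None)


def load_ids(path):
    """Map names to numeric IDs from a passwd- or group-format file."""
    ids = {}
    with open(path) as f:
        for line in f:
            fields = line.strip().split(":")
            if len(fields) >= 3 and fields[2].isdigit():
                ids[fields[0]] = int(fields[2])
    return ids


def fix_ownership(root):
    """Chown entries under <root>/etc using the deployment's own user database."""
    uids = load_ids(os.path.join(root, "etc/passwd"))
    gids = load_ids(os.path.join(root, "etc/group"))
    for dirpath, dirnames, filenames in os.walk(os.path.join(root, "etc")):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                raw = os.getxattr(path, XATTR, follow_symlinks=False)
            except OSError:
                continue  # no xattr (or unsupported): leave ownership alone
            user, group = parse_owner(raw.decode())
            # -1 tells chown to leave that ID unchanged (also used for unknown names)
            uid = uids.get(user, -1)
            gid = gids.get(group, -1)
            os.chown(path, uid, gid, follow_symlinks=False)
```

Because this runs against the new deployment's copy of /etc, nothing on the running system is touched.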
Option B:
I was going to type something else here but actually I like the above enough that I think it makes the most sense.
> The idea behind using xattrs is that even though tar (as used by container runtimes) has support for symbolic usernames, container runtimes don't.
That said we should definitely think about what it might look like to add such support to container runtimes (and the OCI spec, which has little to say currently AFAIK on this topic).
One important thing to understand here is that the tar format requires numeric uid/gid, with optional username/groupname. At least podman for example always omits username/groupname from the archives it generates.
But (in a quick test) including username/groupname in a container image works fine, the usernames are just ignored by the tar importer.
Goal: Floating users/groups in OCI
Our goal here is to define a mechanism in OCI to declare/create "floating" users and groups.
Mechanism to provide symbolic names
Ideally there'd be a nice way at container build time to say "I want this file ownership to be floating". The only real compatible way I can imagine to do this is setting an xattr, let's say user.oci.owner.user and/or user.oci.owner.group or so.
When creating the tar stream for a given layer, the container build environment would detect these xattrs and mark them as floating in the tar stream.
Tar serialization
There's no standard really in tar to say "this user is floating" - you have to provide numbers (and the same with cpio, where there's no symbolic names at all). Using root is easiest.
We can't just convert the user.oci.owner.user xattr into the standard tar uname header: we couldn't then distinguish whether the username was injected intentionally. Simplest is to keep the xattr, and leave the standard tar headers unfilled (just to avoid duplication).
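As a rough illustration of that serialization, the sketch below writes a layer entry owned by root with empty uname/gname and carries the floating owner as an xattr. It assumes the xattr names user.oci.owner.user and user.oci.owner.group floated above, and uses the SCHILY.xattr.* pax keyword convention that GNU tar and libarchive use for xattrs; add_floating_file is a hypothetical helper.

```python
import io
import tarfile

XATTR_PREFIX = "SCHILY.xattr."  # pax keyword prefix commonly used for xattrs


def add_floating_file(tar, name, data, owner_user=None, owner_group=None):
    """Add a regular file as root:root, recording the floating owner only as xattrs."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    info.uid = info.gid = 0          # numeric IDs are mandatory; root is the easy choice
    info.uname = info.gname = ""     # leave the symbolic tar headers unfilled
    if owner_user:
        info.pax_headers[XATTR_PREFIX + "user.oci.owner.user"] = owner_user
    if owner_group:
        info.pax_headers[XATTR_PREFIX + "user.oci.owner.group"] = owner_group
    tar.addfile(info, io.BytesIO(data))


def build_layer():
    """Return the bytes of a one-file pax tar layer, for demonstration."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        add_floating_file(tar, "etc/openvswitch/conf.db", b"data",
                          owner_user="openvswitch", owner_group="openvswitch")
    return buf.getvalue()
```

A consumer that does not know about these xattrs simply sees a root-owned file, which matches the point above about root being the easiest fallback.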
Dynamic mapping at runtime
This is of course the hard part.
Dynamic mapping for bootc
For bootc, there's going to be a huge difference between dynamic uids for files in /etc - which we can pretty easily chown based on the user database, and files in /usr (or the base image) which is messier.
Dynamic mapping for /etc
For a persistent /etc (the default) as noted above this is pretty straightforward because we already have a writable copy, so we can just chown files based on the current user database. Simple and straightforward.
Dynamic mapping for image content
There's not a lot of good use cases for this, honestly. The one most likely to hit is setuid binaries (ref https://github.com/cockpit-project/cockpit/pull/16811 ) - cockpit had a floating user with a setuid binary.
But there are some random oddball cases (as noted in a fedora-devel thread) where e.g. some package made their systemd unit in /usr/lib/systemd/system/foo.service owned by their floating user, which is basically nonsensical...
Anyways though, let's say we want to handle this. One thing that we have now with composefs (in an unsealed state) is that it can actually function as an indirection layer! We can leave the object store on disk as root, and when synthesizing the composefs for the new OS, set ownership just in the composefs metadata.
But what about sealed composefs (ref https://github.com/bootc-dev/bootc/issues/1190 )? One thing I could imagine here is adding another overlayfs created in the initramfs for which we do a dynamic chown. Ugly, but perhaps viable.
But in the end though I think sealed systems really want transient /etc anyways because there's just Too Much Stuff there which can be leveraged into arbitrary code. Maybe, for some use cases we get pushed into supporting a symlink like /etc/passwd -> /var/passwd, but ugh.
My hope is everyone doing sealed systems can do static users (always allocated at build time) and DynamicUser=yes.
Dynamic mapping for container runtimes
It's not clear to me that there's a strong enough use case for this really that we'd need to support it. It may just be sufficient to have a standard that in theory marks them as floating, but it's only actually implemented by tools like bootc to start.
Dynamic mapping for users should be avoided as much as possible; UIDs in Unix-like systems tend to persist in ways that are not always immediately obvious.
Wherever possible UIDs should be standardized and/or stored in a permanent way that can be sourced during upgrades so a UID never changes for the life of a system.
It is even better if they are also the same across multiple systems, because network file systems tend to break dynamic UID allocation once resources need to be shared.
> Dynamic mapping for users should be avoided as much as possible
I wouldn't say that; runtime dynamism via e.g. DynamicUser=yes has this covered well and should actually be encouraged (where it's applicable, which is definitely not everything). The problem is dynamic allocation at build time.
> Wherever possible UIDs should be standardized and/or stored in a permanent way that can be sourced during upgrades so a UID never changes for the life of a system.
Yes, for dynamic allocation that involves content owned in the image...that's what this proposal is about right? Do you have any specific concerns/suggestions with the strawman proposal above?
I am not sure I really understand what you are proposing to be honest, which is why I did not comment in detail.
However, dynamic uid allocation should be reserved exclusively for really ephemeral users that do not own any file except in /run or /tmp.
Any system user that needs to own data in /usr or will own permanent data in some data partition should have a permanent uid allocation.
For well-known system users it is probably best to have some form of global allocation.
Defining the allocation at image creation time is risky as there is no mechanism to ensure an upgraded image will retain the same ids.
Resetting ownership of files is not necessarily possible once you involve external storage, so that should be considered as last resort approach.
I'm not sure if this has been mentioned before across the various issues around this, though another approach for the drift issue I think is basically to do something similar to rpm-ostree's logic around UID/GID persistence. Here's how it could work:
- build pipeline fetches UID/GID mappings from previous build (I guess currently that's /usr/lib/{passwd,group} though this is technically independent of nss-altfiles)
- build pipeline converts it to systemd-sysusers (we can provide tooling for this -- one nice way to do this actually which combines this step and the previous is for bootc itself to know how to do this, and then one could do podman run $IMAGE bootc generate-sysusers)
- build pipeline feeds that into the build process for the next build (e.g. via a --secret)
  - this works for both FROM scratch builds (see https://github.com/coreos/rpm-ostree/pull/5427) and true derivations, where one would install the sysusers entries before doing any package installs
(I know this is mixing up bootc vs Fedora-specific goop here, but the high-level idea itself is generic over systemd-sysusers.)
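A sketch of what the conversion step might look like, assuming passwd/group content from the previous build is available as text. The function name is hypothetical and this is not the proposed bootc generate-sysusers command; it only illustrates pinning the previous IDs as sysusers.d `g` and `u` lines.

```python
def passwd_to_sysusers(passwd_text, group_text):
    """Render sysusers.d lines that pin the IDs found in passwd/group content."""
    lines = []
    # Groups first, so each user's primary group is declared before the user.
    for entry in group_text.splitlines():
        f = entry.split(":")
        if len(f) >= 3 and f[2].isdigit():
            lines.append(f"g {f[0]} {f[2]}")
    for entry in passwd_text.splitlines():
        f = entry.split(":")
        if len(f) >= 7 and f[2].isdigit() and f[3].isdigit():
            name, uid, gid, gecos, home, shell = f[0], f[2], f[3], f[4], f[5], f[6]
            # "-" is the sysusers.d placeholder for an unset field
            gecos_field = f'"{gecos}"' if gecos else "-"
            lines.append(f"u {name} {uid}:{gid} {gecos_field} {home or '-'} {shell or '-'}")
    return "\n".join(lines) + "\n"
```

The output would be installed as a sysusers.d fragment before any package installs run, so that package-provided entries for the same names resolve to the previously used IDs.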
Filed https://github.com/bootc-dev/bootc/issues/1562. Just for reference, for anyone affected by this, a current best practice is to inject a tmpfiles.d snippet which uses Z to chown.
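For illustration, a tiny helper that renders such a Z line in the tmpfiles.d column order (type, path, mode, user, group, age). The helper name and the example path are assumptions, not part of any existing tool.

```python
def tmpfiles_chown_line(path, user, group):
    """Render a tmpfiles.d 'Z' line that recursively resets ownership of path.

    Mode and age are left as '-' so only ownership is adjusted.
    """
    return f"Z {path} - {user} {group} -"


# Example: content for a hypothetical /usr/lib/tmpfiles.d/openvswitch-chown.conf
snippet = tmpfiles_chown_line("/etc/openvswitch", "openvswitch", "openvswitch") + "\n"
```

Since systemd-tmpfiles resolves the names at boot against the current user database, the numeric IDs follow whatever the system allocated.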
Just dropping this here: I get the sense people have a (totally understandable!) feeling of "bootc is broken". But here's the thing: again, this problem is not specific to bootc, as touched on in https://lwn.net/Articles/1018082/.
But basically if you have e.g.
FROM <baseimage>
RUN apt|dnf install postgresql
VOLUME /var/lib/postgres
(That you run via docker/podman/kube - nothing related to bootc!)
And the postgres package allocates a floating postgres user, and you do the production thing of having a dedicated persistent volume for the database - you will get burned by the postgres UID changing in later versions of the image build and having a mismatch with the persistent volume.
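A small sketch of how one might detect this mismatch before starting the service, assuming you can read the image's passwd content and the mounted volume path. Both function names are hypothetical.

```python
import os


def passwd_uid(passwd_text, name):
    """Return the numeric UID for name in passwd-format text."""
    for line in passwd_text.splitlines():
        f = line.split(":")
        if len(f) >= 3 and f[0] == name and f[2].isdigit():
            return int(f[2])
    raise LookupError(f"user {name!r} not found")


def volume_uid_drift(volume_path, passwd_text, name):
    """Return (on_disk_uid, image_uid) if the volume owner differs, else None."""
    image_uid = passwd_uid(passwd_text, name)
    disk_uid = os.stat(volume_path).st_uid
    if disk_uid == image_uid:
        return None
    return (disk_uid, image_uid)
```

A non-None result means the volume was written under a different allocation than the one the current image build produced.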
bootc just enables persistent volumes by default for /etc and /var - unlike default docker/podman which have a single / (as an overlayfs) that does not in any way persist across image updates.
> this problem is not specific to bootc
That's true, but it's an unsolved problem. I'm currently using archlinux inside ostree (without bootc) and have the exact same issues. Your Option A sounds like it would work, since it basically extends the special logic for /etc to fix ownership of files which come from the image.
The bootc documentation says "With the exception of setuid or setgid binaries (which should also be strongly avoided), there is generally no valid reason for having non-root owned files in /usr or other runtime-immutable directories." Well yeah, but there are many setuid/setgid binaries and there's currently no solution for them. Look at my system for example:
# find . \! -user 0 -or \! -group 0 -print0 | xargs -0 -I {} stat --format 'Access: (%10.10A) Uid: (%5u) Gid: (%5g): %n' {}
Access: (drwxr-x---) Uid: ( 0) Gid: ( 102): ./etc/polkit-1/rules.d
Access: (-rw-r-----) Uid: ( 0) Gid: ( 971): ./etc/brlapi.key
Access: (drwxr-sr-x) Uid: ( 0) Gid: ( 981): ./var/log/journal
Access: (drwxr-sr-x) Uid: ( 0) Gid: ( 976): ./var/log/journal/remote
Access: (-rw-rw-r--) Uid: ( 0) Gid: ( 997): ./var/log/wtmp
Access: (-rw-rw----) Uid: ( 0) Gid: ( 997): ./var/log/btmp
Access: (-rw-rw-r--) Uid: ( 0) Gid: ( 997): ./var/log/lastlog
Access: (drwxrwxr-x) Uid: ( 0) Gid: ( 50): ./var/games
Access: (-rwxr-s---) Uid: ( 0) Gid: ( 982): ./usr/bin/groupmems
Access: (-rwxr-sr-x) Uid: ( 0) Gid: ( 5): ./usr/bin/wall
Access: (-rwxr-sr-x) Uid: ( 0) Gid: ( 5): ./usr/bin/write
Access: (---s--x---) Uid: ( 0) Gid: ( 81): ./usr/lib/dbus-daemon-launch-helper
Access: (-rwxr-sr-x) Uid: ( 0) Gid: ( 997): ./usr/lib/utempter/utempter
Access: (dr-xr-xr-x) Uid: ( 0) Gid: ( 11): ./srv/ftp
Now, a distro could simply modify all sysusers.d configs to include fixed IDs so these IDs never change, but that's not failsafe either, because the IDs in sysusers.d are just a suggestion. If an allocation already exists, you end up with a different ID. That can happen when:
- The static IDs got added before the first release of the ostree version of the distro.
- The static IDs have changed (by accident)
- The user installed a package after using ostree admin unlock and later, the package became part of the image.
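The "just a suggestion" behavior can be modeled roughly as follows. This is a simplified model for illustration, not the actual systemd-sysusers algorithm; the fallback pool and the function name are assumptions.

```python
def allocate(existing, name, suggested=None, pool=range(999, 0, -1)):
    """Simplified model of ID allocation with a suggested (not enforced) ID.

    existing maps names to already-allocated IDs (i.e. the current passwd).
    """
    if name in existing:
        return existing[name]          # an existing allocation always wins
    used = set(existing.values())
    if suggested is not None and suggested not in used:
        return suggested               # the suggestion is honored only if free
    # Otherwise fall back to the first free ID in the pool.
    return next(i for i in pool if i not in used)
```

In all three cases listed above, the name is already present in `existing` with some other ID, so the suggested value never takes effect.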
So effectively, I think it's impossible to have images ship with the correct IDs, because there's no such thing as a correct ID; you have to be able to deal with any passwd contents imaginable. That's why sysusers.d exists, and it does this perfectly. The issue with ostree is that it's a mount of a read-only image, which does not consider allocations at all.
So, I think there are only two options:
- On every boot, generate an overlay which adjusts permissions of files inside the image to match the contents of /etc.
- Mount the rootfs with mapped uids.
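Either option needs a translation table from the IDs baked into the image to the IDs in the running system's /etc. A minimal sketch of deriving that table by name follows; the function name is hypothetical, and how the table would be applied (overlay chown or an ID-mapped mount) is left open.

```python
def build_id_map(image_ids, host_ids):
    """Map image numeric IDs to host numeric IDs for names present in both.

    image_ids and host_ids map names to numeric IDs, e.g. parsed from the
    image's and the host's passwd (or group) databases. Only IDs that
    actually differ are returned.
    """
    mapping = {}
    for name, image_id in image_ids.items():
        host_id = host_ids.get(name)
        if host_id is not None and host_id != image_id:
            mapping[image_id] = host_id
    return mapping
```

For example, if the image was built with utmp as GID 997 but the host allocated 22, the map would contain {997: 22} and leave matching IDs untouched.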
An additional concern that just came to my mind while writing this: What happens if we roll back and forth between different versions of images and /etc contents? One could imagine a situation where the same user gets allocated with different IDs and then we end up with incompatible permissions in /var.