systemd RFC: Running unpriv containers from directory trees in $HOME

Component

systemd-nsresourced

Describe the solution you'd like

I am not a fan of it, but it is kinda popular to have container trees placed directly in a subdir of an unpriv home dir. And who knows, maybe it is handy in many cases.

So far unpriv nspawn doesn't support that model. But let's see if we can do something about this to improve the situation, without having to resort to persistent subuid/subgid assignments, and with a somewhat sane security model.

hence, here's an idea:

let's set aside some fixed 64K UID/GID range, that is called the fixed container range, "FCR"
in nsresourced, provide an API where clients pass in an fd to a directory inode plus a userns fd. nsresourced then checks that the inode is owned by the FCR+0 UID and that the parent inode is owned by the client. It then asks polkit for an OK (which we'd allow by default). If all of that passes, we return a cloned mount of the provided directory, with an idmapping applied that maps from FCR+0 to the provided userns.
in nspawn we'd use this when called unpriv with a directory image.
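
For illustration, the two ownership checks described above could look roughly like this. It is a minimal sketch, not nsresourced code: `may_attach` and `FCR_BASE` are hypothetical names, the base value is a placeholder (the proposal does not fix one), and the polkit query and the idmapped mount itself are left out.

```python
import os
import stat

# Hypothetical placeholder: the proposal does not say where the FCR starts.
FCR_BASE = 2_000_000_000


def may_attach(dir_fd: int, client_uid: int, fcr_base: int = FCR_BASE) -> bool:
    """Return True if the directory behind dir_fd passes both ownership checks."""
    st = os.fstat(dir_fd)
    # The fd must refer to a directory inode.
    if not stat.S_ISDIR(st.st_mode):
        return False
    # Check 1: the tree's top-level inode must be owned by FCR+0.
    if st.st_uid != fcr_base:
        return False
    # Check 2: the parent inode must be owned by the requesting client.
    # Resolving ".." relative to the fd avoids re-walking a path by name.
    parent_fd = os.open("..", os.O_RDONLY | os.O_DIRECTORY, dir_fd=dir_fd)
    try:
        return os.fstat(parent_fd).st_uid == client_uid
    finally:
        os.close(parent_fd)
```

Working on file descriptors rather than paths matters here: the caller hands over an fd, and every check is made against the inode that fd pins, not against whatever a path happens to resolve to later.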

From my PoV I think the security model is relatively unproblematic. Ideas?

UID-based quota would be useless for FCR, but it inherently is for non-disk-image based containers anyway.

open questions:

how to initially create an FCR-owned dir like this? (maybe just an additional API in nsresourced)
how to finally remove an FCR-owned dir like this? (same? but this is difficult: i.e. pin the dir to remove without any chance for the user to interfere)
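
To make the "pin the dir" concern a bit more concrete, here is a rough sketch of fd-based removal. It is only an illustration of the general technique, not a proposal for the actual API: `remove_tree_pinned` and `expected_uid` are hypothetical names, with `expected_uid` standing in for FCR+0.

```python
import os
import stat

_DIR_FLAGS = os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW


def _empty_dir(dir_fd: int) -> None:
    """Delete everything below dir_fd, always addressing entries relative to an fd."""
    for name in os.listdir(dir_fd):
        st = os.stat(name, dir_fd=dir_fd, follow_symlinks=False)
        if stat.S_ISDIR(st.st_mode):
            # O_NOFOLLOW makes this fail if the entry was swapped for a symlink.
            child_fd = os.open(name, _DIR_FLAGS, dir_fd=dir_fd)
            try:
                _empty_dir(child_fd)
            finally:
                os.close(child_fd)
            os.rmdir(name, dir_fd=dir_fd)
        else:
            os.unlink(name, dir_fd=dir_fd)


def remove_tree_pinned(parent_fd: int, name: str, expected_uid: int) -> None:
    """Remove directory `name` below parent_fd if it is owned by expected_uid."""
    # Pin the directory first; all ownership checks are made on this fd.
    fd = os.open(name, _DIR_FLAGS, dir_fd=parent_fd)
    try:
        if os.fstat(fd).st_uid != expected_uid:
            raise PermissionError(f"{name!r} is not owned by UID {expected_uid}")
        _empty_dir(fd)
    finally:
        os.close(fd)
    # This last step still goes by name. rmdir only removes empty directories,
    # so a rename race here can at worst make the call fail or remove an empty dir.
    os.rmdir(name, dir_fd=parent_fd)
```

The contents are emptied through the pinned fd, so renaming the directory in the parent while the removal runs does not redirect the deletion to some other tree.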

the model would be entirely generic of course, i.e. any other container manager could make their containers owned by FCR and then acquire a userns + uidmap mount for it too, this way.

Aug 27 '24 13:08 poettering

So currently in mkosi the entire tree is owned by the user itself and I use --private-users=$UID to map it to root. Extremely hacky, but the only thing that works with nspawn at the moment. So something like this would be great to get rid of that hack and allow running nspawn without privs.

Aug 27 '24 13:08 DaanDeMeyer

In mkosi I've also tied booting from a directory tree (both VM with virtiofs and nspawn) to --ephemeral. In other words, you have to use --ephemeral when you boot such trees, otherwise you'll end up with non user owned files in your home dir. It might make sense to imply the same with this API.

Another insane approach that I thought of was using seccomp notify or BPF to intercept chown, stat and whatnot, and store non-root UIDs/GIDs as xattrs on the files/dirs themselves, so that the actual files/dirs can all be owned by the user itself on the host.

Aug 28 '24 14:08 DaanDeMeyer

> It might make sense to imply the same with this API.

Hmm? no, this would grant a full 64K to each container, and it would be mapped in full to the suggested "FCR" range.

Aug 28 '24 16:08 poettering

> > It might make sense to imply the same with this API.

> Hmm? no, this would grant a full 64K to each container, and it would be mapped in full to the suggested "FCR" range.

Yes but you still only have one user UID. So you'd still end up with files owned by different UIDs in the user's home directory with this approach no?

Aug 28 '24 16:08 DaanDeMeyer

> Yes but you still only have one user UID. So you'd still end up with files owned by different UIDs in the user's home directory with this approach no?

Nope. nsresourced hands out 64K UID assignments to unpriv users, if they supply an uninitialized userns. The trick is that processes in that userns later will not be able to write anywhere except for allowlisted mounts (this is enforced by lsmbpf). Thus the idea here is that this happens:

1. nspawn allocates a userns
2. nspawn asks nsresourced to transiently assign a 64K range to it
3. nspawn asks nsresourced to attach a directory to it, which nsresourced makes a clone of, then installs a uid mapping on that clone that maps the FCR range of the files to the UID range of the client's userns
4. nspawn then moves its payload into that userns

This way the nspawn container has a full 64K at runtime from some dynamic high UID range, but on disk this is mapped to the fixed FCR range. This way, the dynamic UID assignments are strictly transient and never hit the disk.

I think this gets us pretty OK behaviour: runtime objects (i.e. processes) are neatly isolated via UIDs, because each container will get a dynamic range assigned, different from all other concurrent containers. And on-disk objects get a fixed UID range, but can express a full 64K range. Dynamic ranges are never persisted.
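
The arithmetic behind that mapping is simple enough to sketch. The snippet below is an illustration only: the function names and the example base values are made up, and the real translation is done by the kernel's idmapped mount, not by userspace code like this.

```python
RANGE_SIZE = 65536  # one full 64K range per container


def disk_to_runtime(disk_uid: int, fcr_base: int, dyn_base: int):
    """Translate an on-disk UID in the FCR to the container's dynamic host UID."""
    offset = disk_uid - fcr_base
    if not 0 <= offset < RANGE_SIZE:
        return None  # outside the FCR: not covered by the mapping
    return dyn_base + offset


def runtime_to_disk(runtime_uid: int, fcr_base: int, dyn_base: int):
    """Translate a dynamic host UID back to the UID that is stored on disk."""
    offset = runtime_uid - dyn_base
    if not 0 <= offset < RANGE_SIZE:
        return None  # outside this container's dynamic range
    return fcr_base + offset
```

With a made-up FCR base of 2000000000 and two concurrent containers at the made-up dynamic bases 500000000 and 500065536, the on-disk UID 2000000033 shows up as host UID 500000033 in the first container and 500065569 in the second, while a file created by either lands on disk as 2000000033 again. Inside both containers it is simply UID 33.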

Aug 28 '24 16:08 poettering

> > Yes but you still only have one user UID. So you'd still end up with files owned by different UIDs in the user's home directory with this approach no?

> Nope. nsresourced hands out 64K UID assignments to unpriv users, if they supply an uninitialized userns. The trick is that processes in that userns later will not be able to write anywhere except for allowlisted mounts (this is enforced by lsmbpf). Thus the idea here is that this happens:
>
> 1. nspawn allocates a userns
> 2. nspawn asks nsresourced to transiently assign a 64K range to it
> 3. nspawn asks nsresourced to attach a directory to it, which nsresourced makes a clone of, then installs a uid mapping on that clone that maps the FCR range to the UID range of the client's userns
> 4. nspawn then moves its payload into that userns
>
> This way the nspawn container has a full 64K at runtime from some dynamic high UID range, but on disk this is mapped to the fixed FCR range. This way, the dynamic UID assignments are strictly transient and never hit the disk.

Right but you'd have to chown the entire directory tree to the FCR before being able to use this I guess? That's the part I was unclear about. And you wouldn't be able to just rm -rf the dir as the UIDs would be the FCR's and not your own user's UID.

Aug 28 '24 16:08 DaanDeMeyer

> Right but you'd have to chown the entire directory tree to the FCR before being able to use this I guess? That's the part I was unclear about. And you wouldn't be able to just rm -rf the dir as the UIDs would be the FCR's and not your own user's UID.

yeah, see the open questions above.

but i think we should just provide an api to create a dir owned by FCR+0 in a dir owned by the user. and an api to remove a dir owned by FCR+0 in a dir owned by the user. both in nsresourced too.

hence a container manager could create a container dir with that, then unpack a container into it, and finally remove it again.

maybe we could even provide a convenience tool somewhere that supports unpacking some archive into such a dir. i.e. we support generating archives from DDIs after all already, linking against libarchive.

It might be cool for systemd-repart too if you can specify a tarball to turn into a partition/DDI, we could use the same mechanism for that.

Aug 28 '24 17:08 poettering

I still wonder if you could change things up a bit by enforcing --ephemeral or --volatile or something similar, where the tree itself is not owned by the FCR but by the user's UID. You would then either copy and chown the tree, or put it in an overlayfs with the user's UID mapped to root in the dynamic userns, where all writes go to some temporary directory that is removed again when the container exits. Then you don't need files in the user's home directory owned by UIDs in the FCR. Of course you wouldn't be able to persist anything like this, so the use cases would be more limited. Though you could mount in more directories where files could be persisted if you wanted, for example with --bind-user. Then you'd have ephemeral dev containers where you can persist stuff in your home directory, but all writes outside that are ephemeral.

Aug 28 '24 17:08 DaanDeMeyer

Thinking further, for most testing use cases using containers that need a full system container, you'd only really need 64k UIDs for /var, /tmp, /dev and /run. All of those except /var are already tmpfs and if you make journald smart enough to notice if /var is tmpfs and adjust its config, you can probably get away with /var being a tmpfs as well.

With that assumption all you'd really need is for nsresourced to give you a tmpfs with a 64k UID range, and in that case, you don't need to worry about persisting anything in the first place, so no need for the FCR at all. For the regular OS tree you just map the user's UID to root and the home directory you map in as the user itself or mapped to root.

Aug 28 '24 18:08 DaanDeMeyer

You could also create the /var directory somewhere in /var on the host and add a tmpfiles snippet to automatically clean it up just like we do for RootEphemeral=. Then it is persisted but outside of the user's home directory and automatically cleaned up after the container exits.

Aug 28 '24 19:08 DaanDeMeyer

So hmm, I guess for me all this wouldn't work anyway right now, since uidmaps are not stackable at the moment, and homed applies a uidmap on $HOME, hence spawning containers off it directly cannot work. Sniff.

Aug 29 '24 14:08 poettering

I guess this is pretty much implemented these days, since 88252ca88932b733ead989b6c5cece22ea37941b. Closing.

Oct 29 '25 15:10 poettering