rootless podman can't bind-mount allocdir

Open optiz0r opened this issue 1 year ago • 7 comments

Nomad considers filesystem permissions for the allocs directory to be outside of its own security model (https://developer.hashicorp.com/nomad/docs/concepts/security):

Access (read or write) to the Nomad data directory - Information about the allocations scheduled to a Nomad client is persisted to its data directory. This would include any secrets in any of the allocation's file systems.

To protect the secrets written into job allocation directories from unprivileged local users with access to the Nomad client, restrictive permissions must be set on the allocs directory (or a parent), such as 0700. The important part is that the "other" bits do not include execute (+x, octal 1), which would allow directory traversal, since secrets are written into subdirectories with permissive modes (nobody:nobody 0777).

This seems to be fundamentally incompatible with rootless containers, since the unprivileged user needs to traverse into the alloc dir in order to stat it for bind-mounting into the container. Restrictive permissions yield Driver Failure errors such as the following on container startup:

rpc error: code = Unknown desc = failed to start task, could not create container: cannot create container, status code: 500: {"cause":"permission denied","message":"statfs /data/nomad/server/alloc/1be2b692-465d-a1ac-54ff-e6f7a43c9fa4/alloc: permission denied","response":500}
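
For illustration, a minimal sketch of the failure mode (paths taken from the error above; in practice the chmod runs as root while the stat happens as the unprivileged socket owner):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Layout mirroring the error above: <data_dir>/alloc/<alloc-id>/alloc
	parent := "/data/nomad/server/alloc"
	child := parent + "/1be2b692-465d-a1ac-54ff-e6f7a43c9fa4/alloc"

	// As root: strip group/other bits so only root can traverse.
	if err := os.Chmod(parent, 0o700); err != nil {
		fmt.Println("chmod:", err)
	}

	// As the unprivileged podman socket owner: this fails with EACCES,
	// because the missing o+x bit on the parent blocks path resolution
	// even though the child itself may be 0777.
	if _, err := os.Stat(child); err != nil {
		fmt.Println("stat:", err) // "permission denied", as in the driver error
	}
}
```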

One of the benefits of rootless containers and multiple sockets would be enabling stronger isolation between users on a host. The multiple-socket setup requires that all users who will run containers under Nomad have access to the allocs directory, and therefore inherently to all the secrets written into it for all jobs run by all users. This is sadly a dealbreaker for us, since it would allow secrets to leak across user boundaries.

The only way I can think of to work around this would be Nomad setting more restrictive permissions on the alloc directory itself (i.e. the per-allocation directory named after the alloc ID), e.g. setting ownership to match the podman socket owner, plus 0700 permissions; Nomad itself, running as root, would bypass the restrictive permissions. POSIX ACLs could do the same on supported filesystems. I'm not sure if this can be practically implemented in the task driver alone, or if it would need support in Nomad core. At the very least, some information would need to be collected about which filesystem user the directory should be made accessible to. Currently the multiple-socket implementation doesn't know which user "owns" each configured socket.
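
As a rough sketch of what that could look like, run by Nomad as root after creating the per-alloc directory (the socket-owner uid is an assumption, since nothing exposes it today; the two options are alternatives, shown together for brevity):

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	allocDir := "/data/nomad/server/alloc/1be2b692-465d-a1ac-54ff-e6f7a43c9fa4"
	socketUID := 1000 // assumed: uid of the podman socket owner

	// Option 1: hand the per-alloc directory to the socket owner and
	// close it to everyone else. Root bypasses the 0700 mode.
	if err := os.Chown(allocDir, socketUID, socketUID); err != nil {
		panic(err)
	}
	if err := os.Chmod(allocDir, 0o700); err != nil {
		panic(err)
	}

	// Option 2 (POSIX ACLs, on supported filesystems): leave ownership
	// alone and grant only directory traversal to the socket owner.
	if err := exec.Command("setfacl", "-m", "u:1000:x", allocDir).Run(); err != nil {
		panic(err)
	}
}
```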

Alternatively, could this task driver bind-mount the alloc dir into some alternate path accessible only by the podman socket owner (e.g. beneath /run/user/UID), to bypass the more restrictive permissions on the parent allocs dir?

optiz0r avatar Nov 10 '24 12:11 optiz0r

The only way I can think of to work around this would be Nomad setting more restrictive permissions on the alloc directory itself (i.e. the per-allocation directory named after the alloc ID), e.g. setting ownership to match the podman socket owner, plus 0700 permissions. ... Alternatively, could this task driver bind-mount the alloc dir into some alternate path accessible only by the podman socket owner (e.g. beneath /run/user/UID), to bypass the more restrictive permissions on the parent allocs dir?

In order for Nomad to match the podman socket owner, it would need to know there was a socket at all, which Nomad itself doesn't -- only the task driver has visibility into that kind of thing. So ultimately it would have to happen in the task driver. We have some precedent for an alternate mount configuration in the recent exec2 driver: it advertises a different filesystem isolation capability, which causes Nomad to create the alloc directory in alloc_mounts so that the driver can bind-mount it into the appropriate location. There might be promise in making that available to image-based file isolation as well.
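
For reference, a driver opts into that behavior via the isolation mode it advertises in its capabilities. A sketch against Nomad's plugin API as I understand it (not verified against the podman driver, which advertises image isolation today):

```go
package main

import (
	"github.com/hashicorp/nomad/plugins/drivers"
	"github.com/hashicorp/nomad/plugins/drivers/fsisolation"
)

// capabilities sketches the flag in question: advertising the "unveil"
// mode (as exec2 does) makes the client create the alloc dir under
// alloc_mounts, leaving the driver to bind-mount it into place. The
// podman driver uses fsisolation.Image, which keeps the standard layout.
func capabilities() *drivers.Capabilities {
	return &drivers.Capabilities{
		SendSignals: true,
		FSIsolation: fsisolation.Unveil,
	}
}

func main() { _ = capabilities() }
```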

tgross avatar Nov 11 '24 20:11 tgross

alloc_mounts might work here as long as the landlock/unveil permissions apply to all processes, not just those spawned by the Alloc. The alloc_mounts dir would have to be world-traversable (OK provided every alloc is landlocked), and then the actual alloc dir only accessible to the user for whom the alloc is running.

I'm not that familiar with landlock: do the access grants apply to all processes started by a single uid, or to a process (tree)? Given the way the podman task driver works, with a rootful process reaching out to a socket to start the container, I'm not sure the latter is viable. So even with alloc_mounts, Nomad core is probably still going to need to know which uid(s) should have access to the alloc dir. That'll have to be communicated to Nomad somehow, either via additional parameters in the job spec (task.user has other side effects in the docker/podman drivers, and so cannot be reused for this), or back from the task driver itself.

Other complications: allocs can contain tasks using different task drivers, or tasks run under different user accounts. A documented limitation that all tasks in such an alloc must run under the same driver/user would be fine for me, at least.

optiz0r avatar Nov 12 '24 09:11 optiz0r

Oh, to be clear, I wasn't suggesting that we use Landlock for the podman driver. Landlock only restricts the process it's being called from, so that doesn't really help. Just that having a separate source for the allocdir would allow for the following workflow:

  • Nomad creates the allocdir in the alloc_mounts dir.
  • The task driver bind-mounts that alloc_mounts dir into the standard allocdir location, setting permissions it knows about from the plugin config.
  • Then podman processes bind-mount from the standard allocdir location, using their own permissions.

Mind, this is all in my head and I haven't actually tried implementing any of it. :grin:
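
To make step 2 of the list above concrete, an untested sketch (all paths, the uid, and the mode are stand-ins for whatever the plugin config would supply):

```go
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Step 1 output: Nomad created the allocdir under alloc_mounts (assumed path).
	src := "/opt/nomad/data/alloc_mounts/1be2b692-465d-a1ac-54ff-e6f7a43c9fa4"
	// Step 2: the task driver bind-mounts it to the standard allocdir
	// location and applies permissions it knows from its plugin config.
	dst := "/data/nomad/server/alloc/1be2b692-465d-a1ac-54ff-e6f7a43c9fa4"

	if err := os.MkdirAll(dst, 0o700); err != nil {
		panic(err)
	}
	if err := unix.Mount(src, dst, "", unix.MS_BIND, ""); err != nil {
		panic(err)
	}
	if err := os.Chown(dst, 1000, 1000); err != nil { // assumed socket-owner uid
		panic(err)
	}
	if err := os.Chmod(dst, 0o700); err != nil {
		panic(err)
	}
	// Step 3: podman processes then bind-mount from dst into the
	// container as usual, using their own permissions.
}
```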

tgross avatar Nov 12 '24 13:11 tgross

Just that having a separate source for the allocdir would allow for the following workflow:

  • Nomad creates the allocdir in the alloc_mounts dir.
  • The task driver bind-mounts that alloc_mounts dir into the standard allocdir location, setting permissions it knows about from the plugin config.
  • Then podman processes bind-mount from the standard allocdir location, using their own permissions.

I'm not sure that works. If landlock is not being used, then the alloc_mounts dir needs to be just as protected as the normal Nomad allocs dir, i.e. non-root should not be able to traverse through it, in which case neither location is usable by the rootless container. There can't be a single alternate allocs dir shared across all users unless it has some external protection.

This task driver could bind-mount the alloc dir into a user-private location such as /run/user/UID. The driver does not currently understand unix identities for setting directory permissions, but could be extended to do so.
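
A hypothetical sketch of that extension: teach each configured socket about its owner, then bind the alloc dir to a path under that user's runtime dir. The OwnerUID field and the helper are invented for illustration alongside the existing name/socket path settings; none of this plumbing exists in the driver today.

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// SocketConfig loosely mirrors a configured podman socket, extended
// with a hypothetical OwnerUID so the driver knows which unix identity
// should be able to reach the alloc dir.
type SocketConfig struct {
	Name       string
	SocketPath string
	OwnerUID   int // hypothetical addition
}

// mountAllocDirFor binds the real alloc dir to a user-private path,
// bypassing the restrictive permissions on the shared parent. On
// systemd systems /run/user/UID is already 0700 and owned by the user.
func mountAllocDirFor(sc SocketConfig, allocDir, allocID string) error {
	dst := fmt.Sprintf("/run/user/%d/nomad/%s", sc.OwnerUID, allocID)
	if err := os.MkdirAll(dst, 0o700); err != nil {
		return err
	}
	return unix.Mount(allocDir, dst, "", unix.MS_BIND, "")
}

func main() {
	sc := SocketConfig{
		Name:       "alice",
		SocketPath: "/run/user/1000/podman/podman.sock",
		OwnerUID:   1000,
	}
	allocID := "1be2b692-465d-a1ac-54ff-e6f7a43c9fa4"
	if err := mountAllocDirFor(sc, "/data/nomad/server/alloc/"+allocID, allocID); err != nil {
		panic(err)
	}
}
```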

optiz0r avatar Nov 13 '24 10:11 optiz0r

I'm not sure that works. (snip)

Bah, yeah, you're right... this sort of thing has been one of the big barriers to a rootless Nomad client.

This task driver could bind-mount the alloc dir into a user-private location such as /run/user/UID. The driver does not currently understand unix identities for setting directory permissions, but could be extended to do so.

Sounds like the way to go.

tgross avatar Nov 13 '24 14:11 tgross

I wanted to chime in here and say I took a quick stab at this via the idea discussed:

bind-mount the alloc dir into a user-private location such as /run/user/UID

Everything worked nicely until I tested more complex networking scenarios, i.e. bridge networking. The issue is that in these scenarios the Nomad client does the network namespace setup, but has no idea the task driver is going to run the task as non-root, so it creates the namespace as root, and then the container fails to join it.

I still like this approach but think we would need to bring this logic up into the Nomad client. That keeps all the actual setup in the client, and would hopefully make it easy to run other task drivers as non-root as well.

Unfortunately I need to prioritize some other things at the moment, but I'll put up my quite hacky code as a reference for others.

Edit: coming back to this comment to add that setting the task parameter user allows bridge mode to operate correctly.

mismithhisler avatar May 01 '25 18:05 mismithhisler

coming back to this comment to add that setting the task parameter user allows bridge mode to operate correctly.

This has negative consequences for the task driver, though. It squashes all processes running inside the container to the task.user uid, which prevents multi-user containers from working and typically breaks permissions for third-party images (meaning images must be extended to support running as an unexpected UID, or without namespaced-root privileges). This is currently how we're running rootful containers.

Part of what we're trying to achieve with rootless podman support in Nomad is to avoid squashing containers into unexpected UIDs, so setting task.user to fix networking negates most of the benefits of rootless podman in the first place.

optiz0r avatar Jun 05 '25 10:06 optiz0r