option to manage volume permissions
Hello,
In Kubernetes/OpenShift, when you mount a volume (through CSI or otherwise), you can configure the pod's security context and, among other things, determine which Linux user and group the volume will be mounted as:
By default, Kubernetes recursively changes ownership and permissions for the contents of each volume to match the fsGroup specified in a Pod's securityContext when that volume is mounted.
Is there anything similar in Nomad? As far as I can tell, the only alternative is to run a prestart task like the following to ensure the ownership is what you want:
task "prep-disk" {
driver = "docker"
volume_mount {
volume = "nexus-volume"
destination = "/nexus-data/"
read_only = false
}
config {
image = "busybox:latest"
command = "sh"
args = ["-c", "chown -R 200:200 /nexus-data/"]
}
resources {
cpu = 200
memory = 128
}
lifecycle {
hook = "prestart"
sidecar = false
}
}
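For completeness, the prestart task above assumes the group has already claimed the volume. A minimal sketch of the corresponding group-level volume block (the volume name, type, and access/attachment modes here are illustrative assumptions) would be:

group "nexus" {
  # claim the CSI volume so tasks in this group can mount it
  volume "nexus-volume" {
    type            = "csi"
    source          = "nexus-volume"
    read_only       = false
    attachment_mode = "file-system"
    access_mode     = "single-node-writer"
  }

  task "prep-disk" {
    # ... prestart task from above ...
  }

  task "nexus" {
    # main task mounts the same volume with its own volume_mount
    # and finds the ownership already set by the prestart task
  }
}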
If this is the recommended way, is there any way it could be added as an example in the documentation? It would appear to me to be a relatively common use case. Thanks!
Hi @josemaia! I noticed this same problem with the Sonatype Nexus container in the other issue we're working on with you. What you have here is unfortunately the only way to do this right now. I'm not entirely convinced that K8s is really doing the right thing here in allowing the job operator to recursively change ownership on the volume by default, but we'd need to look into it a bit. I'm going to mark this as an enhancement for future storage work.
Just hit this wall myself. It would be really nice if this would be documented somewhere.
Most stateful workloads whose Docker images run as a non-root user will hit this issue. For example:
- prometheus
- grafana
- loki
- elasticsearch
I'm not entirely convinced that K8s is really doing the right thing here in allowing the job operator to recursively change ownership on the volume by default, but we'd need to look into it a bit.
Me neither; that seems likely to cause more problems than it is supposed to fix. Also, recursively changing ownership might add quite a bit of time to the pre-start operations (especially for large volumes). The better option (where possible) is imo to supply this information to the CSI plugin, like one can do with my nfs plugin for this exact reason: https://gitlab.com/rocketduck/csi-plugin-nfs/-/blob/main/nomad/example.volume -- this way it will create the volume with the proper modes and uid/gid.
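As a rough illustration of that approach, a Nomad CSI volume specification passed to `nomad volume create` can carry the desired ownership via plugin parameters. The parameter keys below (uid, gid, mode) are assumptions for illustration only; the exact keys a plugin accepts are plugin-specific, so check the linked example.volume for the real ones:

id        = "nexus-volume"
name      = "nexus-volume"
type      = "csi"
plugin_id = "nfs"

capability {
  access_mode     = "multi-node-multi-writer"
  attachment_mode = "file-system"
}

# plugin-specific parameters; the keys shown here are illustrative
parameters {
  uid  = "200"
  gid  = "200"
  mode = "0770"
}

With something like this, the plugin can create the volume root with the right ownership up front, and no recursive chown is needed at mount time.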
The better option (where possible) is imo to supply this information to the CSI plugin, like one can do with my nfs plugin for this exact reason: https://gitlab.com/rocketduck/csi-plugin-nfs/-/blob/main/nomad/example.volume -- this way it will create the volume with the proper modes and uid/gid.
Agreed.
The other thing that comes up with this feature request, which has been on my mind of late, is user namespace remapping. Who "owns" the uid/gid here? The plugin is what does the mounting and any chown, but the plugin has either the host's uid/gid (if configured as we currently require) or its own uid/gid map (if configured with userns remapping itself), neither of which are the uid/gid map in the task that mounts the volume, much less some other task in another job entirely!
True, namespace remapping is becoming more and more common and there is no easy solution to that. Sure, recursive chmod/chown is an option, but if anything it should be opt-in, since doing it by default is often unnecessary or simply wrong. And it still leaves the question of who should be doing the chown, because, as you said, different tasks in a group might have different uids, etc.
Supplying a uid/gid during volume creation is not perfect either (I guess most CSI plugins don't support it, but then again many CSI plugins simply do not work easily with Nomad either ;)).
More information on what k8s does: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods as well as their support for pushing that down into the CSI layer: https://kubernetes-csi.github.io/docs/support-fsgroup.html
I'm not entirely convinced that K8s is really doing the right thing here in allowing the job operator to recursively change ownership on the volume by default, but we'd need to look into it a bit. I'm going to mark this as an enhancement for future storage work.
I agree, this behavior doesn't seem entirely correct. At the same time, as a task author it is highly desirable not to have to think about how uid/gids map from the host that bootstrapped the volume's permissions to the container/system that needs access.
My very rough idea would be to have an ACL that controls whether tasks claiming a volume can designate the owning uid/gid of the files contained within. This may open up a cross-platform can of worms not worth opening.
And there are performance considerations as well. Recursive chowning is not particularly efficient.
Any progress with this? Currently it makes the ceph-csi driver impossible to use with MS SQL Server, as it runs as user mssql:10001 with no permissions on the CSI folder.
Edit: Got it to work with @josemaia's prep-disk task workaround, thanks!
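For anyone hitting the same combination, here is a sketch of that same prestart pattern adapted for the SQL Server uid mentioned above; the volume name and the /var/opt/mssql data path are assumptions to adjust for your own setup:

task "prep-disk" {
  driver = "docker"

  volume_mount {
    volume      = "mssql-volume"
    destination = "/var/opt/mssql"
    read_only   = false
  }

  config {
    image   = "busybox:latest"
    command = "sh"
    # hand the volume to uid 10001 (the mssql user); adjust the group if your image needs it
    args    = ["-c", "chown -R 10001 /var/opt/mssql"]
  }

  lifecycle {
    hook    = "prestart"
    sidecar = false
  }
}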
Hopefully this will be fixed or properly documented somewhere.
@116davinder this is a feature request. That a feature doesn't exist isn't something to document (an infinite number of non-existent features aren't documented either). In any event, the problem to solve is non-trivial and isn't currently on the roadmap. The issue will be updated if we decide to work on it.
@tgross, if users like me have to implement hacky solutions like this, it becomes a problem when considering Nomad. I completely understand that the solution is non-trivial, as you mentioned, but not picking it up within 2-3 years of development doesn't seem right, to me at least.
I just ran into this as well. It would be really great if there was an officially supported way to do this. It's probably the only downside I have right now to using Nomad instead of k8s. And I really would prefer to use Nomad.
Hello from 2024 :) I just hit this too!
same here
Folks, please just do an emoji reaction on the issue description (ex. :+1:) if you'd just like to +1 the feature without additional context. That way you're not sending a notification to a few dozen people.
I slammed right into this earlier on my personal stack while moving a few things over to CSI with GCP's PD -- the workaround above is a "get the job done" fix, but it doesn't feel great, and I was a bit surprised more folks using CSI drivers haven't encountered this, especially when running well-designed off-the-shelf containers that aren't just using UID=0/GID=0.
In looking into the k8s implementation, here's what I understand about how it works:
- There's now support for delegating this to the CSI drivers themselves, if they expose the capability: https://github.com/kubernetes/kubernetes/blob/8d450ef773127374148abad4daaf28dac6cb2625/pkg/volume/csi/csi_mounter.go#L257 -- you of course still need the metadata to be passed in
- It falls back to the old way otherwise: https://github.com/kubernetes/kubernetes/blob/8d450ef773127374148abad4daaf28dac6cb2625/pkg/volume/csi/csi_mounter.go#L333 (the recursive chmod we all love)
While I know it's not desirable, when you dig into how it works (https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/volume_linux.go#L149), I think it's rather pragmatic -- there's a toggle where the recursive chown only applies if the ownership of the volume's root directory does not match. In practice that means it usually only applies the first time you mount a volume, and it is generally very quick: unless you're using a snapshot as a volume, or you have existing data on the volume whose root-directory UID/GID doesn't match, it just changes one or two things and gets on with it.
I'm not really aware of any other way you would fix this in providers that essentially mount block devices onto the host.
@tgross I'm willing to sink a bit of time into a first pass at implementing something akin to the Kubernetes approach, although it sounds like there may be some hesitancy to accept it. If you have a spare moment or two in the next couple of weeks, would you and your team consider whether this is a path forward? My take is that it would greatly improve the UX.
Given the above, the implementation I'd target is a less configurable, stricter one based on the ownership at the volume root, possibly configured in the volume_mount stanza:
volume_mount {
  volume      = "example"
  destination = "/mnt"

  ownership {
    group = 4000
  }
  # or just `ownership_group = 4000`
}
The presence of this block would trigger an initial check against the mount point, and if there is a mismatch it would recursively adjust the group and file permissions (the file permissions to ensure g+rw on writable volumes, g+r on read-only).
How do we feel about something like this?
Bringing some context into this issue on what the spec says about this flag:
// If SP has VOLUME_MOUNT_GROUP node capability and CO provides
// this field then SP MUST ensure that the volume_mount_group
// parameter is passed as the group identifier to the underlying
// operating system mount system call, with the understanding
// that the set of available mount call parameters and/or
// mount implementations may vary across operating systems.
// Additionally, new file and/or directory entries written to
// the underlying filesystem SHOULD be permission-labeled in such a
// manner, unless otherwise modified by a workload, that they are
// both readable and writable by said mount group identifier.
// This is an OPTIONAL field.
mount(2) doesn't have such an identifier, so I guess that's expected as a -o option and is filesystem-dependent? Or is this the mount(8) group option (i.e. for the CLI command), which doesn't set permissions recursively the way the recursive chmod fallback would, so that's all being implemented in the plugin anyway?
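For what it's worth, for filesystems that do honor uid/gid mount options (cifs, tmpfs, vfat, and similar; ext4 and xfs do not), Nomad's existing volume specification can already pass such flags through to the plugin via mount_options. This is only a sketch of that narrow case; the plugin_id and flag values are assumptions:

# volume specification excerpt -- plugin_id and flag values are illustrative
type      = "csi"
plugin_id = "smb"

mount_options {
  fs_type     = "cifs"
  # cifs honors uid=/gid=/file_mode=/dir_mode= at mount time
  mount_flags = ["uid=200", "gid=200", "file_mode=0660", "dir_mode=0770"]
}

That obviously doesn't help with block-backed filesystems like ext4, which is where the recursive chown (or the VOLUME_MOUNT_GROUP capability above) comes in.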
That seems roughly reasonable, especially if we can punt to the plugins in common cases (we'd need to update our CSI client for that). But there are some open questions here, I think:
- How does this interact with user namespace remapping?
- How does this interact with access_mode, where multiple allocations can mount the same volume on the same host, potentially with different groups (which gets even weirder with user namespace remapping)?
- In the recursive chmod case, how do we need to handle concurrency with the plugin RPCs? Right now we have per-volume serialization of batches of operations; does the recursive chmod have to happen inside that critical section?