Cannot execute binaries stored in an NFS Server running on a Bottlerocket node
Image I'm using:
Bottlerocket OS 1.20.4 (aws-k8s-1.30)
Context: We have some software that runs multiple pods for multiple stages in a pipeline. To complete this dynamically and allow retries on specific steps, we spawn short-lived pods that connect to an NFS server running in-cluster for their ephemeral data. A typical installation starts with the orchestrator and the NFS server. When the orchestrator receives a piece of work, it will:
- Create a subfolder in the NFS server, and download any required executables to it
- Spawn a pod, which runs the executables in the subfolder
- Once the pod has finished running, the subfolder gets removed
The NFS server is a simple variant of this alpine server.
What I expected to happen: When running an NFS server in a container on Bottlerocket, you are able to execute files on the share from a mount in a different container.
What actually happened:
The nfsd process is denied execute access, as shown in this AVC denial log:
```
Jul 26 01:20:06 ip-10-0-19-55.ap-southeast-2.compute.internal audit[2830356]: AVC avc: denied { execute } for pid=2830356 comm="nfsd" name="bootstrapRunner" dev="nvme1n1p1" ino=151427631 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:data_t:s0:c432,c649 tclass=file permissive=0
```
From what I can tell, this is because the process is running as a kernel task, even though it's actually exposing data from a share inside a container. My current line of thinking is that, because the container is privileged, it hooks into the kernel-level NFS support. The nfsd processes have the system_u:system_r:kernel_t:s0 SELinux context and are not children of the NFS server pod.
What I've tried to do to work around the problem:
I've attempted to work around this problem by using EFS rather than hosting NFS locally, but when using access points and dynamically provisioned volumes, chmod commands get permission denied, which breaks many scripts (and even tar in some cases).
How to reproduce the problem:
To reproduce the problem, create the resources I've added below in a Kubernetes cluster running Bottlerocket OS 1.20.4. I have been doing this in an AWS EKS cluster.
You will be able to see the logs after running logdog from the admin container on the node running the NFS server, not the node running the nfs-client pod. To run this reproduction, you will also need the NFS CSI driver, which you can install using Helm:
```bash
helm upgrade --install --atomic \
  --repo https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts \
  --namespace kube-system \
  --version v4.6.0 \
  csi-driver-nfs \
  csi-driver-nfs
```
If you deploy this outside of the default namespace, adjust the server URL to point at the namespace you're deploying to: replace nfs.default.svc.cluster.local with nfs.<your-namespace>.svc.cluster.local.
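For example, with a hypothetical namespace named team-a, the CSI attributes in the PersistentVolume from the manifests below would look like this:

```yaml
# PersistentVolume excerpt only: the NFS Service DNS name follows the pattern
# nfs.<namespace>.svc.cluster.local. "team-a" is a placeholder namespace name.
csi:
  driver: nfs.csi.k8s.io
  volumeAttributes:
    server: nfs.team-a.svc.cluster.local
    share: /
```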
Resources:
NFS Server
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nfs-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: nfs
  serviceName: "nfs"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nfs
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/os
                    operator: In
                    values:
                      - linux
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - arm64
                      - amd64
      containers:
        - env:
            - name: SHARED_DIRECTORY
              value: /octopus
            - name: SYNC
              value: "true"
          image: octopusdeploy/nfs-server:1.0.1
          imagePullPolicy: IfNotPresent
          name: nfs-server
          ports:
            - containerPort: 2049
              protocol: TCP
          resources:
            requests:
              cpu: 50m
              memory: 50Mi
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /octopus
              name: octopus-volume
      restartPolicy: Always
      volumes:
        - emptyDir:
            sizeLimit: 10Gi
          name: octopus-volume
  updateStrategy:
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  name: nfs
spec:
  clusterIP: None
  ports:
    - name: nfs
      port: 2049
      protocol: TCP
      targetPort: 2049
  selector:
    app.kubernetes.io/name: nfs
  sessionAffinity: None
  type: ClusterIP
```
PV/PVC
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv-10gi
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 10Gi
  csi:
    driver: nfs.csi.k8s.io
    volumeAttributes:
      server: nfs.default.svc.cluster.local
      share: /
    volumeHandle: nfs.default.svc.cluster.local/octopus##
  mountOptions:
    - nfsvers=4.1
    - lookupcache=none
    - soft
    - timeo=50
    - retrans=4
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs-csi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc-10gi
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-csi
```
Client Pod
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-client
spec:
  selector:
    matchLabels:
      app: nfs-client
  template:
    metadata:
      labels:
        app: nfs-client
    spec:
      containers:
        - name: nfs-client
          image: alpine
          command: ["sh"]
          args:
            - -c
            - 'echo "echo \"hello world\"" > /octopus/runme.sh && chmod +x /octopus/runme.sh && sh -c "/octopus/runme.sh"'
          resources:
            limits:
              memory: "128Mi"
              cpu: "500m"
          volumeMounts:
            - mountPath: /octopus
              name: mount
      volumes:
        - name: mount
          persistentVolumeClaim:
            claimName: nfs-pvc-10gi
```
Thanks for the report; I am investigating, and I will let you know what I find out. In the meantime, I can offer some other persistent storage options, in case any of them would be helpful. You mention both self-hosted NFS and EFS. A few other possibilities you might consider:
- FSx for ONTAP, which can serve shares over NFS. I have been able to mount volumes I provisioned on this service from Bottlerocket-hosted containers in an EKS cluster.
- FSx for Lustre, a non-NFS file server. I have been able to mount shares both natively and using CSI drivers, also in an EKS cluster. I have not attempted to change permissions on files on either file server, nor execute files from them, so I can't guarantee that these will work for your application.
You may be running into a variation of the behavior discussed here:
> For overlayfs, the mounting process credentials are saved and used for subsequent access checks from other processes, so those credentials need to grant a superset of permissions.
- `nfsd` is running as a kernel thread with the `kernel_t` label
- it's serving files from a directory with the `data_t:s0:c432,c649` label (an overlayfs mount for a container)
- processes with the `kernel_t` label are blocked from executing files owned by containers
nfsd isn't actually trying to execute the binary itself (it's a kernel thread, it can't really do that); it's just having its permissions checked (because of overlayfs), and it doesn't have the execute permission, so the action is blocked.
One way to work around this might be to mount in a directory from the host's /local as a hostPath volume mount and use that as the NFS server root. That will avoid the overlayfs permission check that I suspect is causing this denial. (Other volume types should work too.)
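A rough sketch of what the StatefulSet's volume could look like in that case, with /local/octopus as a placeholder host directory:

```yaml
# Excerpt of the nfs-server pod spec: serve the share from a directory under
# the host's /local instead of an emptyDir, so no overlayfs mount is involved.
# "/local/octopus" is a placeholder path; any directory under /local should do.
volumes:
  - name: octopus-volume
    hostPath:
      path: /local/octopus
      type: DirectoryOrCreate
```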
If you can, we'd love to hear back how these suggestions are working (or not working) for you. Thanks!
Hi! Sorry for the late reply, for some reason, GitHub decided that I did not want to receive emails about this issue 🤦.
Thanks for the excellent suggestions about different RWX volume types, though since we need to support many other node types and environments, I'm uncertain whether they're suitable. The most promising so far is simply using hostPath, which I'll test now and get back to you with results.
I did assume that nfsd wasn't actually attempting to execute the file and that this was just an access check - thanks for linking me to the overlayfs behaviour, this connects many of the dots for me!
Unfortunately, we still get the same issue mounting from /local
The AVC Denial:
```
Aug 05 23:30:45 ip-10-0-42-10.ap-southeast-2.compute.internal audit[45476]: AVC avc: denied { execute } for pid=45476 comm="nfsd" name="exec.sh" dev="nvme1n1p1" ino=18270915 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:local_t:s0 tclass=file permissive=0
```
The file:
```
bash-5.1# ls -laZi ./test/
total 4
18270913 drwxr-xr-x. 2 root root system_u:object_r:local_t:s0 21 Aug 5 23:30 .
  457164 drwxr-xr-x. 5 root root system_u:object_r:local_t:s0 50 Aug 5 23:30 ..
18270915 -rwxr-xr-x. 1 root root system_u:object_r:local_t:s0 13 Aug 5 23:30 exec.sh
```
It seems that nfsd is still attempting to check the permissions - I'm not sure whether this is something I've done wrong in the mount. Any ideas?
> It seems that nfsd is still attempting to check the permissions [...]
I need to set up a repro case locally to try to understand what's going on with SELinux, but I expect it'll need a policy fix on the Bottlerocket side.
Hi Ben! I was wondering if there's anything I could do to help repro this issue locally, or if I can help with my existing repro at all?
Hey Liam - I've been able to repro the issue using the steps you provided. Thanks for the detailed instructions.
Despite what I wrote earlier, there doesn't seem to be any overlayfs involvement here. octopus-volume is just a directory under /var/lib/kubelet/pods labeled with the pod's SELinux pair and bind-mounted in:
```
# grep octopus /proc/$(pgrep nfsd.sh)/mountinfo
4327 4319 259:17 /var/lib/kubelet/pods/d98aa2fb-12c2-4e16-a5d0-e829c60a490f/volumes/kubernetes.io~empty-dir/octopus-volume /octopus rw,nosuid,nodev,noatime - xfs /dev/nvme1n1p1 rw,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=8,swidth=8,noquota
```
I prodded at it with ftrace:
```
# cd /sys/kernel/tracing
echo -n 10 > max_graph_depth
echo nfsd_permission > set_graph_function
echo -n function_graph > current_tracer
cat trace
```
... and it just looks like a straightforward SELinux permission check failure, where nfsd checks the inode permission, which checks the SELinux permission, which says that kernel_t can't execute a data_t file:
```
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
1) 2.200 us | nfsd_permission [nfsd]();
1) 0.610 us | nfsd_permission [nfsd]();
------------------------------------------
1) nfsd-7795 => nfsd-7794
------------------------------------------
1) 1.120 us | nfsd_permission [nfsd]();
1) | nfsd_permission [nfsd]() {
1) | inode_permission() {
1) 0.550 us | generic_permission();
1) | security_inode_permission() {
1) | selinux_inode_permission() {
1) 0.780 us | __inode_security_revalidate();
1) 0.530 us | __rcu_read_lock();
1) 0.540 us | avc_lookup();
1) 0.540 us | __rcu_read_unlock();
1) 4.650 us | }
1) 0.560 us | bpf_lsm_inode_permission();
1) 6.720 us | }
1) 8.740 us | }
1) 9.850 us | }
1) 0.630 us | nfsd_permission [nfsd]();
```
Unfortunately, I'm still not sure what the best way to fix this is.
Thanks for the update, Ben! I've been able to get this working by using a userspace NFS implementation (ganesha-nfs) instead of the kernel implementation, since the inode checks then happen in the context of the container instead of the kernel.
At this point, I think the only way this would work is if nfsd ran in a different context (preferably the container exporting the mount).
I don't know enough about SELinux to tell if that's a terrible idea or not, or if that's even possible. I think we can probably close this for now, with the understanding that userspace NFS implementations are preferred.
> I don't know enough about SELinux to tell if that's a terrible idea or not, or if that's even possible. I think we can probably close this for now, with the understanding that userspace NFS implementations are preferred.
I have a couple ideas that I'd like to explore, so I'm happy to keep it open until there's some kind of resolution.
For the first idea: the /opt/csi directory on the host is special-cased so that privileged containers can write to it, and some host programs can execute files there. This was added in #3779 to support the S3 CSI driver. Right now only init_t can execute the files, but we could potentially allow kernel_t to execute them as well. The catch would be that the NFS shares would all have to use a hostPath volume from under that directory, which would be annoying.
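For illustration only, that would mean backing the share with a volume like this (the /opt/csi/octopus subdirectory is a placeholder):

```yaml
# Hypothetical excerpt: root the NFS export under /opt/csi, the host path
# special-cased in #3779. "/opt/csi/octopus" is a placeholder subdirectory.
volumes:
  - name: octopus-volume
    hostPath:
      path: /opt/csi/octopus
      type: DirectoryOrCreate
```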
My other idea is to allow kernel_t the "execute" permission, but to have it trigger a transition to a different type, and then block that transition to prevent execution. Roughly:
```
; always change from "kernel_t" to "forbidden_t" when executing a "data_t" file
(typetransition kernel_t container_exec_o process forbidden_t)
; but don't actually allow this change to take place
(neverallow kernel_t forbidden_t (processes (transform)))
```
That would have the property that nfsd (which must run as kernel_t) would pass these inode permission checks, while still preventing the kernel from actually executing untrusted binaries (which is the policy objective, and which nfsd doesn't need to do). However, I need to write some test cases to be sure that it's doing the right thing, and still blocking what it's meant to block.
> My other idea is to allow kernel_t the "execute" permission, but to have it trigger a transition to a different type, and then block that transition to prevent execution.
That's an ingenious way to solve the problem! Hopefully it works - I think it's a better fix than forcing NFS to use hostPath volumes.
Thanks for your help with this, by the way. Investigating this problem has opened my eyes a lot to how SELinux and Bottlerocket work in general, and Bottlerocket is definitely becoming my distro of choice for EKS!
I have this working now, or at least I think I do. I need to write some additional test cases but hope to have the policy change up for review soon.
One surprise was that the kernel will silently fall back to the current label in some cases, per this code in selinux_bprm_creds_for_exec.
I caught this when running an automated test that checks for a container escape via /proc/sys/kernel/core_pattern - the transition was denied, but the test still failed because the file was executed under the original label anyway.
Fortunately SELinux also has an "execute with no transition" permission (exec_no_trans) that can be denied.
If I still find gaps, then I'll need to fix this in a different way, probably by using CONFIG_STATIC_USERMODEHELPER=y and adding a helper program to limit execution to specific programs or paths.
Similar issue. Switched to the Ganesha userspace NFS server. Works great.
The fix for this should be coming in 1.27.0, which is expected to be released this week.
> The fix for this should be coming in 1.27.0, which is expected to be released this week.
Thanks so much, Ben! I really appreciate your work on this, and it was great to see the SELinux changes that were required to get this working - it's really helped me understand transitions a lot more! If you happen to be in SLC for KubeCon, let me know and I'll pop by and say thanks in person :)