Cannot execute binaries stored in an NFS Server running on a Bottlerocket node
Image I'm using:
Bottlerocket OS 1.20.4 (aws-k8s-1.30)
Context: We have some software that runs multiple pods for multiple stages in a pipeline. To complete this dynamically and allow retries on specific steps, we spawn short-lived pods that connect to an NFS server running in-cluster for their ephemeral data. A typical installation starts with the orchestrator and the NFS server. When the orchestrator receives a piece of work, it will:
- Create a subfolder in the NFS server, and download any required executables to it
- Spawn a pod, which runs the executables in the subfolder
- Once the pod has finished running, the subfolder gets removed
The NFS server is a simple variant of this alpine server.
What I expected to happen: When running an NFS server in a container on Bottlerocket, you are able to execute files on the share from a mount in a different container.
What actually happened:
The nfsd process is denied execute access, as shown in this AVC denial log:
```
Jul 26 01:20:06 ip-10-0-19-55.ap-southeast-2.compute.internal audit[2830356]: AVC avc: denied { execute } for pid=2830356 comm="nfsd" name="bootstrapRunner" dev="nvme1n1p1" ino=151427631 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:data_t:s0:c432,c649 tclass=file permissive=0
```
From what I can tell, this is because the process is running as a kernel task, even though it's actually exposing data from a share inside a container. My current line of thinking is that, because the container is privileged, it hooks into the kernel-level NFS support. The nfsd processes have the system_u:system_r:kernel_t:s0 SELinux context and are not children of the NFS server pod.
What I've tried to do to work around the problem:
I've attempted to work around this problem by using EFS rather than hosting NFS locally, but when using access points and dynamically provisioned volumes, chmod commands get permission denied, which breaks many scripts (and even tar in some cases).
How to reproduce the problem:
To reproduce the problem, create the resources I've added below in a Kubernetes cluster running Bottlerocket OS 1.20.4. I have been doing this in an AWS EKS cluster.
You will be able to see the logs after running logdog from the admin container on the node running the NFS server, not the node running the nfs-client pod. To run this reproduction, you will also need the NFS CSI driver, which you can install using Helm:
```bash
helm upgrade --install --atomic \
  --repo https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts \
  --namespace kube-system \
  --version v4.6.0 \
  csi-driver-nfs \
  csi-driver-nfs
```
If you deploy this outside of the default namespace, adjust the server URL to point at the namespace you're deploying to: replace nfs.default.svc.cluster.local with nfs.<your-namespace>.svc.cluster.local.
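For example, with a hypothetical namespace named team-a, the CSI attributes in the PersistentVolume from the manifests below would look like this:

```yaml
# PersistentVolume excerpt only: the NFS Service DNS name follows the pattern
# nfs.<namespace>.svc.cluster.local. "team-a" is a placeholder namespace name.
csi:
  driver: nfs.csi.k8s.io
  volumeAttributes:
    server: nfs.team-a.svc.cluster.local
    share: /
```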
Resources:
NFS Server
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nfs-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: nfs
  serviceName: "nfs"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nfs
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/os
                    operator: In
                    values:
                      - linux
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - arm64
                      - amd64
      containers:
        - env:
            - name: SHARED_DIRECTORY
              value: /octopus
            - name: SYNC
              value: "true"
          image: octopusdeploy/nfs-server:1.0.1
          imagePullPolicy: IfNotPresent
          name: nfs-server
          ports:
            - containerPort: 2049
              protocol: TCP
          resources:
            requests:
              cpu: 50m
              memory: 50Mi
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /octopus
              name: octopus-volume
      restartPolicy: Always
      volumes:
        - emptyDir:
            sizeLimit: 10Gi
          name: octopus-volume
  updateStrategy:
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  name: nfs
spec:
  clusterIP: None
  ports:
    - name: nfs
      port: 2049
      protocol: TCP
      targetPort: 2049
  selector:
    app.kubernetes.io/name: nfs
  sessionAffinity: None
  type: ClusterIP
```
PV/PVC
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv-10gi
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 10Gi
  csi:
    driver: nfs.csi.k8s.io
    volumeAttributes:
      server: nfs.default.svc.cluster.local
      share: /
    volumeHandle: nfs.default.svc.cluster.local/octopus##
  mountOptions:
    - nfsvers=4.1
    - lookupcache=none
    - soft
    - timeo=50
    - retrans=4
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs-csi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc-10gi
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-csi
```
Client Pod
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-client
spec:
  selector:
    matchLabels:
      app: nfs-client
  template:
    metadata:
      labels:
        app: nfs-client
    spec:
      containers:
        - name: nfs-client
          image: alpine
          command: ["sh"]
          args:
            - -c
            - 'echo "echo \"hello world\"" > /octopus/runme.sh && chmod +x /octopus/runme.sh && sh -c "/octopus/runme.sh"'
          resources:
            limits:
              memory: "128Mi"
              cpu: "500m"
          volumeMounts:
            - mountPath: /octopus
              name: mount
      volumes:
        - name: mount
          persistentVolumeClaim:
            claimName: nfs-pvc-10gi
```
Thanks for the report; I am investigating, and I will let you know what I find out. In the meantime, I can offer some other persistent storage options, in case any of them would be helpful. You mention both self-hosted NFS and EFS. A few other possibilities you might consider:
- FSx for ONTAP, which can serve shares over NFS. I have been able to mount volumes I provisioned on this service from Bottlerocket-hosted containers in an EKS cluster.
- FSx for Lustre, a non-NFS file server. I have been able to mount shares both natively and using CSI drivers, also in an EKS cluster. I have not attempted to change permissions on files on either file server, nor execute files from them, so I can't guarantee that these will work for your application.
You may be running into a variation of the behavior discussed here:
> For overlayfs, the mounting process credentials are saved and used for subsequent access checks from other processes, so those credentials need to grant a superset of permissions.
- `nfsd` is running as a kernel thread with the `kernel_t` label
- it's serving files from a directory with the `data_t:s0:c432,c649` label (an overlayfs mount for a container)
- processes with the `kernel_t` label are blocked from executing files owned by containers
nfsd isn't actually trying to execute the binary itself (it's a kernel thread, it can't really do that); it's just having its permissions checked (because of overlayfs), and it doesn't have the execute permission, so the action is blocked.
One way to work around this might be to mount in a directory from the host's /local as a hostPath volume mount and use that as the NFS server root. That will avoid the overlayfs permission check that I suspect is causing this denial. (Other volume types should work too.)
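A rough sketch of what the StatefulSet's volume could look like in that case, with /local/octopus as a placeholder host directory:

```yaml
# Excerpt of the nfs-server pod spec: serve the share from a directory under
# the host's /local instead of an emptyDir, so no overlayfs mount is involved.
# "/local/octopus" is a placeholder path; any directory under /local should do.
volumes:
  - name: octopus-volume
    hostPath:
      path: /local/octopus
      type: DirectoryOrCreate
```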
If you can, we'd love to hear back how these suggestions are working (or not working) for you. Thanks!
Hi! Sorry for the late reply, for some reason, GitHub decided that I did not want to receive emails about this issue 🤦.
Thanks for the excellent suggestions about different RWX volume types, though since we need to support many other node types and environments, I'm uncertain whether they're suitable. The most promising so far is simply using hostPath, which I'll test now and get back to you with results.
I did assume that nfsd wasn't actually attempting to execute the file and that this was just an access check - thanks for linking me to the overlayfs behaviour, this connects many of the dots for me!
Unfortunately, we still get the same issue mounting from /local
The AVC Denial:
```
Aug 05 23:30:45 ip-10-0-42-10.ap-southeast-2.compute.internal audit[45476]: AVC avc: denied { execute } for pid=45476 comm="nfsd" name="exec.sh" dev="nvme1n1p1" ino=18270915 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:local_t:s0 tclass=file permissive=0
```
The file:
```
bash-5.1# ls -laZi ./test/
total 4
18270913 drwxr-xr-x. 2 root root system_u:object_r:local_t:s0 21 Aug 5 23:30 .
  457164 drwxr-xr-x. 5 root root system_u:object_r:local_t:s0 50 Aug 5 23:30 ..
18270915 -rwxr-xr-x. 1 root root system_u:object_r:local_t:s0 13 Aug 5 23:30 exec.sh
```
It seems that nfsd is still attempting to check the permissions - I'm not sure whether this is something I've done wrong in the mount. Any ideas?
> It seems that nfsd is still attempting to check the permissions [...]
I need to set up a repro case locally to try to understand what's going on with SELinux, but I expect it'll need a policy fix on the Bottlerocket side.
Hi Ben! I was wondering if there's anything I could do to help repro this issue locally, or if I can help with my existing repro at all?
Hey Liam - I've been able to repro the issue using the steps you provided. Thanks for the detailed instructions.
Despite what I wrote earlier, there doesn't seem to be any overlayfs involvement here. octopus-volume is just a directory under /var/lib/kubelet/pods labeled with the pod's SELinux pair and bind-mounted in:
```
# grep octopus /proc/$(pgrep nfsd.sh)/mountinfo
4327 4319 259:17 /var/lib/kubelet/pods/d98aa2fb-12c2-4e16-a5d0-e829c60a490f/volumes/kubernetes.io~empty-dir/octopus-volume /octopus rw,nosuid,nodev,noatime - xfs /dev/nvme1n1p1 rw,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=8,swidth=8,noquota
```
I prodded at it with ftrace:
```
# cd /sys/kernel/tracing
echo -n 10 > max_graph_depth
echo nfsd_permission > set_graph_function
echo -n function_graph > current_tracer
cat trace
```
... and it just looks like a straightforward SELinux permission check failure, where nfsd checks the inode permission, which checks the SELinux permission, which says that kernel_t can't execute a data_t file:
```
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
1) 2.200 us | nfsd_permission [nfsd]();
1) 0.610 us | nfsd_permission [nfsd]();
------------------------------------------
1) nfsd-7795 => nfsd-7794
------------------------------------------
1) 1.120 us | nfsd_permission [nfsd]();
1) | nfsd_permission [nfsd]() {
1) | inode_permission() {
1) 0.550 us | generic_permission();
1) | security_inode_permission() {
1) | selinux_inode_permission() {
1) 0.780 us | __inode_security_revalidate();
1) 0.530 us | __rcu_read_lock();
1) 0.540 us | avc_lookup();
1) 0.540 us | __rcu_read_unlock();
1) 4.650 us | }
1) 0.560 us | bpf_lsm_inode_permission();
1) 6.720 us | }
1) 8.740 us | }
1) 9.850 us | }
1) 0.630 us | nfsd_permission [nfsd]();
```
Unfortunately, I'm still not sure what the best way to fix this is.
Thanks for the update, Ben! I've been able to get this working by using a userspace NFS implementation (ganesha-nfs) instead of the kernel implementation, since the inode checks then happen in the context of the container instead of the kernel.
At this point, I think the only way this would work is if nfsd ran in a different context (preferably the container exporting the mount).
I don't know enough about SELinux to tell if that's a terrible idea or not, or if that's even possible. I think we can probably close this for now, with the understanding that userspace NFS implementations are preferred.
> I don't know enough about SELinux to tell if that's a terrible idea or not, or if that's even possible. I think we can probably close this for now, with the understanding that userspace NFS implementations are preferred.
I have a couple ideas that I'd like to explore, so I'm happy to keep it open until there's some kind of resolution.
For the first idea: the /opt/csi directory on the host is special-cased so that privileged containers can write to it, and some host programs can execute files there. This was added in #3779 to support the S3 CSI driver. Right now only init_t can execute the files, but we could potentially allow kernel_t to execute them as well. The catch would be that the NFS shares would all have to use a hostPath volume from under that directory, which would be annoying.
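For illustration only, that would mean backing the share with a volume like this (the /opt/csi/octopus subdirectory is a placeholder):

```yaml
# Hypothetical excerpt: root the NFS export under /opt/csi, the host path
# special-cased in #3779. "/opt/csi/octopus" is a placeholder subdirectory.
volumes:
  - name: octopus-volume
    hostPath:
      path: /opt/csi/octopus
      type: DirectoryOrCreate
```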
My other idea is to allow kernel_t the "execute" permission, but to have it trigger a transition to a different type, and then block that transition to prevent execution. Roughly:
```
; always change from "kernel_t" to "forbidden_t" when executing a "data_t" file
(typetransition kernel_t container_exec_o process forbidden_t)
; but don't actually allow this change to take place
(neverallow kernel_t forbidden_t (processes (transform)))
```
That would have the property that nfsd (which must run as kernel_t) would pass these inode permission checks, while still preventing the kernel from actually executing untrusted binaries (which is the policy objective, and which nfsd doesn't need to do). However, I need to write some test cases to be sure that it's doing the right thing, and still blocking what it's meant to block.
> My other idea is to allow kernel_t the "execute" permission, but to have it trigger a transition to a different type, and then block that transition to prevent execution.
That's an ingenious way to solve the problem! Hopefully it works - I think it's a better fix than forcing NFS to use hostPath volumes.
Thanks for your help with this, by the way. Investigating this problem has opened my eyes a lot to how SELinux and Bottlerocket work in general, and Bottlerocket is definitely becoming my distro of choice for EKS!
I have this working now, or at least I think I do. I need to write some additional test cases but hope to have the policy change up for review soon.
One surprise was that the kernel will silently fall back to the current label in some cases, per this code in selinux_bprm_creds_for_exec.
I caught this when running an automated test that checks for a container escape via /proc/sys/kernel/core_pattern - the transition was denied, but the test still failed because the file was executed under the original label anyway.
Fortunately SELinux also has an "execute with no transition" permission (exec_no_trans) that can be denied.
If I still find gaps, then I'll need to fix this in a different way, probably by using CONFIG_STATIC_USERMODEHELPER=y and adding a helper program to limit execution to specific programs or paths.
Similar issue. Switched to the Ganesha userspace NFS server. Works great.
The fix for this should be coming in 1.27.0, which is expected to be released this week.
> The fix for this should be coming in 1.27.0, which is expected to be released this week.
Thanks so much, Ben! I really appreciate your work on this, and it was great to see the SELinux changes that were required to get this working - it's really helped me understand transitions a lot more! If you happen to be in SLC for KubeCon, let me know and I'll pop by and say thanks in person :)