
EKS PersistentVolume accessMode ReadWriteOnce doesn't function correctly

Open rkt2spc opened this issue 1 year ago • 2 comments

Image I'm using:

bottlerocket-aws-k8s-1.22-x86_64-v1.9.2-b8074d44

What I expected to happen:

According to the PersistentVolume access modes spec:

ReadWriteOnce: the volume can be mounted as read-write by a single node. ReadWriteOnce access mode still can allow multiple pods to access the volume when the pods are running on the same node.

I expect multiple pods on the same node to be able to read from and write to the same PV in parallel. (This behavior is implemented correctly in the EKS-optimized Amazon Linux 2 images.)

What actually happened:

The volume was mounted successfully into multiple pods on the same node (meaning the pods were scheduled and running in parallel).

Only the first pod can actually read or write the contents of the volume; the later pods get permission denied.

How to reproduce the problem:

Apply these manifests on an EKS cluster with the EBS provisioner installed, or create the PersistentVolume yourself. The nodes must use Bottlerocket as the backing image.

Observe the logs of the 2 pods: one successfully lists the contents of the volume, and the other gets permission denied.

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: debug-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ebs-sc
---
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod-1
spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: debug-pvc
  containers:
    - name: debug
      image: alpine:3
      volumeMounts:
        - mountPath: /data
          name: data
      command: ["/bin/sh"]
      args: ["-c", "while true; do ls -al /data && echo ----- && sleep 5; done"]
---
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod-2
spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: debug-pvc
  containers:
    - name: debug
      image: alpine:3
      volumeMounts:
        - mountPath: /data
          name: data
      command: ["/bin/sh"]
      args: ["-c", "while true; do ls -al /data && echo ----- && sleep 5; done"]

rkt2spc • Sep 14 '22 14:09

Potentially related to https://github.com/bottlerocket-os/bottlerocket/discussions/1747

rkt2spc • Sep 14 '22 15:09

I expect you're correct that this is related to #1747. dmesg on the node would show a lot of AVC denials.

EBS volume mounts currently get labeled with the MCS pair for the first pod, and subsequent pods won't be able to write to them. From SELinux's perspective, there's not really a difference between "two unrelated, unprivileged pods on the host mount the same volume" and "one untrusted, unprivileged pod mounts the volume for another unprivileged pod and modifies it unexpectedly."

That means ReadWriteOnce volumes on Bottlerocket are going to act like ReadWriteOncePod volumes instead.

You can work around this by specifying seLinuxOptions so that all the pods that are meant to share a volume end up with the same process label:

seLinuxOptions:
  level: s0:c1,c2,c3
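
For context, here is a sketch of where that fragment could go in the first pod from the reproduction manifests. It assumes the label is set at the pod level under spec.securityContext, and it reuses the example level from the fragment above; neither the placement nor the category values are prescribed by this thread. debug-pod-2 would need the identical securityContext block so that both pods end up with the same process label.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod-1
spec:
  # Assumption: pod-level securityContext, so the level applies to every container in the pod.
  securityContext:
    seLinuxOptions:
      # Example MCS level taken from the comment above; every pod sharing the volume must use the same value.
      level: "s0:c1,c2,c3"
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: debug-pvc
  containers:
    - name: debug
      image: alpine:3
      volumeMounts:
        - mountPath: /data
          name: data
      command: ["/bin/sh"]
      args: ["-c", "while true; do ls -al /data && echo ----- && sleep 5; done"]
```

Note that giving pods the same level removes the MCS separation between them, so this only makes sense for pods that are meant to trust each other with the shared data.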

It would be better for ReadWriteOnce volumes to behave as documented, but that's challenging for backwards compatibility reasons. I'd want the automatic behavior for ReadWriteOncePod volumes as a replacement, but it doesn't look like that will happen even with the SELinuxMountReadWriteOncePod feature enabled, since that requires pods to opt in by specifying the label.

bcressey • Sep 14 '22 23:09