bottlerocket
EKS PersistentVolume accessMode ReadWriteOnce doesn't function correctly
Image I'm using:
bottlerocket-aws-k8s-1.22-x86_64-v1.9.2-b8074d44
What I expected to happen:
According to the PersistentVolume access modes spec:
ReadWriteOnce the volume can be mounted as read-write by a single node. ReadWriteOnce access mode still can allow multiple pods to access the volume when the pods are running on the same node.
I expect multiple pods on the same node to be able to read/write to the same PV in parallel (This behavior is implemented correctly in EKS optimized Amazon Linux 2 images).
What actually happened:
The volume was mounted successfully into multiple pods on the same node (meaning the pods were scheduled and running in parallel).
Only the first pod can actually read/write the contents of the volume; the later pods get permission denied.
How to reproduce the problem:
Apply these manifests to an EKS cluster with the EBS provisioner installed, or create the PersistentVolume yourself. You must use nodes with Bottlerocket as the backing image, ofc!
Observe the logs of the 2 pods: one successfully lists the contents of the volume, the other gets permission denied.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: debug-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ebs-sc
---
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod-1
spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: debug-pvc
  containers:
    - name: debug
      image: alpine:3
      volumeMounts:
        - mountPath: /data
          name: data
      command: ["/bin/sh"]
      args: ["-c", "while true; do ls -al /data && echo ----- && sleep 5; done"]
---
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod-2
spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: debug-pvc
  containers:
    - name: debug
      image: alpine:3
      volumeMounts:
        - mountPath: /data
          name: data
      command: ["/bin/sh"]
      args: ["-c", "while true; do ls -al /data && echo ----- && sleep 5; done"]
Potentially related to https://github.com/bottlerocket-os/bottlerocket/discussions/1747
I expect you're correct that this is related to #1747. dmesg on the node would show a lot of AVC denials.
EBS volume mounts currently get labeled with the MCS pair for the first pod, and subsequent pods won't be able to write to them. There's not really a difference between "two unrelated, unprivileged pods on the host mount the same volume" and "one untrusted, unprivileged pod mounts the volume for another unprivileged pod and modifies it unexpectedly."
That means ReadWriteOnce volumes on Bottlerocket are going to act like ReadWriteOncePod volumes instead.
You can work around this by specifying seLinuxOptions so that all the pods that are meant to share a volume end up with the same process label:
seLinuxOptions:
  level: s0:c1,c2,c3
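For context, here is a minimal sketch of where that fragment could sit in one of the debug pods from the report. It assumes the pod-level securityContext is used (the same field also exists per container), and the level value is just the example one above, not a required value. Every pod that shares the volume would need the identical level.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod-1
spec:
  # Assumption: set at pod level so it applies to all containers in the pod.
  securityContext:
    seLinuxOptions:
      # Example value from the reply above; use the same level on every
      # pod that mounts debug-pvc (e.g. debug-pod-2 as well).
      level: s0:c1,c2,c3
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: debug-pvc
  containers:
    - name: debug
      image: alpine:3
      volumeMounts:
        - mountPath: /data
          name: data
      command: ["/bin/sh"]
      args: ["-c", "while true; do ls -al /data && echo ----- && sleep 5; done"]
```

Applying the same securityContext block to debug-pod-2 should give both pods the same process label, which is what the workaround relies on.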
It would be better for ReadWriteOnce volumes to behave as documented, but that's challenging for backwards compatibility reasons. I'd want the automatic behavior for ReadWriteOncePod volumes as a replacement, but it doesn't look like that will happen even with the SELinuxMountReadWriteOncePod feature enabled, since that requires pods to opt in by specifying the label.