"privileged: true" in pod spec clobbers SELinux options
Image I'm using:
aws-k8s-1.28
What I expected to happen: I ran a pod with this security context:
securityContext:
  privileged: true
  seLinuxOptions:
    type: super_t
I expected the pod's process to have the super_t label on the running system.
What actually happened:
The pod's process had the control_t label instead.
How to reproduce the problem: See above.
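For a self-contained reproduction, something like the following pod manifest might help. It is only a sketch: the pod name, container name, and busybox image are assumptions, not taken from the setup above. The container prints its own SELinux context by reading /proc/self/attr/current and then exits.

```yaml
# Hypothetical reproduction pod; the names and the image are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: selinux-label-test
spec:
  restartPolicy: Never
  containers:
  - name: show-label
    image: busybox
    # Print the SELinux context of the container's own process.
    command: ["cat", "/proc/self/attr/current"]
    securityContext:
      privileged: true
      seLinuxOptions:
        type: super_t
```

Once the pod has completed, `kubectl logs selinux-label-test` should show the context the process actually ran with. With the behavior described in this issue, the type would be control_t rather than super_t.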
This happens because the SELinux label is removed by containerd's CRI implementation if the container is privileged. This is similar to how seccomp filters are treated.
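Normally this is fine, since privileged: true implies "all the privileges" on most distros. That is not the case on Bottlerocket, where the privileged container ends up with the control_t label rather than the requested super_t.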
The workaround is to avoid specifying privileged: true in the security context, and to instead list out everything that is implied by that:
securityContext:
  allowPrivilegeEscalation: true
  capabilities:
    add:
    - AUDIT_CONTROL
    - BLOCK_SUSPEND
    - DAC_READ_SEARCH
    - IPC_LOCK
    - IPC_OWNER
    - LEASE
    - LINUX_IMMUTABLE
    - MAC_ADMIN
    - MAC_OVERRIDE
    - NET_ADMIN
    - NET_BROADCAST
    - SYSLOG
    - SYS_ADMIN
    - SYS_BOOT
    - SYS_MODULE
    - SYS_NICE
    - SYS_PACCT
    - SYS_PTRACE
    - SYS_RAWIO
    - SYS_RESOURCE
    - SYS_TIME
    - SYS_TTY_CONFIG
    - WAKE_ALARM
  seccompProfile:
    type: Unconfined
  seLinuxOptions:
    type: super_t
This works unless the container needs access to host devices. Right now, the device cgroup allows all devices only for privileged containers, and there's no way to specify the equivalent in the pod spec without privileged: true.