bottlerocket
Mkfs fails even from privileged containers
Hi,
we want to run Ondat on Bottlerocket, as we have a customer requirement to support it. Ondat is a CSI driver for Kubernetes that, among other things, implements a storage engine. Ondat orchestrates data, volume attachments, PVC mounts, etc. The Ondat storage engine runs as a DaemonSet in Kubernetes. It mounts the host filesystem at /var/lib/storageos, where the data from PVCs is persisted. That means the DaemonSet Pods need to be able to create special device files and create filesystems on top of those devices. We found that even though the DaemonSet Pod is running with:
securityContext:
  allowPrivilegeEscalation: true
  capabilities:
    add:
    - SYS_ADMIN
  privileged: true
Ondat cannot mkfs the device files.
We used the admin container to reproduce the same behaviour that the storage engine attempts and it looks like SELinux is blocking the permissions for it.
bash-5.1#
bash-5.1# mkfs.ext4 ./v.00000000-0000-0000-0000-000000001000
mke2fs 1.46.5 (30-Dec-2021)
mkfs.ext4: Permission denied while trying to determine filesystem size
The device file cannot be opened, thus the filesystem cannot be created.
[root@admin]# dd if=/.bottlerocket/rootfs/var/lib/storageos/volumes/v.00000000-0000-0000-0000-000000001000 of=/dev/null bs=4k count=1
dd: failed to open '/.bottlerocket/rootfs/var/lib/storageos/volumes/v.00000000-0000-0000-0000-000000001000': Permission denied
Context
- Image: bottlerocket-aws-k8s-1.21-x86_64-v1.9.0-159e4ced
- Cluster: bottlerocket on EKS installed with eksctl
What I expected to happen
I expect to be able to create devices and create filesystems. We think we have narrowed the issue to SELinux, so it would be great to know if what we need to do is possible using Bottlerocket.
Misc
We understand that this procedure must be possible as for instance EBS volumes are allowed to be attached to Bottlerocket and formatted.
Could you please help us understand what needs to be done?
Hello @Arau - thanks for filing an issue. ~~Unfortunately, this will not work the way you are expecting.~~
~~Part of the security features of Bottlerocket is it has a read-only root filesystem. You can read more about the design here: https://aws.amazon.com/blogs/opensource/security-features-of-bottlerocket-an-open-source-linux-based-operating-system/~~
~~This means you will not be able to write to anything under the /.bottlerocket/rootfs path.~~
Edit: Sorry, I was just looking at the root path. /.bottlerocket/rootfs/var apparently should work.
Can you check dmesg on an affected node for avc messages? Those correspond to SELinux denials and will help narrow down the issue.
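A rough sketch of how the denials could be pulled out of the kernel log. The helper name `avc_denials` is made up for illustration; it only filters text on stdin and assumes the usual `avc: denied` wording in audit lines.

```shell
# Hypothetical helper: print only SELinux AVC denial lines from
# kernel log text supplied on stdin. Prints nothing if there are none.
avc_denials() {
  grep -E 'avc:[[:space:]]+denied' || true
}

# Usage on a node (needs permission to read the kernel log):
#   dmesg | avc_denials
```

An empty result does not prove SELinux is uninvolved, since a process running with a highly privileged label would not generate denials in the first place.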
Generally I'd expect - and want! - CSI drivers to work on Bottlerocket, and as you say the EBS CSI driver does, so it should be possible here.
Also, typically from the admin container you would never see SELinux denials - the processes run with a highly privileged label for break-glass troubleshooting - so it's possible or even likely you won't have any avc denials.
I'd guess it's something else like the device cgroup allowlist blocking these device nodes.
Hi,
I am working with @Arau on this and have a single node cluster which is easier to gather dmesg entries from.
The only entries in dmesg are from tcmu:
[ 681.422346] SCSI subsystem initialized
[ 702.504179] scsi host0: TCM_Loopback
[ 702.546156] tcmu daemon: command reply support 1.
[ 702.556775] scsi host0: TCM_Loopback
[ 702.557557] scsi 0:0:1:0: Direct-Access LIO-ORG TCMU device 0002 PQ: 0 ANSI: 5
[ 702.562526] sd 0:0:1:0: [sda] 2048 512-byte logical blocks: (1.05 MB/1.00 MiB)
[ 702.562529] sd 0:0:1:0: [sda] 4096-byte physical blocks
[ 702.562570] sd 0:0:1:0: [sda] Write Protect is off
[ 702.562572] sd 0:0:1:0: [sda] Mode Sense: 2f 00 00 00
[ 702.562636] sd 0:0:1:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 702.562699] sd 0:0:1:0: [sda] Optimal transfer size 131072 bytes
[ 702.609528] sd 0:0:1:0: [sda] Attached SCSI disk
[ 702.659288] sd 0:0:1:0: [sda] Synchronizing SCSI cache
[ 702.818381] tcmu daemon: command reply support 1.
[ 703.110925] tcmu daemon: command reply support 1.
[ 817.011593] scsi host0: TCM_Loopback
[ 817.012524] scsi 0:0:1:0: Direct-Access LIO-ORG TCMU device 0002 PQ: 0 ANSI: 5
[ 817.013231] sd 0:0:1:0: [sda] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
[ 817.013233] sd 0:0:1:0: [sda] 4096-byte physical blocks
[ 817.013258] sd 0:0:1:0: [sda] Write Protect is off
[ 817.013259] sd 0:0:1:0: [sda] Mode Sense: 2f 00 00 00
[ 817.013313] sd 0:0:1:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 817.013359] sd 0:0:1:0: [sda] Optimal transfer size 131072 bytes
[ 817.049109] sd 0:0:1:0: [sda] Attached SCSI disk
If I run mkfs -t ext4 -b 4096 -D -F -E lazy_journal_init=1,lazy_itable_init=1 /var/lib/storageos/volumes/v.00000000-0000-0000-0000-000000001000 in the Ondat container I do not see any more entries in dmesg from SELinux.
As an experiment I tried to do this directly to the /dev/sda device and this worked:
# mkfs -t ext4 -b 4096 -D -F -E lazy_journal_init=1,lazy_itable_init=1 /dev/sda
mke2fs 1.45.6 (20-Mar-2020)
Discarding device blocks: done
Creating filesystem with 5242880 4k blocks and 1310720 inodes
Filesystem UUID: 17aa4119-f0f9-4734-b922-a002c43df710
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000
Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
The selinux contexts do seem to be different though:
[root@ip-192-168-76-215 volumes]# ls -Zal /dev/sda
brw-rw----. 1 root 993 system_u:object_r:any_t:s0 8, 0 Aug 3 14:52 /dev/sda
[root@ip-192-168-76-215 volumes]# ls -Zal /var/lib/storageos/volumes/v.00000000-0000-0000-0000-000000001000
brw-------. 1 root root system_u:object_r:local_t:s0 8, 0 Aug 3 14:40 /var/lib/storageos/volumes/v.00000000-0000-0000-0000-000000001000
These devices should be mapped to each other (major, minor device numbers).
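One way to confirm that two device nodes refer to the same device is to compare their major:minor numbers directly. This is only a sketch: `dev_numbers` and `same_device` are illustrative names, and it assumes a `stat` that supports the `%t` and `%T` format specifiers (GNU coreutils does).

```shell
# Print "major:minor" (in hex, as stat reports them) for a device node.
dev_numbers() {
  stat -c '%t:%T' "$1"
}

# Succeed when both paths report the same major:minor pair.
# Note: this does not check that both are block (rather than character) devices.
same_device() {
  a=$(dev_numbers "$1") || return 2
  b=$(dev_numbers "$2") || return 2
  [ "$a" = "$b" ]
}

# Usage:
#   same_device /dev/sda /var/lib/storageos/volumes/v.00000000-0000-0000-0000-000000001000 && echo same
```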
I did try to relabel the device with chcon, and this did log an SELinux denial, so SELinux auditing does seem to be working:
[ 2233.389200] audit: type=1400 audit(1659539054.983:7): avc: denied { relabelfrom } for pid=20357 comm="chcon" name="v.00000000-0000-0000-0000-000000001000" dev="nvme2n1p1" ino=133655 scontext=system_u:system_r:control_t:s0-s0:c0.c1023 tcontext=system_u:object_r:local_t:s0 tclass=blk_file permissive=0
Are there additional logs we can enable to debug this?
Chris
Hi @bcressey,
I think we have found why Ondat cannot write. The DaemonSet pod mounts /var/lib/storageos with Bidirectional mount propagation from the container to the host filesystem. When a device needs to be created, Ondat creates it at /var/lib/storageos/volumes. However, the initial mount used nodev. As I understand it, the nodev flag means that device files on that mounted filesystem are not interpreted as devices, so they cannot be opened.
➜ bottlerocket k -n storageos exec -it storageos-node-ns9bt -- grep lib/storageos /proc/mounts
Defaulted container "storageos" out of: storageos, csi-driver-registrar, csi-liveness-probe, init (init)
/dev/nvme1n1p1 /var/lib/storageos ext4 rw,seclabel,nosuid,nodev,noatime 0 0
Is there a way for us to be able to execute that mount without that flag? I am not sure if it is containerd configuration that can be changed, or if we can apply some configuration to the kubelet.
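For reference, the check above can be scripted. This is a minimal sketch: `has_nodev` is a hypothetical helper name, and the optional second argument (a file in /proc/mounts format) exists only so the function can be exercised without a real mount.

```shell
# Hypothetical helper: succeed if the mount at $1 carries the nodev option.
# Reads a mounts table in /proc/mounts format from $2 (default: /proc/mounts).
has_nodev() {
  awk -v mp="$1" '
    $2 == mp { opts = $4 }   # keep the last matching entry (the topmost mount)
    END {
      n = split(opts, o, ",")
      for (i = 1; i <= n; i++) if (o[i] == "nodev") exit 0
      exit 1                 # not mounted, or nodev not set
    }
  ' "${2:-/proc/mounts}"
}

# Usage inside the container:
#   has_nodev /var/lib/storageos && echo "device files here will not be usable"
```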
If the volumes directory doesn't contain other state, just device nodes, it might work to mount an additional emptyDir volume with the medium: Memory option set, so that a new tmpfs without the nodev option is placed there.
Otherwise (with CAP_SYS_ADMIN) you should be able to remount the bind mount for the directory with the dev option:
mount -o remount,dev /var/lib/storageos
Hi,
I would like to give an update on the issue. We have been working on different ways to get the mount options set correctly. The idea of the memory medium is quite interesting, but it opened a can of worms with other dependencies. We also tried symlinks to circumvent the mount inheritance, but since '/' is read-only, that would not work. The emptyDir couldn't work because the bind mount where the Ondat container stores its data needs to be a fixed location. It cannot be an arbitrary directory or tmp dir on the host, because among other things both the devices and the data reside at specific locations in the filesystem.
The remount works when executed from inside the Ondat container because, as you mentioned, we run with CAP_SYS_ADMIN. To productionise this, we tried to run it in an init container that shares most of its volume mounts with the main container. However, that didn't work. I thought that an init container would share the mount table on the same namespaced filesystem, but for a reason I don't fully understand yet, it doesn't. We might be missing some system bind mounts in the init container to persist the change on the host.
Finally, we decided to mitigate the issue by running the remount from the main container code at bootstrap. That is effective and successful even though it is not ideal IMO. It would be best to be able to tune or tweak those flags from a configuration and declarative point of view.
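For anyone hitting the same problem, the bootstrap step amounts to something like the sketch below. This is not the actual Ondat code. `ensure_dev_allowed`, `MOUNT_BIN`, and `MOUNTS_FILE` are illustrative names; the two variables only exist to allow a dry run and default to the real `mount` command and /proc/mounts. It assumes the process has CAP_SYS_ADMIN.

```shell
# Hypothetical bootstrap step: remount a mount point with the dev option
# if (and only if) it currently carries nodev.
ensure_dev_allowed() {
  mp=$1
  mounts=${MOUNTS_FILE:-/proc/mounts}
  # Options of the last (topmost) entry for this mount point.
  opts=$(awk -v mp="$mp" '$2 == mp { o = $4 } END { print o }' "$mounts")
  case ",$opts," in
    *,nodev,*) ${MOUNT_BIN:-mount} -o remount,dev "$mp" ;;
    *) return 0 ;;  # nothing to do: not mounted, or nodev already absent
  esac
}

# Usage at container start-up:
#   ensure_dev_allowed /var/lib/storageos
```

Checking first keeps the step idempotent, so restarting the container does not issue a needless remount.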
It looks like things are working now, based on the last message. Though maybe not as smoothly as originally hoped.
@bcressey any thoughts on the last comment and if there is anything that could be done from the Bottlerocket side to improve this? If not, I think we can close this issue if no further action can be taken on it.