CephFS keyring requires nonsensically enormous and insecure privileges to work
Describe the bug
Right now (I tried many combinations) the smallest caps that work with the CephFS storage class are these (a `ceph auth` sketch of creating this keyring follows the list):
- mon: 'allow r'
- osd: 'allow rw tag cephfs metadata=fs_k8s, allow rw tag cephfs data=fs_k8s'
- mds: 'allow r fsname=fs_k8s path=/volumes, allow rws fsname=fs_k8s path=/volumes/k8s_pb'
- mgr: 'allow rw'
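For reference, a minimal sketch of the `ceph auth` command that produces such a keyring; the client name `client.k8s-csi` is illustrative, the caps are copied from the list above:

```shell
# Sketch: create the smallest keyring that the CSI driver currently accepts
# (client name is illustrative, caps copied from the list above).
ceph auth get-or-create client.k8s-csi \
  mon 'allow r' \
  osd 'allow rw tag cephfs metadata=fs_k8s, allow rw tag cephfs data=fs_k8s' \
  mds 'allow r fsname=fs_k8s path=/volumes, allow rws fsname=fs_k8s path=/volumes/k8s_pb' \
  mgr 'allow rw'
```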
That is with a dedicated FS namespace "fs_k8s", which is however intended to be shared by multiple separate Kubernetes clusters.
Removing or reducing any of these caps in any way results in errors like:
Warning ProvisioningFailed 2m2s (x13 over 16m) cephfs.csi.ceph.com_ceph-csi-cephfs-provisioner-756d7bb54f-z5pr7_eda038b5-5ace-4e38-940b-68bfe6c76e31 failed to provision volume with StorageClass "ceph-cephfs-sc": rpc error: code = Internal desc = rados: ret=-1, Operation not permitted
At a low level, those permissions grant:
- Full MDS read access to the entire /volumes tree
- Full access to the MGR, granting the ability to manipulate ANY FS namespace on the cluster and to use many other "admin-only" features
- Full low-level (rados) access to all data pools of the given FS namespace (we have a dedicated OSD pool for each k8s cluster, but this keyring gives access to all of them). And, equally bad, full unrestricted access to the metadata pool (which seems completely unnecessary), so this keyring is technically capable of reading or modifying the data of any other cluster or FS namespace user.
This is an enormous security hole that makes isolation within the same FS namespace impossible. The only way to work around it is to install a dedicated Ceph cluster for each CephFS CSI consumer.
You can also create a dedicated FS namespace with its own MDS, but that still doesn't prevent the CSI keyring from abusing the MGR rw caps.
Why are such enormous privileges needed? It is perfectly possible to work with CephFS without any access to the metadata pool (not even read-only is needed), as only the MDS is supposed to access it. RW OSD access is only needed for the data pools used by the folders that the cluster's subvolume group is mapped to; there is no need to grant all of them.
MGR rw caps are probably needed to access the MGR API for subvolume management, but most of those operations can be handled in alternative ways, for example .snap folders for snapshot creation.
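For instance, snapshots can already be taken by any client with plain filesystem access to the directory (the `s` flag in the mds cap above), with no mgr involvement at all; mount point and snapshot name below are illustrative:

```shell
# Snapshots via the .snap directory only need MDS/OSD access to the path
# (the "s" flag in the mds cap); no mgr caps are involved.
# Mount point and snapshot name are illustrative.
mkdir /mnt/cephfs/volumes/k8s_pb/mydata/.snap/backup-2024-05-01
# and removing it again:
rmdir /mnt/cephfs/volumes/k8s_pb/mydata/.snap/backup-2024-05-01
```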
In short, the unnecessary permissions are (a sketch of a reduced keyring follows the list):
- mgr: no need for rw at all
- mds: no need for r on the entire /volumes
- osd: no need for any access to the metadata pool or to unrelated data pools of the FS namespace
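A minimal sketch of the reduced keyring this argues for, reusing the fs_k8s / k8s_pb names from the caps above; the client and data pool names are illustrative, and this is what should suffice, not something that currently works with the provisioner:

```shell
# Sketch of the caps the report argues should be sufficient
# (does NOT currently work with the ceph-csi provisioner).
# Client and data pool names are illustrative.
ceph auth get-or-create client.k8s-csi-restricted \
  mon 'allow r' \
  mds 'allow rws fsname=fs_k8s path=/volumes/k8s_pb' \
  osd 'allow rw pool=fs_k8s.k8s_pb.data'
```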
This is a big security obstacle if you want to create a secure environment.
Environment details
- Image/version of Ceph CSI driver : 3.8.1
- Helm chart version :
- Kernel version : 5.15.0-206.153.7.1
- Mounter used for mounting PVC (for cephFS its `fuse` or `kernel`, for rbd its `krbd` or `rbd-nbd`) : fuse
- Kubernetes cluster version : v1.25.16
- Ceph cluster version : 18.2.2
Steps to reproduce
Steps to reproduce the behavior:
Try to create a keyring that is restricted to a specific data pool only, with no access to the metadata pool or the mgr. CephFS is mountable and usable just fine with such a keyring (see the sketch below), but the CephFS storage class is unusable (only "permission denied" for everything).
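A sketch of such a check, assuming an illustrative monitor address, mount point, and the restricted client name from the example above:

```shell
# A plain CephFS mount with the restricted keyring works fine
# (monitor address, client name, secret file and mount point are illustrative):
mount -t ceph 10.0.0.1:6789:/volumes/k8s_pb /mnt/test \
  -o name=k8s-csi-restricted,secretfile=/etc/ceph/k8s-csi-restricted.secret
touch /mnt/test/hello && ls -l /mnt/test
# ...yet the same keyring used by the ceph-csi provisioner only produces
# "rados: ret=-1, Operation not permitted".
```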
Actual results
Getting permission denied unless the keyring has almost admin-like caps
Expected behavior
Storage class should not require admin-like caps to work with CephFS. Regular restricted caps should be enough.
Logs
Normal Provisioning 2m2s (x13 over 16m) cephfs.csi.ceph.com_ceph-csi-cephfs-provisioner-756d7bb54f-z5pr7_eda038b5-5ace-4e38-940b-68bfe6c76e31 External provisioner is provisioning volume for claim "monitoring-system/prometheus-prometheus-prometheus-db-prometheus-prometheus-prometheus-0"
Warning ProvisioningFailed 2m2s (x13 over 16m) cephfs.csi.ceph.com_ceph-csi-cephfs-provisioner-756d7bb54f-z5pr7_eda038b5-5ace-4e38-940b-68bfe6c76e31 failed to provision volume with StorageClass "ceph-cephfs-sc": rpc error: code = Internal desc = rados: ret=-1, Operation not permitted
Additional context
This was already discussed in https://github.com/ceph/ceph-csi/issues/1818#issuecomment-1057467489
Hi @benapetr,
Apart from the permissions to work with CephFS, Ceph-CSI needs to store additional metadata for the mapping of (CSI) volume handles to CephFS details. This metadata is stored directly in RADOS OMAPs, which should explain the need for the extra permissions.
If there is a reduced permission set that allows working with CephFS and RADOS, we would obviously appreciate guidance on dropping unneeded capabilities.
Details about the required capabilities are documented in docs/capabilities.md.
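As a side note on the RADOS OMAP metadata mentioned above, those bookkeeping objects can be inspected directly with the rados CLI; a sketch, where the pool name and the "csi" rados namespace are assumptions based on the defaults discussed later in this thread:

```shell
# Sketch: inspect the CSI bookkeeping objects stored as RADOS OMAPs in the
# metadata pool. Pool name and the "csi" rados namespace are assumptions
# (they match the defaults discussed later in this thread).
rados -p k8s-fs-metadata --namespace csi ls | grep '^csi\.'
rados -p k8s-fs-metadata --namespace csi listomapvals csi.volumes.default
```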
So does that mean the only safe and truly isolated way to allow multiple k8s clusters to use CephFS is to build a dedicated Ceph cluster for each k8s cluster? That is indeed not very efficient.
@benapetr
> So does that mean the only safe and truly isolated way to allow multiple k8s clusters to use CephFS is to build a dedicated Ceph cluster for each k8s cluster? That is indeed not very efficient.
Multitenancy for each k8s cluster on a single Ceph filesystem will be possible with PR #4652. I'm still working on it, and as soon as it gets merged I can finish it.
@nixpanic
About mgr capabilities of ceph-csi:
https://github.com/ceph/ceph-csi/blob/d5849a4801fa9f3383cefc4f96e18fcad420dc80/internal/cephfs/core/volume.go#L161
This would call in ceph-go: https://github.com/ceph/go-ceph/blob/1046b034a1f618f67acd3c6523482917e27c7113/cephfs/admin/subvolume.go#L270-L275
And they can be limited per tenant via Ceph capabilities, e.g. allow command ... with group_name prefix '<tenant-subvolumegroup>'.
These are hopefully all the commands ceph-csi uses internally, based on the source code:
```
allow command 'fs subvolume resize' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>' sub_name prefix 'csi-', \
allow command 'fs subvolume rm' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>' sub_name prefix 'csi-', \
allow command 'fs subvolume create' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>' sub_name prefix 'csi-', \
allow command 'fs subvolume snapshot create' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>' sub_name prefix 'csi-', \
allow command 'fs subvolume snapshot rm' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>' sub_name prefix 'csi-', \
allow command 'fs subvolume snapshot clone' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>' sub_name prefix 'csi-', \
allow command 'fs subvolume snapshot metadata set' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>' sub_name prefix 'csi-', \
allow command 'fs subvolume snapshot metadata rm' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>' sub_name prefix 'csi-', \
allow command 'fs subvolume metadata set' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>' sub_name prefix 'csi-', \
allow command 'fs subvolume metadata rm' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>' sub_name prefix 'csi-', \
allow command 'fs subvolume getpath' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>' sub_name prefix 'csi-', \
allow command 'fs subvolume ls' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>', \
allow command 'fs subvolume info' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>' sub_name prefix 'csi-', \
allow command 'fs subvolume snapshot info' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>' sub_name prefix 'csi-', \
allow command 'fs clone status' with vol_name prefix 'k8s-fs' group_name prefix '<subvolumegroup>', \
allow command 'fs volume ls', \
allow command 'fs dump', \
allow command 'fs ls'" \
```
See also https://docs.ceph.com/en/latest/rados/operations/user-management/
> Manager capabilities can also be specified for specific commands, for all commands exported by a built-in manager service, or for all commands exported by a specific add-on module.
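A sketch of attaching a few such command-scoped mgr caps to a client; the client name, fs name and the 'tenant-a' prefix are illustrative, and note that `ceph auth caps` replaces the client's whole cap set, so a real invocation would restate the mds/osd caps as well:

```shell
# Sketch: attach command-scoped mgr caps to a client. Note that
# "ceph auth caps" replaces the client's entire cap set, so the real
# command would restate the mds/osd caps as well.
# Client name, fs name and the 'tenant-a' prefix are illustrative.
ceph auth caps client.k8s-csi \
  mon 'allow r' \
  mgr "allow command 'fs subvolume create' with vol_name prefix 'k8s-fs' group_name prefix 'tenant-a' sub_name prefix 'csi-', \
allow command 'fs subvolume getpath' with vol_name prefix 'k8s-fs' group_name prefix 'tenant-a' sub_name prefix 'csi-', \
allow command 'fs subvolume ls' with vol_name prefix 'k8s-fs' group_name prefix 'tenant-a'"
```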
Is this something you are willing to maintain, i.e. will this list of mgr caps be kept up to date in future ceph-csi releases?
Maybe they can be generated from source for each release?
So I had some time. About the other caps (OSD/MDS) and the last pieces:
```shell
export CLUSTER="my-k8s-cluster-1"
# create subvolumegroup for cluster
ceph fs subvolumegroup create k8s-fs $CLUSTER
# set specific radosNamespace for data written to the subvolumegroup
setfattr -n ceph.dir.layout.pool_namespace -v $CLUSTER /cephfs/volumes/$CLUSTER
```
Caps for OSD / MDS
osd "allow rw pool=k8s-fs-data namespace=$CLUSTER, allow rw pool=k8s-fs-metadata namespace=$CLUSTER"
mds "allow rw fsname=k8s-fs path=/volumes/$CLUSTER"
So what's happening: with setfattr we ask CephFS to place any data written under the subvolumegroup into a specific rados namespace.
MDS (metadata) access is limited to the path /volumes/$CLUSTER.
Until the PR is merged, the caps for osd must be:
```
osd "allow rw pool=k8s-fs-data namespace=$CLUSTER, allow rw pool=k8s-fs-metadata namespace=csi"
```
With these caps we should reach multitenancy for CephFS.
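Putting the pieces together, a sketch of what the per-tenant keyring could look like, assuming the pool/fs names used above and omitting the command-scoped mgr caps (see the earlier list) for brevity:

```shell
# Sketch: per-tenant CSI keyring combining the caps above. Pool, fs and
# client names follow the examples in this thread; the mgr cap would be
# the command-scoped list shown earlier.
CLUSTER="my-k8s-cluster-1"
ceph auth get-or-create client.csi-${CLUSTER} \
  mon 'allow r' \
  mds "allow rw fsname=k8s-fs path=/volumes/${CLUSTER}" \
  osd "allow rw pool=k8s-fs-data namespace=${CLUSTER}, allow rw pool=k8s-fs-metadata namespace=csi"
# Verify the rados namespace layout on the subvolumegroup directory:
getfattr -n ceph.dir.layout.pool_namespace /cephfs/volumes/${CLUSTER}
```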
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.