Agent K8S workload attestor support for cgroups v2
- Version: 1.5.5+
- Platform: K8S with amd64 linux kernel using cgroups v2
- Subsystem: Agent k8s workload attestor
Currently the k8s workload attestor extracts the podUID and containerUID from /proc/<PID>/cgroup:
- https://github.com/spiffe/spire/blob/main/pkg/agent/plugin/workloadattestor/k8s/k8s.go#L206
- https://github.com/spiffe/spire/blob/main/pkg/agent/common/cgroups/cgroups.go#L31
This works for cgroups v1, but it does not work with cgroups v2. Under v2, /proc/<PID>/cgroup will always be `0::/`, which means no containerUID can be extracted and therefore no k8s workload attestation is possible.
Are you planning to migrate it to support cgroups v2? Looking at some existing solutions, it seems possible to extract the containerUID from /proc/self/mountinfo instead (a rough sketch follows the background links below).
Background:
- https://man7.org/linux/man-pages/man7/cgroups.7.html#CGROUPS_VERSION_1
- https://man7.org/linux/man-pages/man7/cgroups.7.html#CGROUPS_VERSION_2
- https://docs.kernel.org/admin-guide/cgroup-v2.html
- https://stackoverflow.com/questions/68816329/how-to-get-docker-container-id-from-within-the-container-with-cgroup-v2
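For illustration, a rough sketch of the mountinfo approach (not SPIRE code; the function name and regex are mine and assume a containerd-based runtime with systemd-style scope names):

```go
// Illustrative only: under cgroups v2 the container ID often still shows up in
// mount source paths inside /proc/self/mountinfo (e.g. in a
// "cri-containerd-<id>.scope" segment), even when /proc/self/cgroup is just "0::/".
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// containerIDRe matches a 64-character hex container ID in a containerd scope
// name; the exact pattern depends on the runtime and cgroup driver in use.
var containerIDRe = regexp.MustCompile(`cri-containerd-([0-9a-f]{64})\.scope`)

func containerIDFromMountinfo(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if m := containerIDRe.FindStringSubmatch(scanner.Text()); m != nil {
			return m[1], nil
		}
	}
	if err := scanner.Err(); err != nil {
		return "", err
	}
	return "", fmt.Errorf("no container ID found in %s", path)
}

func main() {
	id, err := containerIDFromMountinfo("/proc/self/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(id)
}
```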
Thank you for opening this @tihhoni, and for sharing the links - they're very helpful.
I see that kubelet will use cgroups v2 by default any time that it is enabled. Do you have a sense for which or how many distributions are shipping with cgroups v2 enabled currently? Did you bump into this issue directly?
Here's some info that I was able to find regarding which distros already ship with cgroups v2 enabled by default:
- Container Optimized OS (since M97)
- Ubuntu (since 21.10)
- Debian GNU/Linux (since Debian 11 Bullseye)
- Fedora (since 31)
- Arch Linux (since April 2021)
- RHEL and RHEL-like distributions (since 9)
Ref: https://kubernetes.io/blog/2022/08/31/cgroupv2-ga-1-25/#how-do-you-use-cgroup-v2
Hi @evan2645
From various sources I've got the feeling that cgroups v1 will be deprecated by the end of 2023, so presumably new kernel/OS updates will come exclusively with cgroups v2 support.
We ran into this issue in a staging cluster, where an OS upgrade automatically rolled out a new Linux distribution and tests started to fail.
I am still running tests to understand the full impact. What I know so far:
- The spire-agent `run` PID does not have any container/pod details in /proc/<PID>/cgroup. This leaves the spire-agent unaware that it runs in k8s, so any attempt to use the CLI, e.g. `./bin/spire-agent api fetch`, results in "No identity issued" with empty selectors.
- Surprisingly, when I use a "client" from the tutorial that watches SVIDs, that one does have the container/pod UID in its /proc/<PID>/cgroup: https://github.com/spiffe/spire-tutorials/blob/main/k8s/quickstart/client-deployment.yaml
So it seems that a PID's cgroup might have container/pod UID info in some cases; I just have not figured out yet what that condition is.
If I get more results, I will post them here.
-- Nikolai
After some testing, it seems the issue is not that dire:
Test setup:
```
spire-agent run PID = 302967
busybox sleep   PID = 329596
```
Both pods run with:
```yaml
hostPID: true
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
```
From the busybox pod:
```
k exec -it busybox -- /bin/sh
/ # cat /proc/302967/cgroup
0::/../../kubepods-besteffort-pod02b5aa9b_9013_49b5_9755_efeb06552752.slice/cri-containerd-222a5371beed8c0ad58e23d1cd39e13bc113915a385d3fea4510e1cd404868c2.scope
/ # cat /proc/329596/cgroup
0::/
```
From the spire-agent pod:
```
k exec -it ds/spire-agent -c spire-agent -- /bin/sh
/opt/spire # cat /proc/302967/cgroup
0::/
/opt/spire # cat /proc/329596/cgroup
0::/../../kubepods-besteffort-podb2dce23b_8b3a_4356_b903_56152d488916.slice/cri-containerd-cf6d9de603ab1f665fc84f88c22f761678fcaae86b98a3936e0b9717740146aa.scope
```
So from within a container, if you try to read your own cgroup, it's the root without any container ID, but if you read another process's cgroup it's fine.
As a result, the spire agent can attest other k8s PIDs but not itself.
This might be different depending on the cluster setup.
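To make that concrete, here is a minimal sketch (my own regex and names, not SPIRE's) of how the pod UID and container ID could be recovered from a cgroups v2 line like the ones above, assuming the systemd-slice + cri-containerd naming this cluster uses:

```go
// Illustrative parser for the cgroup v2 lines shown above. The regex assumes
// the systemd + containerd naming scheme; a real implementation would need to
// cover other runtimes and cgroup drivers as well.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var kubepodsRe = regexp.MustCompile(
	`kubepods-[a-z]+-pod([0-9a-f_]+)\.slice/cri-containerd-([0-9a-f]{64})\.scope`)

// parseCgroupV2Line extracts (podUID, containerID) from a "0::<path>" line.
func parseCgroupV2Line(line string) (podUID, containerID string, ok bool) {
	m := kubepodsRe.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	// systemd slice names encode the pod UID with underscores instead of dashes.
	return strings.ReplaceAll(m[1], "_", "-"), m[2], true
}

func main() {
	line := "0::/../../kubepods-besteffort-pod02b5aa9b_9013_49b5_9755_efeb06552752.slice/" +
		"cri-containerd-222a5371beed8c0ad58e23d1cd39e13bc113915a385d3fea4510e1cd404868c2.scope"
	pod, ctr, ok := parseCgroupV2Line(line)
	fmt.Println(ok, pod, ctr)
}
```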
Thank you very much @tihhoni for looking into this further ❤️
Can you tell if the problem only affects "self", or if it affects e.g. all pods running as root? I'm marking this as priority/backlog for now, but if it affects all root pods, we may consider upgrading it to urgent. Thanks again!
We've noticed similar changes will be needed for the dockerd-workloadattestor to support cgroups v2. We have a project that will require cgroups v2 support in dockerd-workloadattestor beginning in May 2023. After a brief discussion in the contributor sync today, we reached a consensus that it makes sense to add this support across several PRs:
- introduce logic to support cgroups v2
- migrate k8s workload attestor to use the cgroups v2 logic
- migrate dockerd workload attestor to use the cgroups v2 logic
@evan2645 my impression is that only "self" is the problem. I tested a full ingress/egress Envoy setup and attestation works fine (the spire server can successfully get the Envoy PIDs' container IDs).
Forgot to mention: we run containerd 1.6.* and k8s 1.25.*; I did not check other runtimes.
> migrate k8s workload attestor to use the cgroups v2 logic
cgroups v1 will also need to be supported for some time. This is probably what you meant, but I want to make sure.
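For what it's worth, one common way to branch between the two at runtime is to check whether /sys/fs/cgroup is a pure unified (v2) mount; a minimal sketch, assuming the standard mount point (hybrid setups need extra care):

```go
// Sketch: detect whether the host uses a pure cgroups v2 (unified) hierarchy.
// The cgroup.controllers file only exists at the root of a cgroup2 mount.
// Another option is to statfs /sys/fs/cgroup and compare the filesystem magic
// against CGROUP2_SUPER_MAGIC.
package main

import (
	"fmt"
	"os"
)

func isCgroupsV2() bool {
	_, err := os.Stat("/sys/fs/cgroup/cgroup.controllers")
	return err == nil
}

func main() {
	if isCgroupsV2() {
		fmt.Println("cgroups v2 (unified) detected")
	} else {
		fmt.Println("assuming cgroups v1 (or hybrid) layout")
	}
}
```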
I am running into this issue again, but now from a different perspective. Just wanted to share (it is more related to cgroupns, but in all cases we are running on cgroups v2).
It seems that tools are slowly moving away from cgroups v1 and are also defaulting to cgroupns=private (e.g. the latest docker daemon and kind releases):
- https://docs.docker.com/engine/reference/commandline/dockerd/#daemon
- https://github.com/kubernetes-sigs/kind/releases/tag/v0.20.0
This creates a problem for running some integration tests where the cgroupns is hardcoded: https://github.com/spiffe/spire/blob/main/test/integration/suites/nested-rotation/00-setup#L32
- If we run tests with dind in k8s (cgroupns=private), the docker cgroups that the agent workload attestor sees look like /../<id> (--cgroup-parent=/actions_job on the docker daemon does nothing here).
- If we run tests with dind cgroupns=host + --cgroup-parent=/actions_job, they work fine, but then we hit issues with the kind cluster tests, which mess up the main k8s kubelet cgroup.
We can of course patch CGROUP_MATCHERS for our own tests (maybe it would be good to allow users to set it too; the default matcher /docker/<id> does not work in this case either).
Just a heads up for now, in case your GA VMs switch the docker daemon to cgroupns=private on cgroups v2.
@nikotih thank you very much for reporting this ❤️ would you mind filing it as a new issue specifically around cgroupns=private? I think that warrants separate attention and it will be good to triage it independently (but still in relation to this one)
Sure thing, I can. However, I think cgroups v2 and cgroupns=private go hand in hand:
a cgroups v2 enabled kubelet and docker daemon will automatically start with a private cgroup namespace (unless explicitly changed).
For the original issue that I opened here, the actual root cause is also the private namespace:
```
docker info
Client:
 Version: 24.0.7
 Context: desktop-linux
 ...
Server:
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 ...
```
Disclaimer: this is a non-systemd environment.
```
docker run --rm alpine cat /proc/1/cgroup
0::/

docker run --rm --cgroupns=private alpine cat /proc/1/cgroup
0::/

docker run --rm --cgroupns=host alpine cat /proc/1/cgroup
0::/docker/46f093fa43676f4cfda0fe1fbf49a6c959f56e22808e2592e1d1e7bd0366ed0c
```
I.e. for the spire agent on cgroups v2 with a private cgroup namespace, it cannot deduce its own container.
Starting two containers with:
```
docker run -it --rm --cgroupns=private --pid=host alpine /bin/sh
```
PIDs:
```
2809 root 0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 3c217b7f3a40352850d757d3f5f0e263731425b0350
2835 root 0:00 /bin/sh
2977 root 0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 67b44b40c304559915db0b688f60e83555c40f26e1d
3003 root 0:00 /bin/sh
```
First container:
```
/ # cat /proc/2835/cgroup
0::/
/ # cat /proc/3003/cgroup
0::/../67b44b40c304559915db0b688f60e83555c40f26e1d4712f0b7dc91dafe3ddd4
```
Second container:
```
/ # cat /proc/2835/cgroup
0::/../3c217b7f3a40352850d757d3f5f0e263731425b03505dd73420d9825ec76d9b2
/ # cat /proc/3003/cgroup
0::/
```
Same thing:
- its own container it sees as the root (`0::/`)
- other containers it sees not as `/docker/<id>` but as `/../<id>`
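For reference, a rough sketch (my own pattern, not the plugin's actual matcher configuration) of a matcher that tolerates both path shapes seen here:

```go
// Sketch only: a docker container ID matcher that accepts both the cgroups v1
// style "/docker/<id>" path and the cgroups v2 + cgroupns=private "/../<id>" path.
package main

import (
	"fmt"
	"regexp"
)

var dockerIDRe = regexp.MustCompile(`^0::(?:/docker|/\.\.)/([0-9a-f]{64})$`)

func dockerContainerID(cgroupLine string) (string, bool) {
	m := dockerIDRe.FindStringSubmatch(cgroupLine)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	for _, line := range []string{
		"0::/docker/46f093fa43676f4cfda0fe1fbf49a6c959f56e22808e2592e1d1e7bd0366ed0c",
		"0::/../67b44b40c304559915db0b688f60e83555c40f26e1d4712f0b7dc91dafe3ddd4",
	} {
		fmt.Println(dockerContainerID(line))
	}
}
```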
Ok, I appreciate this @nikotih. I don't have the time to dig in right now, but I'll move this issue back to triage so we can reassess the impact.