security-profiles-operator
enabling eBPF Recorder on AKS crashes SPOD containers
Following @saschagrunert's excellent tutorial, I ran:
kubectl patch spod spod --type=merge -p '{"spec":{"enableBpfRecorder":true}}'
which eventually led to the following output in the bpf-recorder container:
I0129 19:09:14.865625 27546 logr.go:252] "msg"="Set logging verbosity to 1"
I0129 19:09:14.865684 27546 logr.go:252] "msg"="Profiling support enabled: false"
I0129 19:09:14.865733 27546 logr.go:252] setup "msg"="starting component: bpf-recorder" "buildDate"="1980-01-01T00:00:00Z" "compiler"="gc" "gitCommit"="unknown" "gitTreeState"="clean" "goVersion"="go1.17.3" "libbpf"="0.5.0" "libseccomp"="2.5.2" "platform"="linux/amd64" "version"="0.5.0-dev"
I0129 19:09:14.865789 27546 bpfrecorder.go:106] bpf-recorder "msg"="Setting up caches with expiry of 1h0m0s"
I0129 19:09:14.865820 27546 bpfrecorder.go:123] bpf-recorder "msg"="Starting log-enricher on node: aks-primary-29748022-vmss000002"
I0129 19:09:14.866518 27546 bpfrecorder.go:154] bpf-recorder "msg"="Connecting to metrics server"
I0129 19:09:14.867108 27546 bpfrecorder.go:170] bpf-recorder "msg"="Got system mount namespace: 4026531840"
I0129 19:09:14.867126 27546 bpfrecorder.go:172] bpf-recorder "msg"="Doing BPF load/unload self-test"
I0129 19:09:14.867139 27546 bpfrecorder.go:371] bpf-recorder "msg"="Loading bpf module"
I0129 19:09:14.867162 27546 bpfrecorder.go:440] bpf-recorder "msg"="Using system btf file"
I0129 19:09:14.867382 27546 bpfrecorder.go:391] bpf-recorder "msg"="Loading bpf object from module"
libbpf: map 'events': failed to create: Invalid argument(-22)
libbpf: failed to load object 'recorder.bpf.o'
E0129 19:09:14.871501 27546 logr.go:270] setup "msg"="running security-profiles-operator" "error"="load self-test: load bpf object: failed to load BPF object"
- Cloud provider or hardware configuration: Azure AKS version 1.21.7
- OS: Linux
- Kernel (e.g. uname -a): 5.4.0-1067-azure
- Others: containerd://1.4.9+azure
kubectl get nodes -o wide
❯ k get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
aks-primary-29748022-vmss000000 Ready agent 27h v1.21.7 10.240.0.4 <none> Ubuntu 18.04.6 LTS 5.4.0-1067-azure containerd://1.4.9+azure
aks-primary-29748022-vmss000001 Ready agent 27h v1.21.7 10.240.0.5 <none> Ubuntu 18.04.6 LTS 5.4.0-1067-azure containerd://1.4.9+azure
aks-primary-29748022-vmss000002 Ready agent 27h v1.21.7 10.240.0.6 <none> Ubuntu 18.04.6 LTS 5.4.0-1067-azure containerd://1.4.9+azure
I can reproduce it and we probably should update libbpf and the vendored btf to see if that fixes the issue.
Did a test with https://github.com/kubernetes-sigs/security-profiles-operator/pull/796 and it does not work, because:
- we usually fall back to the in-memory btf (which is now available within that patch) if no vmlinux is exposed. The vmlinux file is available on the Azure node, so in theory it should work with that file.
- forcing it to use the in-memory btf fails with the same error
That's odd. I'm not sure whether the kernel configuration of the Azure nodes is correct to support our BPF application.
Great, I'm grateful for the time you invested in that.
LMK if more checks or changes are needed from my end.
Tomer
On Mon, 31 Jan 2022 at 11:52, Sascha Grunert wrote:
I can reproduce it and we probably should update libbpf and the vendored btf to see if that fixes the issue.
@tshaiman can you share the configuration flags the kernel was built with? Ubuntu 18.04 does not expose /sys/kernel/btf/vmlinux by default.
When trying the libbpf-bootstrap demo application (https://github.com/libbpf/libbpf-bootstrap/blob/master/examples/c/bootstrap.c), I get the same error on an Azure node (I disabled the failure on increasing RLIMIT_MEMLOCK):
root@aks-agentpool-41851968-vmss000001:~/libbpf-bootstrap/examples/c# ./bootstrap
Failed to increase RLIMIT_MEMLOCK limit!
libbpf: map 'rb': failed to create: Invalid argument(-22)
libbpf: failed to load object 'bootstrap_bpf'
libbpf: failed to load BPF skeleton 'bootstrap_bpf': -22
Failed to load and verify BPF skeleton
# uname -a
Linux aks-agentpool-41851968-vmss000001 5.4.0-1067-azure #70~18.04.1-Ubuntu SMP Thu Jan 13 19:46:01 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
@saschagrunert : I don't have insights on how the kernel was built as I'm not part of the AKS team.
Maybe we can open an issue in their tracker to describe the problem there?
@saschagrunert : done https://github.com/Azure/AKS/issues/2768
Still the same with the latest Azure deployment.
Correct, the bug is still there: https://github.com/Azure/AKS/issues/2768 is still open. I sent a ping/reminder on the ticket.
Still pending on AKS; I have reminded them many times. PS: it could be related to https://github.com/Azure/AKS/issues/2827
I can reproduce this issue on GKE COS nodes too. Error logs:
Found 6 pods, using pod/spod-sl7n4
I0315 20:32:49.511742 167526 logr.go:252] "msg"="Set logging verbosity to 0"
I0315 20:32:49.511798 167526 logr.go:252] "msg"="Profiling support enabled: false"
I0315 20:32:49.511881 167526 logr.go:252] setup "msg"="starting component: bpf-recorder" "buildDate"="1980-01-01T00:00:00Z" "compiler"="gc" "gitCommit"="67f1c871de542881ea397058874fc020c604198e" "gitTreeState"="dirty" "goVersion"="go1.17.6" "libbpf"="0.6.1" "libseccomp"="2.5.3" "platform"="linux/amd64" "version"="0.4.2-dev"
I0315 20:32:49.511934 167526 bpfrecorder.go:105] bpf-recorder "msg"="Setting up caches with expiry of 1h0m0s"
I0315 20:32:49.511957 167526 bpfrecorder.go:122] bpf-recorder "msg"="Starting log-enricher on node: gke-sam-cluster-2-pool-2-0f1e4876-rr52"
I0315 20:32:49.512901 167526 bpfrecorder.go:153] bpf-recorder "msg"="Connecting to metrics server"
I0315 20:32:49.513778 167526 bpfrecorder.go:169] bpf-recorder "msg"="Got system mount namespace: 4026531840"
I0315 20:32:49.513798 167526 bpfrecorder.go:171] bpf-recorder "msg"="Doing BPF load/unload self-test"
I0315 20:32:49.513815 167526 bpfrecorder.go:370] bpf-recorder "msg"="Loading bpf module"
I0315 20:32:49.513839 167526 bpfrecorder.go:439] bpf-recorder "msg"="Using system btf file"
I0315 20:32:49.514097 167526 bpfrecorder.go:390] bpf-recorder "msg"="Loading bpf object from module"
libbpf: map 'events': failed to create: Invalid argument(-22)
libbpf: failed to load object 'recorder.bpf.o'
E0315 20:32:49.520079 167526 logr.go:270] setup "msg"="running security-profiles-operator" "error"="load self-test: load bpf object: failed to load BPF object"
Not related to GKE, but maybe BTF Hub can help with the AKS case?
AKS already exposes /sys/kernel/btf/vmlinux, which should contain the correct BTF information. I think I tried manually using the internally provided BTF, but this had the same effect.
I have the same issue with SPO deployed in my local cluster. Could this be a kernel problem?
My OS is CentOS with kernel:
Linux k8s-master-node-1 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
AKS doesn't do anything special for our kernels. They are based on Azure marketplace Ubuntu 18.04 images. Have you tried to reproduce this on a vanilla Azure (non-AKS) VM?
FWIW, I tried this once like a year or so ago and hit similar issues I wasn't able to resolve. I ended up not using BTF sadly. My suspicion is it could be something to do with 18.04 or how they backported kernel fixes and a version like 20.04 originally based on 5.x+ could work out of the box (i.e., something is wonky between 4.15 + 18.04 and 5.4 + 18.04 because 4.15 didn't support BTF, but later kernels did). But TBH, I am not an expert in BTF, and I don't think we are doing anything special here, so I'm not sure where to investigate.
@saschagrunert in case it's helpful, attached the kconfig from a running AKS node. Notably I do see CONFIG_DEBUG_INFO_BTF=y, which is interesting (don't think that used to be the case in original 18.04, possibly came with kernel bump in 18.04.5 or whatever latest patch is?).
here's a snippet of the config grepping for bpf/btf flags to save you some time (admittedly 1067 vs 1074 but they are basically the same)
/# cat /boot/config-5.4.0-1074-azure | grep "B[T|P]F"
CONFIG_CGROUP_BPF=y
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_UNPRIV_DEFAULT_OFF=y
CONFIG_IPV6_SEG6_BPF=y
CONFIG_NETFILTER_XT_MATCH_BPF=m
CONFIG_BPFILTER=y
CONFIG_BPFILTER_UMH=m
CONFIG_NET_CLS_BPF=m
CONFIG_NET_ACT_BPF=m
CONFIG_BPF_JIT=y
CONFIG_BPF_STREAM_PARSER=y
CONFIG_LWTUNNEL_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_BPF_EVENTS=y
CONFIG_BPF_KPROBE_OVERRIDE=y
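The same check can be sketched programmatically. This is a minimal illustration in Python, not something the operator ships: the names `parse_kconfig`, `missing_bpf_options` and `REQUIRED_OPTIONS` are hypothetical, and the list of "required" options is an assumption made for the example rather than an official list from SPO or libbpf.

```python
# Options this sketch treats as prerequisites. This is an illustrative
# assumption, not an official requirement list from SPO or libbpf.
REQUIRED_OPTIONS = ("CONFIG_BPF", "CONFIG_BPF_SYSCALL", "CONFIG_DEBUG_INFO_BTF")


def parse_kconfig(text):
    """Parse kernel config text (e.g. the contents of /boot/config-<release>)
    into a dict of option name -> value, skipping comments and blank lines."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            options[key] = value
    return options


def missing_bpf_options(options, required=REQUIRED_OPTIONS):
    """Return the required options that are not built in (value other than 'y')."""
    return [name for name in required if options.get(name) != "y"]


# A few lines in the same format as the grep output above.
sample = """\
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
# CONFIG_EXAMPLE is not set
CONFIG_NET_CLS_BPF=m
CONFIG_DEBUG_INFO_BTF=y
"""
print(missing_bpf_options(parse_kconfig(sample)))
```

An empty result only says that these build flags are set. It does not say which BPF map types the running kernel supports, so a kernel can pass this check and still fail to load the recorder.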
@alexeldeib: thanks a lot for helping to get those configs. My 2 cents: here is the latest kernel config, 5.4.0-1074, from my running AKS node. config-5.4.0-1074-azure.txt
ah. I suspect you need kernel 5.8 (not available on ubuntu 18.04 or AKS yet)
https://github.com/kubernetes-sigs/security-profiles-operator/blob/79bfa1db8a1d4fab8fbcada717df083f3b2a3bbf/internal/pkg/daemon/bpfrecorder/bpf/recorder.bpf.c#L28
https://github.com/torvalds/linux/commit/457f44363a8894135c85b7a9afd2bd8196db24ab
https://github.com/iovisor/bcc/blob/master/docs/kernel-versions.md
Ring buffer maps (BPF_MAP_TYPE_RINGBUF) were only added in kernel 5.8.
see also https://github.com/libbpf/libbpf-bootstrap/issues/42 which is due to the same issue: https://github.com/libbpf/libbpf-bootstrap/blob/a08b97804db0bcde6c6eca45cef58e436288fe34/examples/c/bootstrap.bpf.c#L19
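A quick way to check a node against that requirement is to compare its kernel release string with 5.8. The following is a minimal sketch in Python; `parse_kernel_release` and `supports_ringbuf` are hypothetical helper names, not part of SPO or libbpf. It assumes the upstream version number is a reliable indicator, which may not hold for distribution kernels that backport features.

```python
import re
from typing import Tuple

# Ring buffer maps first appeared in Linux 5.8 (see the links above).
RINGBUF_MIN_KERNEL = (5, 8)


def parse_kernel_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.4.0-1067-azure'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def supports_ringbuf(release: str) -> bool:
    """Heuristic: True if the release is at least 5.8.
    Ignores possible vendor backports, so treat the result as a hint."""
    return parse_kernel_release(release) >= RINGBUF_MIN_KERNEL


# Kernel releases reported in this thread (output of `uname -r`).
for release in ("5.4.0-1067-azure", "3.10.0-1160.el7.x86_64", "5.13.0-41"):
    print(release, supports_ringbuf(release))
```

Comparing `(major, minor)` tuples avoids the string-comparison trap where "5.13" would sort before "5.8".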
That indeed seems to be the root cause, well done @alexeldeib! @saschagrunert: do you think an alternative to ring buffer maps could be used for kernels older than 5.8? It might help other developers who have reported the same compatibility issues.
@tshaiman maybe, I'll see what we can do here.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@saschagrunert I'm facing the same issue on a k3s cluster
spod-f7bzv 2/3 CrashLoopBackOff 38 (4m42s ago) 134m
$ k logs spod-f7bzv -c bpf-recorder
I0720 16:07:53.028594 59631 bpfrecorder.go:105] bpf-recorder "msg"="Setting up caches with expiry of 1h0m0s"
I0720 16:07:53.028604 59631 bpfrecorder.go:122] bpf-recorder "msg"="Starting log-enricher on node: setcisedtp0013.hosting.cegedim.cloud"
I0720 16:07:53.029242 59631 bpfrecorder.go:153] bpf-recorder "msg"="Connecting to metrics server"
I0720 16:07:53.030180 59631 bpfrecorder.go:173] bpf-recorder "msg"="Got system mount namespace: 4026531840"
I0720 16:07:53.030190 59631 bpfrecorder.go:175] bpf-recorder "msg"="Doing BPF load/unload self-test"
I0720 16:07:53.030195 59631 bpfrecorder.go:374] bpf-recorder "msg"="Loading bpf module"
I0720 16:07:53.030211 59631 bpfrecorder.go:443] bpf-recorder "msg"="Using system btf file"
I0720 16:07:53.030592 59631 bpfrecorder.go:394] bpf-recorder "msg"="Loading bpf object from module"
libbpf: map 'events': failed to create: Invalid argument(-22)
libbpf: failed to load object 'recorder.bpf.o'
E0720 16:07:53.034027 59631 logr.go:279] setup "msg"="running security-profiles-operator" "error"="load self-test: load bpf object: failed to load BPF object"
Environment:
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
uname -a:
Linux -- 5.4.0-110-generic #124-Ubuntu SMP Thu Apr 14 19:46:19 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
@B3ns44d I think we require kernel 5.8 for that to work :-/
@saschagrunert ohhh didn't know that, it now functions properly after upgrading to 5.13.0-41.
/lifecycle rotten
This seems to have been answered with the kernel version comment. Closing.