unknown func bpf_perf_event_read_value#55 in eBPF module since 0.7.10
What happened?
In 0.7.10 and latest, loading the eBPF module crashes with the following trace. The root cause appears to be unknown func bpf_perf_event_read_value#55.
This doesn't happen in 0.7.8 and older.
libbpf: prog 'kepler_sched_switch_trace': -- BEGIN PROG LOAD LOG --
; if (SAMPLE_RATE > 0) {
0: (18) r2 = 0xffff9fa324c082d0
2: (61) r2 = *(u32 *)(r2 +0)
R1=ctx(id=0,off=0,imm=0) R2_w=map_value(id=0,off=0,ks=4,vs=12,imm=0) R10=fp0
3: (67) r2 <<= 32
4: (c7) r2 s>>= 32
5: (b7) r4 = 1
; if (SAMPLE_RATE > 0) {
6: (6d) if r4 s> r2 goto pc+13
last_idx 6 first_idx 0
regs=10 stack=0 before 5: (b7) r4 = 1
last_idx 6 first_idx 0
regs=4 stack=0 before 5: (b7) r4 = 1
regs=4 stack=0 before 4: (c7) r2 s>>= 32
regs=4 stack=0 before 3: (67) r2 <<= 32
regs=4 stack=0 before 2: (61) r2 = *(u32 *)(r2 +0)
; prev_pid = ctx->prev_pid;
20: (61) r1 = *(u32 *)(r1 +24)
21: (7b) *(u64 *)(r10 -160) = r1
; prev_pid = ctx->prev_pid;
22: (63) *(u32 *)(r10 -20) = r1
; pid_tgid = bpf_get_current_pid_tgid();
23: (85) call bpf_get_current_pid_tgid#14
24: (bf) r6 = r0
; cur_pid = pid_tgid & 0xffffffff;
25: (63) *(u32 *)(r10 -28) = r6
; cgroup_id = bpf_get_current_cgroup_id();
26: (85) call bpf_get_current_cgroup_id#80
27: (7b) *(u64 *)(r10 -184) = r0
; cpu_id = bpf_get_smp_processor_id();
28: (85) call bpf_get_smp_processor_id#8
29: (bf) r9 = r0
; cpu_id = bpf_get_smp_processor_id();
30: (63) *(u32 *)(r10 -24) = r9
; cur_ts = bpf_ktime_get_ns();
31: (85) call bpf_ktime_get_ns#5
32: (bf) r8 = r0
33: (b7) r7 = 0
; struct bpf_perf_event_value c = {};
34: (7b) *(u64 *)(r10 -128) = r7
last_idx 34 first_idx 32
regs=80 stack=0 before 33: (b7) r7 = 0
35: (7b) *(u64 *)(r10 -136) = r7
36: (7b) *(u64 *)(r10 -144) = r7
; &cpu_cycles_event_reader, *cpu_id, &c, sizeof(c));
37: (67) r9 <<= 32
38: (77) r9 >>= 32
39: (bf) r3 = r10
; prev_pid = ctx->prev_pid;
40: (07) r3 += -144
; error = bpf_perf_event_read_value(
41: (18) r1 = 0xffff9f8bacc80000
43: (bf) r2 = r9
44: (b7) r4 = 24
45: (85) call bpf_perf_event_read_value#55
unknown func bpf_perf_event_read_value#55
processed 31 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1
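For context, below is a simplified reconstruction of the call the verifier is rejecting. It is only a sketch based on the map and variable names that appear in the log above (cpu_cycles_event_reader, cpu_id, struct bpf_perf_event_value); the section name, context type, and map definition are assumptions, not Kepler's exact source.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Perf-event array the program reads hardware counters from
 * (name taken from the verifier log; the definition is an assumption). */
struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(int));
    __uint(value_size, sizeof(__u32));
} cpu_cycles_event_reader SEC(".maps");

SEC("tracepoint/sched/sched_switch")
int kepler_sched_switch_trace(void *ctx)
{
    struct bpf_perf_event_value c = {};
    __u32 cpu_id = bpf_get_smp_processor_id();

    /* Helper #55: kernels that do not allow it in this program type
     * (e.g. Ubuntu 5.4) fail verification with "unknown func". */
    long error = bpf_perf_event_read_value(&cpu_cycles_event_reader,
                                           cpu_id, &c, sizeof(c));
    if (error)
        return 0;
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```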
What did you expect to happen?
I expected the eBPF module to load successfully, as it does in 0.7.8 and older. Instead, the failure happens in 0.7.10 and latest.
How can we reproduce it (as minimally and precisely as possible)?
It happens on Ubuntu hosts running 5.4 kernels.
Anything else we need to know?
No response
Kepler image tag
Kubernetes version
$ kubectl version
v1.27.3
Cloud provider or bare metal
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Install tools
Kepler deployment config
For Kubernetes deployments:
$ KEPLER_NAMESPACE=kepler
# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
# paste output here
# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
For standalone:
put your Kepler command argument here
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)
Here is the explanation:
I will take this up.
See: https://github.com/sustainable-computing-io/kepler/pull/1398
With these changes applied, the minimum supported kernel version for Kepler is 5.12 due to:
- bpf_perf_event_read_value, which is available in tracepoint contexts since 5.12
- BPF fentry/fexit programs, which were added in 5.11
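One way to surface this as a clear error message instead of a verifier dump is to probe helper support before loading the object. The following is a hedged userspace sketch (not Kepler code) using libbpf's feature-probing API; it assumes libbpf >= 0.6 for libbpf_probe_bpf_helper() and that BPF_PROG_TYPE_TRACEPOINT is the relevant program type.

```c
#include <stdio.h>
#include <linux/bpf.h>
#include <bpf/libbpf.h>

int main(void)
{
    /* Returns 1 if the helper is usable in this program type on the
     * running kernel, 0 if not, negative errno on probe failure. */
    int ret = libbpf_probe_bpf_helper(BPF_PROG_TYPE_TRACEPOINT,
                                      BPF_FUNC_perf_event_read_value, NULL);
    if (ret < 0) {
        fprintf(stderr, "helper probe failed: %d\n", ret);
        return 1;
    }
    if (ret == 0) {
        fprintf(stderr, "kernel does not support bpf_perf_event_read_value "
                        "in tracepoint programs (kernel >= 5.12 required)\n");
        return 1;
    }
    printf("helper supported; safe to load the BPF object\n");
    return 0;
}
```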
I think this is a pretty reasonable trade-off if you read the man page of bpf_perf_event_read_value.
If you really want 5.4 support, we can discuss that, as it's not trivial.
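For reference on why 5.4 support is non-trivial: the option available in tracepoint programs on older kernels is the bpf_perf_event_read helper, which, per the bpf-helpers man page, returns only the raw counter (no enabled/running times for multiplexing correction) and folds errors into the same return value. A rough sketch of that legacy pattern follows; it is an illustration under those assumptions, not a proposed Kepler patch.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(int));
    __uint(value_size, sizeof(__u32));
} cpu_cycles_event_reader SEC(".maps");

SEC("tracepoint/sched/sched_switch")
int kepler_sched_switch_trace_legacy(void *ctx)
{
    __u32 cpu_id = bpf_get_smp_processor_id();

    /* bpf_perf_event_read returns the counter directly; a negative errno
     * is cast into the same __u64, so the caller must disambiguate it. */
    __u64 val = bpf_perf_event_read(&cpu_cycles_event_reader, cpu_id);
    if ((__s64)val < 0)
        return 0;   /* read failed (e.g. event not opened on this CPU) */

    /* use val ... */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```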
The only reference to kernel requirements I could find in the documentation appears to be out of date: https://sustainable-computing.io/installation/strategy/. I raised an issue (#1866) about a failure that occurs because the hosts run kernel v5.4.
Please clearly document these requirements on a release-by-release basis.