unknown func bpf_perf_event_read_value#55 in eBPF module since 0.7.10

Open rootfs opened this issue 1 year ago • 3 comments

What happened?

In 0.7.10 and latest, loading the eBPF module fails with the following trace. The root cause appears to be: unknown func bpf_perf_event_read_value#55

This doesn't happen in 0.7.8 and older.

libbpf: prog 'kepler_sched_switch_trace': -- BEGIN PROG LOAD LOG --
; if (SAMPLE_RATE > 0) {
0: (18) r2 = 0xffff9fa324c082d0
2: (61) r2 = *(u32 *)(r2 +0)
 R1=ctx(id=0,off=0,imm=0) R2_w=map_value(id=0,off=0,ks=4,vs=12,imm=0) R10=fp0
3: (67) r2 <<= 32
4: (c7) r2 s>>= 32
5: (b7) r4 = 1
; if (SAMPLE_RATE > 0) {
6: (6d) if r4 s> r2 goto pc+13
last_idx 6 first_idx 0
regs=10 stack=0 before 5: (b7) r4 = 1
last_idx 6 first_idx 0
regs=4 stack=0 before 5: (b7) r4 = 1
regs=4 stack=0 before 4: (c7) r2 s>>= 32
regs=4 stack=0 before 3: (67) r2 <<= 32
regs=4 stack=0 before 2: (61) r2 = *(u32 *)(r2 +0)
; prev_pid = ctx->prev_pid;
20: (61) r1 = *(u32 *)(r1 +24)
21: (7b) *(u64 *)(r10 -160) = r1
; prev_pid = ctx->prev_pid;
22: (63) *(u32 *)(r10 -20) = r1
; pid_tgid = bpf_get_current_pid_tgid();
23: (85) call bpf_get_current_pid_tgid#14
24: (bf) r6 = r0
; cur_pid = pid_tgid & 0xffffffff;
25: (63) *(u32 *)(r10 -28) = r6
; cgroup_id = bpf_get_current_cgroup_id();
26: (85) call bpf_get_current_cgroup_id#80
27: (7b) *(u64 *)(r10 -184) = r0
; cpu_id = bpf_get_smp_processor_id();
28: (85) call bpf_get_smp_processor_id#8
29: (bf) r9 = r0
; cpu_id = bpf_get_smp_processor_id();
30: (63) *(u32 *)(r10 -24) = r9
; cur_ts = bpf_ktime_get_ns();
31: (85) call bpf_ktime_get_ns#5
32: (bf) r8 = r0
33: (b7) r7 = 0
; struct bpf_perf_event_value c = {};
34: (7b) *(u64 *)(r10 -128) = r7
last_idx 34 first_idx 32
regs=80 stack=0 before 33: (b7) r7 = 0
35: (7b) *(u64 *)(r10 -136) = r7
36: (7b) *(u64 *)(r10 -144) = r7
; &cpu_cycles_event_reader, *cpu_id, &c, sizeof(c));
37: (67) r9 <<= 32
38: (77) r9 >>= 32
39: (bf) r3 = r10
; prev_pid = ctx->prev_pid;
40: (07) r3 += -144
; error = bpf_perf_event_read_value(
41: (18) r1 = 0xffff9f8bacc80000
43: (bf) r2 = r9
44: (b7) r4 = 24
45: (85) call bpf_perf_event_read_value#55
unknown func bpf_perf_event_read_value#55
processed 31 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1
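
When triaging a load failure like this one, the rejected helper can be pulled out of the verifier log mechanically. The following is a minimal sketch, assuming the log has been saved as plain text. The name `find_unknown_helpers` is hypothetical and is not part of Kepler or libbpf; the pattern is based only on the "unknown func NAME#ID" line seen in the trace above.

```python
import re

# Matches verifier lines such as "unknown func bpf_perf_event_read_value#55".
# The pattern is inferred from the trace in this issue, not from a libbpf spec.
UNKNOWN_FUNC = re.compile(r"^unknown func (\w+)#(\d+)[ \t]*$", re.MULTILINE)


def find_unknown_helpers(log_text):
    """Return (helper_name, helper_id) pairs reported as unknown in the log."""
    return [(name, int(helper_id)) for name, helper_id in UNKNOWN_FUNC.findall(log_text)]


# Excerpt of the trace from this issue, used as sample input.
sample = """\
44: (b7) r4 = 24
45: (85) call bpf_perf_event_read_value#55
unknown func bpf_perf_event_read_value#55
processed 31 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1
"""

if __name__ == "__main__":
    for name, helper_id in find_unknown_helpers(sample):
        print(f"helper not accepted by the verifier: {name} (id {helper_id})")
```

Only lines that start with "unknown func" are matched, so the preceding "call" instruction that names the same helper is not reported twice.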

What did you expect to happen?

The eBPF module should load as it does in 0.7.8 and older. The failure happens in 0.7.10 and latest.

How can we reproduce it (as minimally and precisely as possible)?

It happens on Ubuntu hosts running 5.4 kernels.

Anything else we need to know?

No response

Kepler image tag

0.7.10

Kubernetes version

$ kubectl version
v1.27.3

Cloud provider or bare metal

kind

OS version

$ uname -a
5.4.0-164-generic #181-Ubuntu SMP

rootfs, May 31 '24 14:05

here is the explanation

rootfs, Jun 01 '24 23:06

I will take this up.

sthaha, Jun 02 '24 23:06

See: https://github.com/sustainable-computing-io/kepler/pull/1398

With these changes applied, the minimum supported kernel version for Kepler is 5.12 due to:

- bpf_perf_event_read_value, which is available in tracepoint contexts in 5.12
- bpf fentry/fexit programs, which were added in 5.11

I think this is a pretty reasonable trade-off if you read the man page of bpf_perf_event_read_value.

If you really want 5.4 then we can discuss that as it's not trivial.
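
A minimal sketch of a pre-flight check against the 5.12 minimum stated above, assuming the kernel release string has the usual "MAJOR.MINOR..." prefix (as in 5.4.0-164-generic from this issue). The names `parse_kernel_release`, `kernel_supported`, and `MIN_KERNEL` are hypothetical and are not part of Kepler. A version comparison is only a heuristic, since distribution kernels may backport features.

```python
import platform
import re

# Minimum kernel version taken from the comment above; adjust if it changes.
MIN_KERNEL = (5, 12)


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '5.4.0-164-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_supported(release, minimum=MIN_KERNEL):
    """Return True if the release is at or above the minimum (major, minor)."""
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r` on Linux.
    release = platform.release()
    try:
        status = "meets" if kernel_supported(release) else "is below"
        print(f"kernel {release} {status} the {MIN_KERNEL[0]}.{MIN_KERNEL[1]} minimum")
    except ValueError as err:
        print(f"could not check kernel version: {err}")
```

For the host in this issue, `kernel_supported("5.4.0-164-generic")` evaluates to False, which matches the load failure reported on 5.4.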

dave-tucker, Jun 03 '24 10:06

The only reference to kernel requirements I could find in the documentation appears out of date: https://sustainable-computing.io/installation/strategy/. I raised an issue (#1866) for a failure caused by running kernel v5.4 on hosts.

Please clearly document these requirements on a release-by-release basis.

Robbie558, May 20 '25 08:05