
No value reported by Kepler latest on OpenShift

vprashar2929 opened this issue 1 year ago • 7 comments

What happened?

When Kepler is deployed on OpenShift 4.15 (VM/BM) using the latest image, it does not produce any metric values: all the exported metrics are 0.

Some screenshots for reference:

Screenshot 2024-05-03 at 4 55 34 PM

Screenshot 2024-05-03 at 4 58 54 PM

What did you expect to happen?

Kepler should report energy values

How can we reproduce it (as minimally and precisely as possible)?

Deploy Kepler on OpenShift (VM/BM) using the latest image and then check the metrics it exports.

Anything else we need to know?

No response

Kepler image tag

latest

Kubernetes version

$ kubectl version
# paste output here

Cloud provider or bare metal

VM

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Kepler deployment config

For Kubernetes:

$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE} 
# paste output here

# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE} 

For standalone:

put your Kepler command argument here

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

vprashar2929 · May 03 '24 11:05

Attaching Kepler logs for reference:

Kepler on BM: kepler-bm.log

Kepler on VM: kepler.log

vprashar2929 · May 06 '24 08:05

cc: @rootfs

vprashar2929 · May 07 '24 06:05

Additional information from the Kepler logs: the BPF program load fails in the verifier with "invalid indirect read from stack R3 off -144+0 size 8".

libbpf: prog 'kepler_sched_switch_trace': BPF program load failed: Permission denied
libbpf: prog 'kepler_sched_switch_trace': -- BEGIN PROG LOAD LOG --
R1 type=ctx expected=fp
0: R1=ctx(off=0,imm=0) R10=fp0
; if (SAMPLE_RATE > 0) {
0: (18) r2 = 0xffff8f7ec5a8bb10       ; R2_w=map_value(off=0,ks=4,vs=12,imm=0)
2: (61) r2 = *(u32 *)(r2 +0)          ; R2_w=0
3: (67) r2 <<= 32                     ; R2_w=0
4: (c7) r2 s>>= 32                    ; R2_w=0
5: (b7) r4 = 1                        ; R4_w=1
; if (SAMPLE_RATE > 0) {
6: (6d) if r4 s> r2 goto pc+13        ; R2_w=P0 R4_w=P1
; prev_pid = ctx->prev_pid;
20: (61) r7 = *(u32 *)(r1 +24)        ; R1=ctx(off=0,imm=0) R7_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
; prev_pid = ctx->prev_pid;
21: (63) *(u32 *)(r10 -20) = r7       ; R7_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0 fp-24=mmmm????
; pid_tgid = bpf_get_current_pid_tgid();
22: (85) call bpf_get_current_pid_tgid#14     ; R0=scalar()
23: (bf) r6 = r0                      ; R0=scalar(id=1) R6_w=scalar(id=1)
; cur_pid = pid_tgid & 0xffffffff;
24: (63) *(u32 *)(r10 -28) = r6       ; R6_w=scalar(id=1) R10=fp0 fp-32=mmmm????
; cgroup_id = bpf_get_current_cgroup_id();
25: (85) call bpf_get_current_cgroup_id#80    ; R0_w=scalar()
26: (7b) *(u64 *)(r10 -152) = r0      ; R0_w=scalar() R10=fp0 fp-152_w=mmmmmmmm
; cpu_id = bpf_get_smp_processor_id();
27: (85) call bpf_get_smp_processor_id#8      ; R0_w=scalar()
28: (bf) r9 = r0                      ; R0_w=scalar(id=2) R9_w=scalar(id=2)
; cpu_id = bpf_get_smp_processor_id();
29: (63) *(u32 *)(r10 -24) = r9       ; R9_w=scalar(id=2) R10=fp0 fp-24=mmmmmmmm
; cur_ts = bpf_ktime_get_ns();
30: (85) call bpf_ktime_get_ns#5      ; R0=scalar()
31: (bf) r8 = r0                      ; R0=scalar(id=3) R8_w=scalar(id=3)
; error = bpf_perf_event_read(&cpu_cycles_event_reader, *cpu_id);
32: (67) r9 <<= 32                    ; R9_w=scalar(smax=9223372032559808512,umax=18446744069414584320,var_off=(0x0; 0xffffffff00000000),s32_min=0,s32_max=0,u32_max=0)
33: (77) r9 >>= 32                    ; R9_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
; error = bpf_perf_event_read(&cpu_cycles_event_reader, *cpu_id);
34: (18) r1 = 0xffff8f7ec5a8be00      ; R1_w=map_ptr(off=0,ks=4,vs=4,imm=0)
36: (bf) r2 = r9                      ; R2_w=scalar(id=4,umax=4294967295,var_off=(0x0; 0xffffffff)) R9_w=scalar(id=4,umax=4294967295,var_off=(0x0; 0xffffffff))
37: (85) call bpf_perf_event_read#22          ; R0_w=scalar()
38: (18) r1 = 0x80000000              ; R1_w=2147483648
; if (error < 0) {
40: (5f) r0 &= r1                     ; R0_w=scalar(umax=2147483648,var_off=(0x0; 0x80000000),s32_max=0) R1_w=2147483648
; if (error < 0) {
41: (55) if r0 != 0x0 goto pc+14      ; R0_w=0
42: (bf) r9 = r10                     ; R9_w=fp0 R10=fp0
; 
43: (07) r9 += -24                    ; R9_w=fp-24
; prev_val = bpf_map_lookup_elem(&cpu_cycles, cpu_id);
44: (18) r1 = 0xffff8f7ec5a89800      ; R1_w=map_ptr(off=0,ks=4,vs=8,imm=0)
46: (bf) r2 = r9                      ; R2_w=fp-24 R9_w=fp-24
47: (85) call bpf_map_lookup_elem#1   ; R0=map_value_or_null(id=5,off=0,ks=4,vs=8,imm=0)
48: (bf) r3 = r10                     ; R3_w=fp0 R10=fp0
; 
49: (07) r3 += -144                   ; R3_w=fp-144
; bpf_map_update_elem(&cpu_cycles, cpu_id, &val, BPF_ANY);
50: (18) r1 = 0xffff8f7ec5a89800      ; R1_w=map_ptr(off=0,ks=4,vs=8,imm=0)
52: (bf) r2 = r9                      ; R2_w=fp-24 R9=fp-24
53: (b7) r4 = 0                       ; R4_w=0
54: (85) call bpf_map_update_elem#2
invalid indirect read from stack R3 off -144+0 size 8
processed 37 insns (limit 1000000) max_states_per_insn 0 total_states 3 peak_states 3 mark_read 2
-- END PROG LOAD LOG --
libbpf: prog 'kepler_sched_switch_trace': failed to load: -13
libbpf: failed to load object '/var/lib/kepler/bpfassets/kepler.bpfel.o'
libbpf: prog 'kepler_sched_switch_trace': can't attach BPF program w/o FD (did you load it?)
libbpf: prog 'kepler_sched_switch_trace': failed to attach to tracepoint 'sched/sched_switch': Invalid argument
I0507 07:06:05.224958   53215 libbpf_attacher.go:136] failed to attach tracepoint/sched/sched_switch: failed to attach program: invalid argument
libbpf: prog 'kepler_irq_trace': can't attach BPF program w/o FD (did you load it?)
libbpf: prog 'kepler_irq_trace': failed to attach to tracepoint 'irq/softirq_entry': Invalid argument
W0507 07:06:05.246909   53215 libbpf_attacher.go:149] could not attach irq/softirq_entry: failed to attach tracepoint softirq_entry to program kepler_irq_trace: invalid argument
libbpf: prog 'kepler_write_page_trace': can't attach BPF program w/o FD (did you load it?)
libbpf: prog 'kepler_write_page_trace': failed to attach to tracepoint 'writeback/writeback_dirty_folio': Invalid argument
W0507 07:06:05.262985   53215 libbpf_attacher.go:162] failed to attach tp/writeback/writeback_dirty_folio: failed to attach tracepoint writeback_dirty_folio to program kepler_write_page_trace: invalid argument. Kepler will not collect page cache write events. This will affect the DRAM power model estimation on VMs.
libbpf: prog 'kepler_read_page_trace': can't attach before loaded
W0507 07:06:05.262996   53215 libbpf_attacher.go:172] failed to attach fentry/mark_page_accessed: failed to attach program: invalid argument. Kepler will not collect page cache read events. This will affect the DRAM power model estimation on VMs.
W0507 07:06:05.263055   53215 libbpf_attacher.go:187] could not attach perf event cpu_instructions_event_reader: failed to update bpf map: failed to update map cpu_instructions_event_reader: bad file descriptor. Are you using a VM?
W0507 07:06:05.263112   53215 libbpf_attacher.go:187] could not attach perf event cache_miss_event_reader: failed to update bpf map: failed to update map cache_miss_event_reader: bad file descriptor. Are you using a VM?
W0507 07:06:05.263120   53215 libbpf_attacher.go:187] could not attach perf event task_clock_ms_event_reader: failed to update bpf map: failed to update map task_clock_ms_event_reader: bad file descriptor. Are you using a VM?
W0507 07:06:05.263130   53215 libbpf_attacher.go:187] could not attach perf event cpu_cycles_event_reader: failed to update bpf map: failed to update map cpu_cycles_event_reader: bad file descriptor. Are you using a VM?
I0507 07:06:05.263133   53215 libbpf_attacher.go:195] Successfully load eBPF module from libbpf object

vimalk78 · May 07 '24 07:05

static inline u64 get_on_cpu_cycles(u32 *cpu_id)
{       
        u64 delta, val, *prev_val;
        struct bpf_perf_event_value c;
        int error;

        // TODO: Fix Verifier errors upon changing this to bpf_perf_event_read_value
        error = bpf_perf_event_read(&cpu_cycles_event_reader, *cpu_id);

The helper signature is u64 bpf_perf_event_read(struct bpf_map *map, u64 flags). The second argument needs a u64, but we are passing *cpu_id, which is a u32. Is this the reason the verifier fails?
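(As a side note, in C the u32 value of *cpu_id is implicitly promoted to the helper's u64 parameter at the call site, so the type difference alone is usually not something the verifier rejects. If someone wanted to rule it out, an explicit cast is a one-line experiment — purely illustrative, not a proposed fix:)

        // Hypothetical experiment: make the u32 -> u64 promotion explicit.
        // Semantically identical to the original call in C.
        error = bpf_perf_event_read(&cpu_cycles_event_reader, (u64)*cpu_id);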

Cc: @dave-tucker

vimalk78 · May 07 '24 08:05

@vimalk78 sorry, the verifier error mentioned in that TODO is not the same error as the one in this issue. I'll open an issue to add some more context around that one.

In this case the verifier error is coming from here: https://github.com/sustainable-computing-io/kepler/blob/main/bpfassets/libbpf/src/kepler.bpf.c#L139

My best guess is that there is some mismatch between the value of cpu_id (i.e. the map key) and the number of entries in the map. Some logging around CPU detection and map resizing might help narrow this down further. In the meantime I'm going to try to reproduce this today.
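For readers who hit the same message: "invalid indirect read from stack R3 off -144+0 size 8" generally means a helper was handed a pointer to a stack slot that the verifier cannot prove was ever written. In the verifier log above, R3 points at fp-144 (the local val) when bpf_map_update_elem is called, yet no store to that slot appears on the path shown, and newer kernels are more permissive about uninitialized stack reads, which would explain why the same object loads fine elsewhere. The sketch below is reconstructed from the verifier log rather than the Kepler source, and the zero-initialization (plus assigning the read counter into val, which the log suggests never happens) is the generic remedy for this class of error, not necessarily the change made in the actual fix:

static inline u64 get_on_cpu_cycles(u32 *cpu_id)
{
        // Zero-initialize locals so the stack slot behind &val is always a
        // readable region when it is passed to bpf_map_update_elem.
        u64 delta = 0, val = 0, *prev_val;
        long error;

        error = bpf_perf_event_read(&cpu_cycles_event_reader, *cpu_id);
        if (error < 0)
                return delta;
        val = (u64)error; // counter value returned by the helper on success

        prev_val = bpf_map_lookup_elem(&cpu_cycles, cpu_id);
        if (prev_val)
                delta = val - *prev_val;
        bpf_map_update_elem(&cpu_cycles, cpu_id, &val, BPF_ANY);
        return delta;
}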

dave-tucker · May 07 '24 08:05

I tested loading the Kepler probes from main on both RHEL 9.2 (kernel 5.14) and RHEL 9.4, but I was unable to reproduce this error.

dave-tucker · May 07 '24 11:05

@vimalk78 I've tested this again on RHEL 9.2 and I was able to reproduce it. The fix is in #1407.

dave-tucker · May 08 '24 12:05