
since dropping Fatalf to Infof if BPF not installed on host, make the bcc fileNotFound log/error message friendlier

Open sallyom opened this issue 2 years ago • 6 comments

When deploying kepler in Kube/VMs where the bcc libs are not found on the host instance, the panic is now handled and kepler keeps running. However, the log message is misleading: it suggests kepler is unhealthy, when in fact kepler is healthy and just cannot collect performance data (I think; correct me if I'm wrong!):

$ oc logs kepler-exporter-qhf78
perf_event_open: No such file or directory
I1011 15:58:35.677590       1 bcc_attacher.go:92] failed to attach perf event cache_miss: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory

Instead, could we log an info message, something like "BPF is not installed on host system, kepler is running but cannot collect data"?

sallyom avatar Oct 11 '22 17:10 sallyom

related to https://github.com/sustainable-computing-io/kepler/issues/282

sallyom avatar Oct 11 '22 17:10 sallyom

Should the kepler dashboard show all 0 values in this case? If not, I'm mixing up another (unknown) issue with this one, and apologies. In my env I have a healthy kepler/grafana deployment but all 0 values.

sallyom avatar Oct 11 '22 18:10 sallyom

It shouldn't show all 0. Which metrics are you seeing? Is it bare-metal or a VM? If it is bare-metal, are you sure that RAPL is accessible? In the logs, is bcc running?

marceloamaral avatar Oct 12 '22 05:10 marceloamaral

perf_event_open: No such file or directory
I1011 15:58:35.677590       1 bcc_attacher.go:92] failed to attach perf event cache_miss: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory

Is it because these are hardware counters, so they don't work well on a VM, but this doesn't impact the other metrics?

I saw those lines in my env too, but I think that with DNUM_CPUS the eBPF program still partly works?

I1014 01:02:30.302793       1 bcc_attacher.go:111] failed to attach perf module with options [-DNUM_CPUS=8 -DCPU_FREQ]: failed to load sched_switch: error loading BPF program: permission denied
perf_event_open: No such file or directory
I1014 01:02:32.891039       1 bcc_attacher.go:86] failed to attach perf event cpu_cycles: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I1014 01:02:33.011630       1 bcc_attacher.go:86] failed to attach perf event cpu_instr: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I1014 01:02:33.124947       1 bcc_attacher.go:86] failed to attach perf event cache_miss: failed to open bpf perf event: no such file or directory

jichenjc avatar Oct 14 '22 01:10 jichenjc

First of all, we definitely need to improve the log messages....

So this is the flow: Kepler first tries to attach a bpf program that collects both the hardware counters and CPU time. If collecting the CPU time fails for some reason (which of course we need to investigate), kepler attaches a program that collects only the hardware counters.

However, your logs say that none of the hardware counters could be accessed (so far we only have 3). This means that the bpf program is not running properly; in fact, it is not collecting any metrics.

marceloamaral avatar Oct 14 '22 01:10 marceloamaral

On VMs, the perf counters are not available, so the warnings are a result of this environment. It would help if we added that info to the log.

rootfs avatar Oct 14 '22 13:10 rootfs

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 17 '23 18:05 stale[bot]

Does anyone want to improve this?

marceloamaral avatar May 18 '23 05:05 marceloamaral

I suppose this issue should be handled by https://github.com/sustainable-computing-io/kepler/pull/300 and improved in the future with issue https://github.com/sustainable-computing-io/kepler/issues/716.

@sallyom Would you mind if I close this issue and keep only https://github.com/sustainable-computing-io/kepler/issues/716 for tracking?

sunya-ch avatar Jun 22 '23 03:06 sunya-ch

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 21 '23 05:08 stale[bot]