Kepler not reporting correct process name in metrics
What happened?
When the latest Kepler is deployed on a machine, it currently reports the wrong process name in the exported metrics.
Attaching some screenshots for reference:
- Actual process name and PID running on the system:
ps -ef | grep 75577
qemu 75577 1 8 Apr15 ? 01:10:07 /usr/bin/qemu-system-x86_64 -name guest=fedora39,debug-threads=on -S
Output from pstree command:
pstree -p | grep qemu
|-qemu-system-x86(75577)-+-{qemu-system-x86}(75605)
| |-{qemu-system-x86}(75617)
| |-{qemu-system-x86}(75618)
| |-{qemu-system-x86}(75619)
| |-{qemu-system-x86}(75620)
| |-{qemu-system-x86}(75622)
| |-{qemu-system-x86}(109718)
| |-{qemu-system-x86}(109719)
| |-{qemu-system-x86}(109720)
| `-{qemu-system-x86}(109721)
- Value reported by kepler_process_platform_joules_total for PID 75577 is command="CPU 0/KVM", which is wrong.
What did you expect to happen?
Kepler should report the correct command name in the metrics that it exports.
How can we reproduce it (as minimally and precisely as possible)?
Run Kepler either on Kubernetes or using the docker-compose locally which is present here: https://github.com/sustainable-computing-io/kepler/tree/main/hackdocker-compose
Anything else we need to know?
No response
I know why this is 🎉 See: https://github.com/sustainable-computing-io/kepler/blob/main/bpfassets/libbpf/src/kepler.bpf.c#L247C3-L247C23
As @vimalk78 found out, from eBPF we record the:
- PID (as seen by the kernel)
- TGID (as seen by the kernel)
- Comm
From the perspective of userland, the PID is actually what the kernel calls the TGID - you'll notice that we accidentally on-purpose switch the order of these fields in the definition of the struct: https://github.com/sustainable-computing-io/kepler/blob/main/pkg/bpf/types.go#L49-L50
TL;DR: the comm that we record belongs to the pid (as the kernel sees it, not as userland sees it), so you will indeed get values like CPU 0/KVM.
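The mismatch can be seen from userland through procfs. There, /proc/<tgid>/comm holds the name of the thread-group leader, and /proc/<tgid>/task/<tid>/comm holds each thread's own name. The following is a minimal sketch for inspecting both; the function names and the `proc_root` parameter are illustrative and not part of Kepler.

```python
from pathlib import Path


def read_comm(path):
    """Return the comm stored in a procfs comm file, without the trailing newline."""
    return Path(path).read_text().rstrip("\n")


def thread_comms(tgid, proc_root="/proc"):
    """Map each thread ID (the kernel's PID) of a process to that thread's comm.

    `tgid` is what userland tools such as ps call the PID.
    """
    task_dir = Path(proc_root) / str(tgid) / "task"
    return {
        int(entry.name): read_comm(entry / "comm")
        for entry in task_dir.iterdir()
        if entry.name.isdigit()
    }


def process_comm(tgid, proc_root="/proc"):
    """Return the comm of the thread-group leader, i.e. the process name."""
    return read_comm(Path(proc_root) / str(tgid) / "comm")
```

Calling `thread_comms(75577)` on the affected host should list the QEMU threads with their individual names, while `process_comm(75577)` returns the leader's name. The kernel limits comm to 15 visible characters, which is why pstree shows qemu-system-x86 rather than qemu-system-x86_64.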
I think the fix required here is going to be either:
- Don't record the comm from eBPF and look it up from procfs instead
- Only set the comm if pid == tgid
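As a rough illustration of how the two options could fit together, the sketch below keeps the eBPF comm when the sample comes from the thread-group leader and otherwise looks the name up in procfs. This is not Kepler's implementation: `resolve_comm`, its parameters, and the fallback when the procfs entry is missing are all assumptions made for the example.

```python
from pathlib import Path


def resolve_comm(pid, tgid, bpf_comm, proc_root="/proc"):
    """Pick the command name to export for a sample captured in eBPF.

    pid and tgid use the kernel's meaning: pid is the thread ID,
    tgid is the process ID that userland tools display.
    """
    # Second option: the sample came from the thread-group leader,
    # so its comm is already the process name.
    if pid == tgid:
        return bpf_comm
    # First option: the sample came from another thread, so read the
    # leader's comm from procfs instead of trusting the thread's comm.
    try:
        return (Path(proc_root) / str(tgid) / "comm").read_text().rstrip("\n")
    except OSError:
        # Assumed fallback: the process may have exited already,
        # so the thread's comm is the only name available.
        return bpf_comm
```

With the values from the report, a sample with pid 75619, tgid 75577 and comm "CPU 0/KVM" would resolve to the name stored in /proc/75577/comm rather than the vCPU thread's name.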
I'm going to try and verify this theory on my development machine at some point later this week.
@vprashar2929 is this still an issue?
Ref: https://github.com/sustainable-computing-io/kepler/issues/1640
closing as the issue is addressed and fixed
Reopening the issue, as the latest Kepler still reports an incorrect process name:
What is the expected process name in the above test?
❯ pstree -p | grep qemu
|-qemu-system-x86(110356)-+-{qemu-system-x86}(110367)
| |-{qemu-system-x86}(110370)
| |-{qemu-system-x86}(110371)
| |-{qemu-system-x86}(110372)
| |-{qemu-system-x86}(110373)
| |-{qemu-system-x86}(110374)
| |-{qemu-system-x86}(110375)
| |-{qemu-system-x86}(110377)
| `-{qemu-system-x86}(2178213)