Kepler not reporting correct process name in metrics

Open vprashar2929 opened this issue 1 year ago • 6 comments

What happened?

When Kepler is deployed on a machine using the latest image, it currently reports the wrong process name in the exported metrics.

Attaching some screenshots for reference:

  • Actual process name and PID running on the system:
ps -ef | grep 75577
qemu       75577       1  8 Apr15 ?        01:10:07 /usr/bin/qemu-system-x86_64 -name guest=fedora39,debug-threads=on -S 

Output from pstree command:

pstree -p | grep qemu
           |-qemu-system-x86(75577)-+-{qemu-system-x86}(75605)
           |                        |-{qemu-system-x86}(75617)
           |                        |-{qemu-system-x86}(75618)
           |                        |-{qemu-system-x86}(75619)
           |                        |-{qemu-system-x86}(75620)
           |                        |-{qemu-system-x86}(75622)
           |                        |-{qemu-system-x86}(109718)
           |                        |-{qemu-system-x86}(109719)
           |                        |-{qemu-system-x86}(109720)
           |                        `-{qemu-system-x86}(109721)
  • Value reported by kepler_process_platform_joules_total for PID 75577: command="CPU 0/KVM", which is wrong

Screenshot 2024-04-16 at 1 21 16 PM

What did you expect to happen?

Kepler should report the correct command name in the metrics that it exports.

How can we reproduce it (as minimally and precisely as possible)?

Run Kepler either on Kubernetes or locally using the docker-compose setup available here: https://github.com/sustainable-computing-io/kepler/tree/main/hackdocker-compose

Anything else we need to know?

No response

Kepler image tag

latest

vprashar2929 · Apr 16 '24 07:04

I know why this is 🎉 See: https://github.com/sustainable-computing-io/kepler/blob/main/bpfassets/libbpf/src/kepler.bpf.c#L247C3-L247C23

As @vimalk78 found out, from eBPF we record the:

  • PID (as seen by the kernel)
  • TGID (as seen by the kernel)
  • Comm

From the perspective of userland, the PID is actually what the kernel calls the TGID - you'll notice that we accidentally on-purpose switch the order of these fields in the definition of the struct: https://github.com/sustainable-computing-io/kepler/blob/main/pkg/bpf/types.go#L49-L50

TL;DR: the comm that we record belongs to the pid (as the kernel sees it, not as userland sees it), so you will indeed get values like CPU 0/KVM.
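
To make the naming mismatch concrete, here is a minimal Go sketch. It assumes the IDs arrive packed the way the `bpf_get_current_pid_tgid` helper returns them (tgid in the upper 32 bits, pid in the lower 32). The types `kernelIDs` and `userlandIDs` and the helper functions are hypothetical names for illustration, not Kepler's actual definitions.

```go
package main

import "fmt"

// kernelIDs uses the kernel's naming: pid identifies a single thread,
// tgid identifies the thread group (the whole process).
// Hypothetical type, not taken from Kepler.
type kernelIDs struct {
	pid  uint32
	tgid uint32
}

// userlandIDs uses the naming seen in ps and procfs.
// Hypothetical type, not taken from Kepler.
type userlandIDs struct {
	pid uint32 // what the kernel calls tgid
	tid uint32 // what the kernel calls pid
}

// splitPidTgid unpacks a value laid out as tgid<<32 | pid.
func splitPidTgid(v uint64) kernelIDs {
	return kernelIDs{pid: uint32(v), tgid: uint32(v >> 32)}
}

// toUserland renames the fields the way userland tools present them.
func toUserland(k kernelIDs) userlandIDs {
	return userlandIDs{pid: k.tgid, tid: k.pid}
}

// isGroupLeader reports whether the thread is the main thread of its
// process, the only thread whose comm is the process name.
func isGroupLeader(k kernelIDs) bool {
	return k.pid == k.tgid
}

func main() {
	// A sample taken while thread 75617 of process 75577 was running.
	packed := uint64(75577)<<32 | uint64(75617)
	k := splitPidTgid(packed)
	u := toUserland(k)
	fmt.Printf("kernel:   pid=%d tgid=%d\n", k.pid, k.tgid)
	fmt.Printf("userland: pid=%d tid=%d\n", u.pid, u.tid)
	fmt.Printf("group leader: %t\n", isGroupLeader(k))
}
```

A comm captured for this sample is the name of thread 75617, so labelling userland PID 75577 with it would produce exactly the kind of mismatch reported above.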

I think the fix required here is going to be either:

  1. Don't record the comm from eBPF and look it up from procfs instead
  2. Only set the comm if pid == tgid
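
Both options can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the fix that landed in Kepler: `sample`, `commFromProcfs`, and `recordComm` are hypothetical names, and `sample` is only a stand-in for one eBPF record using the kernel's naming.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// sample is a hypothetical stand-in for one eBPF record (kernel naming).
type sample struct {
	pid  uint32 // thread ID
	tgid uint32 // process ID as userland sees it
	comm string // comm of the thread that was running
}

// Option 1: ignore the comm from eBPF and read the process name from
// procfs. procRoot is normally "/proc"; it is a parameter so the lookup
// can be pointed at a different directory in tests.
func commFromProcfs(procRoot string, tgid uint32) (string, error) {
	path := filepath.Join(procRoot, strconv.FormatUint(uint64(tgid), 10), "comm")
	b, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return strings.TrimSuffix(string(b), "\n"), nil
}

// Option 2: only accept the comm when the sample comes from the thread
// group leader (pid == tgid). names is keyed by userland PID.
func recordComm(names map[uint32]string, s sample) {
	if s.pid == s.tgid {
		names[s.tgid] = s.comm
	}
}

func main() {
	names := map[uint32]string{}
	recordComm(names, sample{pid: 75617, tgid: 75577, comm: "CPU 0/KVM"})
	recordComm(names, sample{pid: 75577, tgid: 75577, comm: "qemu-system-x86"})
	fmt.Printf("option 2: pid 75577 is %q\n", names[75577])

	self := uint32(os.Getpid())
	comm, err := commFromProcfs("/proc", self)
	if err != nil {
		fmt.Println("option 1 failed:", err)
		return
	}
	fmt.Printf("option 1: pid %d is %q\n", self, comm)
}
```

One trade-off worth noting: with option 2 a process has no name until its leader thread happens to be sampled, while option 1 costs a file read and can fail if the process has already exited.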

I'm going to try and verify this theory on my development machine at some point later this week.

dave-tucker · May 21 '24 14:05

@vprashar2929 is this still an issue?

Ref: https://github.com/sustainable-computing-io/kepler/issues/1640

vimalk78 · Aug 29 '24 07:08

Closing, as the issue has been addressed and fixed.

vprashar2929 · Sep 02 '24 05:09

Reopening the issue, as the latest Kepler still reports an incorrect process name:

Screenshot 2024-09-13 at 2 25 23 PM

vprashar2929 · Sep 13 '24 08:09

What is the expected process name in the above test?

vimalk78 · Sep 13 '24 08:09

❯ pstree -p | grep qemu
           |-qemu-system-x86(110356)-+-{qemu-system-x86}(110367)
           |                         |-{qemu-system-x86}(110370)
           |                         |-{qemu-system-x86}(110371)
           |                         |-{qemu-system-x86}(110372)
           |                         |-{qemu-system-x86}(110373)
           |                         |-{qemu-system-x86}(110374)
           |                         |-{qemu-system-x86}(110375)
           |                         |-{qemu-system-x86}(110377)
           |                         `-{qemu-system-x86}(2178213)

vprashar2929 · Sep 13 '24 09:09