wrong kepler version info printed in logs

Open vimalk78 opened this issue 1 year ago • 1 comments

What happened?

Run the following command, then start `kepler` inside the container:

$ podman run -it --rm --entrypoint /bin/bash quay.io/sustainable_computing_io/kepler:latest-dcgm
[root@614b25af39dd /]# kepler 
I0227 06:33:48.397548      12 gpu.go:47] Trying to initialize GPU collector using dcgm
I0227 06:33:53.375510      12 gpu.go:54] Error initializing dcgm: not able to connect to DCGM localhost:5555: Error connecting to nv-hostengine: Host engine connection invalid/disconnected
I0227 06:33:53.375523      12 gpu.go:47] Trying to initialize GPU collector using nvidia-nvml
I0227 06:33:53.375626      12 gpu.go:54] Error initializing nvidia-nvml: failed to init nvml. ERROR_LIBRARY_NOT_FOUND
I0227 06:33:53.375630      12 gpu.go:47] Trying to initialize GPU collector using dummy
I0227 06:33:53.375632      12 gpu.go:51] Using dummy to obtain gpu power
I0227 06:33:53.378948      12 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0227 06:33:53.383307      12 exporter.go:155] Kepler running on version: 1.20.10

The last log line reports the wrong version:

I0227 06:33:53.383307      12 exporter.go:155] Kepler running on version: 1.20.10

What did you expect to happen?

When Kepler is built locally, the same log line shows the correct version:

 $ ./_output/bin/kepler 
I0227 11:59:36.264833 1109112 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0227 11:59:36.270717 1109112 gpu.go:47] Trying to initialize GPU collector using dcgm
I0227 11:59:36.270763 1109112 gpu.go:54] Error initializing dcgm: not able to connect to DCGM localhost:5555: libdcgm.so not Found
I0227 11:59:36.270767 1109112 gpu.go:47] Trying to initialize GPU collector using nvidia-nvml
I0227 11:59:36.270803 1109112 gpu.go:54] Error initializing nvidia-nvml: failed to init nvml. ERROR_LIBRARY_NOT_FOUND
I0227 11:59:36.270806 1109112 gpu.go:47] Trying to initialize GPU collector using dummy
I0227 11:59:36.270809 1109112 gpu.go:51] Using dummy to obtain gpu power
I0227 11:59:36.271701 1109112 exporter.go:155] Kepler running on version: v0.7.2-110-gc4b8791d-dirty

How can we reproduce it (as minimally and precisely as possible)?

Run the container image as shown above.

Anything else we need to know?

No response

Kepler image tag

latest, latest-dcgm

Kubernetes version

not dependent on kubernetes version

Cloud provider or bare metal

not dependent

vimalk78, Feb 27 '24 07:02

The Makefile picks up the current version from the git repo, so perhaps during the build workflows the repo is something other than the Kepler repo?

vimalk78, Feb 27 '24 07:02
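
One way to test this hypothesis in a build workflow is sketched below. The locally built binary reports `v0.7.2-110-gc4b8791d-dirty`, which has the shape that `git describe --tags --dirty` produces, but the exact command in Kepler's Makefile is not shown in this issue. Both helper names (`describe_version`, `is_tag_version`) are hypothetical, and the check assumes that release tags start with `v` followed by a digit, as in the local output above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not taken from Kepler's Makefile.

# Print a version string for the repository in directory $1.
# Falls back to "unknown" when git is missing or $1 is not a git work tree.
describe_version() {
    git -C "$1" describe --tags --always --dirty 2>/dev/null || echo "unknown"
}

# Succeed only if $1 looks like a v-prefixed tag version,
# e.g. v0.7.2-110-gc4b8791d-dirty. A bare 1.20.10 is rejected.
# Assumption: release tags start with "v" followed by a digit.
is_tag_version() {
    case "$1" in
        v[0-9]*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A workflow step could call `describe_version .` just before building the image and fail early when `is_tag_version` rejects the result, which would show whether the build runs against a checkout that lacks the Kepler tags.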