kepler
Wrong Kepler version printed in logs
What happened?
Run the following command, then start kepler inside the container:
$ podman run -it --rm --entrypoint /bin/bash quay.io/sustainable_computing_io/kepler:latest-dcgm
[root@614b25af39dd /]# kepler
I0227 06:33:48.397548 12 gpu.go:47] Trying to initialize GPU collector using dcgm
I0227 06:33:53.375510 12 gpu.go:54] Error initializing dcgm: not able to connect to DCGM localhost:5555: Error connecting to nv-hostengine: Host engine connection invalid/disconnected
I0227 06:33:53.375523 12 gpu.go:47] Trying to initialize GPU collector using nvidia-nvml
I0227 06:33:53.375626 12 gpu.go:54] Error initializing nvidia-nvml: failed to init nvml. ERROR_LIBRARY_NOT_FOUND
I0227 06:33:53.375630 12 gpu.go:47] Trying to initialize GPU collector using dummy
I0227 06:33:53.375632 12 gpu.go:51] Using dummy to obtain gpu power
I0227 06:33:53.378948 12 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0227 06:33:53.383307 12 exporter.go:155] Kepler running on version: 1.20.10
The output contains the following log line, which reports the wrong version:
I0227 06:33:53.383307 12 exporter.go:155] Kepler running on version: 1.20.10
What did you expect to happen?
The image should report the actual Kepler version. When Kepler is built locally, it prints the correct version:
$ ./_output/bin/kepler
I0227 11:59:36.264833 1109112 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0227 11:59:36.270717 1109112 gpu.go:47] Trying to initialize GPU collector using dcgm
I0227 11:59:36.270763 1109112 gpu.go:54] Error initializing dcgm: not able to connect to DCGM localhost:5555: libdcgm.so not Found
I0227 11:59:36.270767 1109112 gpu.go:47] Trying to initialize GPU collector using nvidia-nvml
I0227 11:59:36.270803 1109112 gpu.go:54] Error initializing nvidia-nvml: failed to init nvml. ERROR_LIBRARY_NOT_FOUND
I0227 11:59:36.270806 1109112 gpu.go:47] Trying to initialize GPU collector using dummy
I0227 11:59:36.270809 1109112 gpu.go:51] Using dummy to obtain gpu power
I0227 11:59:36.271701 1109112 exporter.go:155] Kepler running on version: v0.7.2-110-gc4b8791d-dirty
How can we reproduce it (as minimally and precisely as possible)?
as shown above
Anything else we need to know?
No response
Kepler image tag
latest, latest-dcgm
Kubernetes version
Not dependent on the Kubernetes version
Cloud provider or bare metal
not dependent
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Install tools
Kepler deployment config
For on kubernetes:
$ KEPLER_NAMESPACE=kepler
# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
# paste output here
# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
For standalone:
put your Kepler command argument here
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)
The Makefile picks up the current version from the git repo, so perhaps during the build workflows the repo is something other than the Kepler repo?
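One way to test this hypothesis is to print the version that git would report in the directory the workflow builds from. The sketch below is only an illustration: it assumes the Makefile derives the version with something like `git describe` (the locally built string v0.7.2-110-gc4b8791d-dirty has that shape), and `derive_version` is a hypothetical helper name, not something taken from the Kepler repo.

```shell
#!/bin/sh
# Hypothetical helper: print the version string that a git-based build
# would likely derive for the checkout in directory $1 (default: current dir).
# Assumption: the build uses a command similar to `git describe`.
derive_version() {
    dir="${1:-.}"
    # On a tagged checkout this yields strings such as v0.7.2-110-gc4b8791d-dirty.
    # If the directory is not a git repo (or git is unavailable), fall back to "unknown".
    version=$(git -C "$dir" describe --tags --always --dirty 2>/dev/null) || version="unknown"
    printf '%s\n' "$version"
}

# Print the version for the current directory.
derive_version .
```

Running this as a step in the image build workflow, just before the build, would show whether the checkout there has the expected tags and history. If it prints "unknown" or a bare commit hash, the version is probably coming from somewhere other than the Kepler repo's tags.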