Can't get GPU metrics
I deployed Kepler with the Helm chart on GKE Ubuntu nodes. These are the logs of the Kepler pod:
WARNING: failed to read int from file: open /sys/devices/system/cpu/cpu0/online: no such file or directory
I0603 21:11:19.492830 1 dcgm.go:70] Initializing dcgm Successful
E0603 21:11:19.492950 1 device.go:103] Device with type NVML doesn't exist
I0603 21:11:19.492985 1 dcgm.go:73] Using DCGM to obtain processor power
I0603 21:11:19.493087 1 exporter.go:103] Kepler running on version: v0.7.12-dirty
I0603 21:11:19.493168 1 config.go:293] using gCgroup ID in the BPF program: true
I0603 21:11:19.493224 1 config.go:295] kernel version: 6.8
I0603 21:11:19.493295 1 rapl_msr_util.go:129] failed to open path /dev/cpu/0/msr: no such file or directory
I0603 21:11:19.493354 1 power.go:78] Unable to obtain power, use estimate method
I0603 21:11:19.493374 1 redfish.go:169] failed to get redfish credential file path
I0603 21:11:19.494274 1 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?
I0603 21:11:19.494300 1 power.go:79] using none to obtain power
I0603 21:11:19.535542 1 dcgm.go:451] Created device group "dev-grp-2025-06-03-21-11-19"
E0603 21:11:19.549839 1 dcgm.go:120] failed to set up watcher: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:19.554525 1 dcgm.go:85] failed to StartupDevice: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:19.554568 1 accelerator.go:161] Could not init the GPU device going to try again
I0603 21:11:25.581403 1 dcgm.go:451] Created device group "dev-grp-2025-06-03-21-11-25"
E0603 21:11:25.595984 1 dcgm.go:120] failed to set up watcher: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:25.601057 1 dcgm.go:85] failed to StartupDevice: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:25.601091 1 accelerator.go:161] Could not init the GPU device going to try again
I0603 21:11:31.627861 1 dcgm.go:451] Created device group "dev-grp-2025-06-03-21-11-31"
E0603 21:11:31.641282 1 dcgm.go:120] failed to set up watcher: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:31.646135 1 dcgm.go:85] failed to StartupDevice: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:31.646162 1 accelerator.go:161] Could not init the GPU device going to try again
I0603 21:11:37.672270 1 dcgm.go:451] Created device group "dev-grp-2025-06-03-21-11-37"
E0603 21:11:37.686624 1 dcgm.go:120] failed to set up watcher: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:37.691054 1 dcgm.go:85] failed to StartupDevice: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:37.691077 1 accelerator.go:161] Could not init the GPU device going to try again
I0603 21:11:43.722390 1 dcgm.go:451] Created device group "dev-grp-2025-06-03-21-11-43"
E0603 21:11:43.749124 1 dcgm.go:120] failed to set up watcher: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:43.755955 1 dcgm.go:85] failed to StartupDevice: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:43.756006 1 accelerator.go:161] Could not init the GPU device going to try again
Kepler reports metrics with the DCGM source, but all of them have zero values. The DCGM exporter is running on the same system and works fine.
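For anyone who wants to confirm the "all zero" symptom from the scrape output rather than by eye, here is a minimal sketch. It assumes the metrics text has been saved from the Kepler service port (9102 in my values) and that the samples carry a `source="dcgm"` label; the label name and the helper names `parse_samples` and `zero_only_metrics` are assumptions for illustration, not part of Kepler.

```python
import re

# One Prometheus text-format sample line: name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# One label pair inside the braces: key="value"
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def zero_only_metrics(samples, source="dcgm"):
    """Names of metrics whose samples with the given source label are all zero.

    The "source" label name is an assumption about how the exporter tags samples.
    """
    by_name = {}
    for name, labels, value in samples:
        if labels.get("source") == source:
            by_name.setdefault(name, []).append(value)
    return sorted(n for n, values in by_name.items() if all(v == 0 for v in values))
```

Feeding the saved scrape text to `parse_samples` and then calling `zero_only_metrics` lists every DCGM-sourced metric that never reports a non-zero value, which is what I see for all of them.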
Driver info:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0              91W / 400W |  38189MiB / 40960MiB |    100%      Default |
|                                         |                      |             Enabled* |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          Off | 00000000:00:05.0 Off |                    0 |
| N/A   38C    P0              81W / 400W |  38189MiB / 40960MiB |    100%      Default |
|                                         |                      |             Enabled* |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    426295      C   /opt/conda/bin/python                     38180MiB |
|    1   N/A  N/A    426296      C   /opt/conda/bin/python                     38180MiB |
+---------------------------------------------------------------------------------------+
values.yaml
daemonset:
  enabled: true
  securityContext:
    privileged: true
image:
  repository: quay.io/sustainable_computing_io/kepler
  tag: "release-0.7.12-dcgm"
  pullPolicy: Always
service:
  type: ClusterIP
  port: 9102
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "9102"
extraEnvVars:
  KEPLER_LOG_LEVEL: "1"
  ENABLE_GPU: "true"
  ENABLE_QAT: false
  ENABLE_EBPF_CGROUPID: true
  EXPOSE_HW_COUNTER_METRICS: true
  EXPOSE_IRQ_COUNTER_METRICS: true
  EXPOSE_CGROUP_METRICS: true
  CGROUP_METRICS: '*'
  LD_LIBRARY_PATH: "/usr/local/nvidia/lib64"
  PATH: "/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
extraHostVolumes:
  - name: nvidia-install-dir-host
    hostPath: /home/kubernetes/bin/nvidia
  - name: vulkan-icd-mount
    hostPath: /home/kubernetes/bin/nvidia/vulkan/icd.d
extraVolumeMounts:
  - name: nvidia-install-dir-host
    mountPath: /usr/local/nvidia
    readOnly: true
  - name: vulkan-icd-mount
    mountPath: /etc/vulkan/icd.d
    readOnly: true
Note: with the v0.8.0-dcgm image, the log reported ENABLE_GPU as false even though I had set it to true.
When will GPU power be supported?
@conquerorAlex Sorry, I don't understand what you mean.