
Can't get GPU metrics

Open mohamedosama113 opened this issue 7 months ago • 2 comments

I deployed Kepler with the Helm chart on GKE Ubuntu nodes, and these are the logs from the Kepler pod:

WARNING: failed to read int from file: open /sys/devices/system/cpu/cpu0/online: no such file or directory
I0603 21:11:19.492830       1 dcgm.go:70] Initializing dcgm Successful
E0603 21:11:19.492950       1 device.go:103] Device with type NVML doesn't exist
I0603 21:11:19.492985       1 dcgm.go:73] Using DCGM to obtain processor power
I0603 21:11:19.493087       1 exporter.go:103] Kepler running on version: v0.7.12-dirty
I0603 21:11:19.493168       1 config.go:293] using gCgroup ID in the BPF program: true
I0603 21:11:19.493224       1 config.go:295] kernel version: 6.8
I0603 21:11:19.493295       1 rapl_msr_util.go:129] failed to open path /dev/cpu/0/msr: no such file or directory
I0603 21:11:19.493354       1 power.go:78] Unable to obtain power, use estimate method
I0603 21:11:19.493374       1 redfish.go:169] failed to get redfish credential file path
I0603 21:11:19.494274       1 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?
I0603 21:11:19.494300       1 power.go:79] using none to obtain power
I0603 21:11:19.535542       1 dcgm.go:451] Created device group "dev-grp-2025-06-03-21-11-19"
E0603 21:11:19.549839       1 dcgm.go:120] failed to set up watcher: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:19.554525       1 dcgm.go:85] failed to StartupDevice: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:19.554568       1 accelerator.go:161] Could not init the GPU device going to try again
I0603 21:11:25.581403       1 dcgm.go:451] Created device group "dev-grp-2025-06-03-21-11-25"
E0603 21:11:25.595984       1 dcgm.go:120] failed to set up watcher: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:25.601057       1 dcgm.go:85] failed to StartupDevice: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:25.601091       1 accelerator.go:161] Could not init the GPU device going to try again
I0603 21:11:31.627861       1 dcgm.go:451] Created device group "dev-grp-2025-06-03-21-11-31"
E0603 21:11:31.641282       1 dcgm.go:120] failed to set up watcher: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:31.646135       1 dcgm.go:85] failed to StartupDevice: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:31.646162       1 accelerator.go:161] Could not init the GPU device going to try again
I0603 21:11:37.672270       1 dcgm.go:451] Created device group "dev-grp-2025-06-03-21-11-37"
E0603 21:11:37.686624       1 dcgm.go:120] failed to set up watcher: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:37.691054       1 dcgm.go:85] failed to StartupDevice: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:37.691077       1 accelerator.go:161] Could not init the GPU device going to try again
I0603 21:11:43.722390       1 dcgm.go:451] Created device group "dev-grp-2025-06-03-21-11-43"
E0603 21:11:43.749124       1 dcgm.go:120] failed to set up watcher: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:43.755955       1 dcgm.go:85] failed to StartupDevice: failed to set up watcher, err Error watching fields: The third-party Profiling module returned an unrecoverable error
E0603 21:11:43.756006       1 accelerator.go:161] Could not init the GPU device going to try again

Kepler reports metrics with the DCGM source, but all of the values are zero. The DCGM exporter is running on the same system and works fine.
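To make the retry loop in the log easier to see, here is a minimal Python sketch that counts error-level lines per source location. It is not part of Kepler: the `summarize_errors` helper and the regular expression are hypothetical, and the pattern is only inferred from the log lines shown in this issue.

```python
import re
from collections import Counter

# Assumed line layout, inferred from the log above:
#   <severity><MMDD> <HH:MM:SS.micros> <thread id> <file>:<line>] <message>
LOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])\d{4} \d{2}:\d{2}:\d{2}\.\d{6}\s+\d+ "
    r"(?P<src>[\w.]+:\d+)\] (?P<msg>.*)$"
)


def summarize_errors(log_text):
    """Count error-level ("E") lines, keyed by source file and line."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("sev") == "E":
            counts[match.group("src")] += 1
    return counts
```

Feeding it the pod log (for example, the output of `kubectl logs` saved to a file) gives one counter entry per failing call site, which shows at a glance that the same three DCGM and accelerator errors repeat on every retry.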

Driver info:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0              91W / 400W |  38189MiB / 40960MiB |    100%      Default |
|                                         |                      |             Enabled* |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          Off | 00000000:00:05.0 Off |                    0 |
| N/A   38C    P0              81W / 400W |  38189MiB / 40960MiB |    100%      Default |
|                                         |                      |             Enabled* |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    426295      C   /opt/conda/bin/python                     38180MiB |
|    1   N/A  N/A    426296      C   /opt/conda/bin/python                     38180MiB |
+---------------------------------------------------------------------------------------+

values.yaml

daemonset:
  enabled: true
securityContext:
  privileged: true
image:
  repository: quay.io/sustainable_computing_io/kepler
  tag: "release-0.7.12-dcgm"
  pullPolicy: Always
service:
  type: ClusterIP
  port: 9102
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "9102"
extraEnvVars:
  KEPLER_LOG_LEVEL: "1"
  ENABLE_GPU: "true"
  ENABLE_QAT: false
  ENABLE_EBPF_CGROUPID: true
  EXPOSE_HW_COUNTER_METRICS: true
  EXPOSE_IRQ_COUNTER_METRICS: true
  EXPOSE_CGROUP_METRICS: true
  CGROUP_METRICS: '*'
  LD_LIBRARY_PATH: "/usr/local/nvidia/lib64"
  PATH: "/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
extraHostVolumes:
  - name: nvidia-install-dir-host
    hostPath: /home/kubernetes/bin/nvidia
  - name: vulkan-icd-mount
    hostPath: /home/kubernetes/bin/nvidia/vulkan/icd.d
extraVolumeMounts:
  - name: nvidia-install-dir-host
    mountPath: /usr/local/nvidia
    readOnly: true
  - name: vulkan-icd-mount
    mountPath: /etc/vulkan/icd.d
    readOnly: true

Note: when I use the v0.8.0-dcgm image, the log says that ENABLE_GPU is false even though I set it to true.

mohamedosama113, Jun 03 '25 11:06

When will GPU power be supported?

conquerorAlex, Jul 18 '25 02:07

@conquerorAlex I am sorry, I don't understand what you mean.

mohamedosama113, Jul 29 '25 19:07