Not able to view GPU utilization metrics in OpenShift dashboard

Open umeshvw opened this issue 1 year ago • 7 comments

Environment:

OpenShift version: 4.16.10
nvidia-operator version: 24.6.1

Hello Team,

We are facing the issues below:

Issue 1:

In the Administrator perspective, we are not able to view several important metrics in the NVIDIA DCGM Exporter Dashboard, such as:

1: GPU utilization
2: GPU Framebuffer Mem Used
3: Tensor Core Utilization

We are able to view some metrics, such as GPU temperature, but the metrics above are much more important for us.

Issue 2: In the Developer perspective

We are not able to see any metrics in the NVIDIA DCGM Exporter Dashboard. We can see some metrics in the Administrator perspective but none in the Developer perspective. Is there a way to monitor GPU utilization per namespace, so that application teams can monitor GPU utilization in their own namespaces themselves?

Issue 3: In the Compute > GPU section, we are not able to see any real-time utilization data. The GPU utilization metric always shows 0%.

I am attaching screenshots for all the issues.

umeshvw avatar Sep 20 '24 12:09 umeshvw

[Screenshots: three images attached]

umeshvw avatar Sep 20 '24 12:09 umeshvw

Hello Team, any update on the above issue?

umeshvw avatar Sep 30 '24 13:09 umeshvw

Hello Nvidia Team,

Can someone please help with the above?

umeshvw avatar Oct 09 '24 07:10 umeshvw

Hi @shivamerla I hope you are doing well. Can you please help here or let us know if someone from your team can help with this issue? Many thanks

arpitsharma-vw avatar Oct 14 '24 06:10 arpitsharma-vw

@arpitsharma-vw was this working before, i.e. are you seeing it as a regression? Is the gpu-operator/dcgm-exporter configured with custom metrics or the default ones? I don't think there is any RBAC issue with the operator/dcgm-exporter itself, as the exporter uses the pod resources API, which provides metrics for all Pods using GPUs in all namespaces. Can you double-check whether the RBAC setup in developer mode allows scraping metrics in general? @cdesiniotis @tariq1890 can help debug further.

shivamerla avatar Oct 14 '24 19:10 shivamerla

@shivamerla I think it is a custom one, as shown below.

$ oc get daemonset nvidia-dcgm-exporter -o yaml | grep -i etc
value: /etc/dcgm-exporter/dcgm-metrics.csv

and below is the file

sh-5.1# cat /etc/dcgm-exporter/dcgm-metrics.csv
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, gpu utilization.
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, mem utilization.
DCGM_FI_DEV_ENC_UTIL, gauge, enc utilization.
DCGM_FI_DEV_DEC_UTIL, gauge, dec utilization.
DCGM_FI_DEV_POWER_USAGE, gauge, power usage.
DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX, gauge, power mgmt limit.
DCGM_FI_DEV_GPU_TEMP, gauge, gpu temp.
DCGM_FI_DEV_SM_CLOCK, gauge, sm clock.
DCGM_FI_DEV_MAX_SM_CLOCK, gauge, max sm clock.
DCGM_FI_DEV_MEM_CLOCK, gauge, mem clock.
DCGM_FI_DEV_MAX_MEM_CLOCK, gauge, max mem clock.
sh-5.1#

It looks like the DCGM_FI_DEV_GPU_UTIL metric is not included in the file above, although it is present in default-counters.csv. Please let us know if you need further details.
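
One way to confirm which metrics the custom file omits is to compare its metric names against the ones the empty dashboard panels appear to need. The sketch below is illustrative only: the CSV text is copied from the listing above, while the `REQUIRED` set is an assumption derived from the three panels reported as empty (using DCGM's field name `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE` for tensor activity). The helper names `exported_metrics` and `missing_metrics` are hypothetical.

```python
import csv
import io

# Contents of /etc/dcgm-exporter/dcgm-metrics.csv as pasted in this thread.
CUSTOM_CSV = """\
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, gpu utilization.
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, mem utilization.
DCGM_FI_DEV_ENC_UTIL, gauge, enc utilization.
DCGM_FI_DEV_DEC_UTIL, gauge, dec utilization.
DCGM_FI_DEV_POWER_USAGE, gauge, power usage.
DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX, gauge, power mgmt limit.
DCGM_FI_DEV_GPU_TEMP, gauge, gpu temp.
DCGM_FI_DEV_SM_CLOCK, gauge, sm clock.
DCGM_FI_DEV_MAX_SM_CLOCK, gauge, max sm clock.
DCGM_FI_DEV_MEM_CLOCK, gauge, mem clock.
DCGM_FI_DEV_MAX_MEM_CLOCK, gauge, max mem clock.
"""

# Assumption: the metrics behind the three empty panels. This set is a guess
# based on the panel names, not something confirmed by the dashboard source.
REQUIRED = {
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_FB_USED",
    "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
}


def exported_metrics(csv_text):
    """Return the set of metric names listed in a counters CSV."""
    names = set()
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and '#' comment lines.
        if not row or not row[0].strip() or row[0].lstrip().startswith("#"):
            continue
        names.add(row[0].strip())
    return names


def missing_metrics(csv_text, required=REQUIRED):
    """Return the required metric names that the CSV does not list, sorted."""
    return sorted(set(required) - exported_metrics(csv_text))


if __name__ == "__main__":
    for name in missing_metrics(CUSTOM_CSV):
        print("missing:", name)
```

With the file contents above, all three assumed names are reported as missing, which is consistent with the panels staying empty.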

umeshvw avatar Oct 21 '24 08:10 umeshvw

@shivamerla We are able to see the metrics after adding the metrics below to the ConfigMap (console-plugin-nvidia-gpu):

DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_FB_USED
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE

umeshvw avatar Oct 21 '24 12:10 umeshvw

The operator is not creating a ServiceMonitor resource for the exporter. It should create this:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: nvidia-dcgm-exporter
  name: nvidia-dcgm-exporter
  namespace: nvidia-gpu-operator
spec:
  endpoints:
  - path: /metrics
    port: gpu-metrics
  jobLabel: gpu-status
  namespaceSelector:
    matchNames:
    - nvidia-gpu-operator
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
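
For a ServiceMonitor like this to produce scrape targets, the Service has to sit in a namespace allowed by `namespaceSelector`, carry every label in `selector.matchLabels`, and expose a port whose name equals the endpoint's `port`. A minimal sketch of that matching logic follows. The `SERVICE` dictionary is hypothetical (its name, labels, and port are assumptions for illustration, not output from a real cluster), and `selection_problems` is a made-up helper, not part of any operator API.

```python
# Trimmed copy of the ServiceMonitor from this comment.
SERVICE_MONITOR = {
    "metadata": {"name": "nvidia-dcgm-exporter", "namespace": "nvidia-gpu-operator"},
    "spec": {
        "endpoints": [{"path": "/metrics", "port": "gpu-metrics"}],
        "namespaceSelector": {"matchNames": ["nvidia-gpu-operator"]},
        "selector": {"matchLabels": {"app": "nvidia-dcgm-exporter"}},
    },
}

# Hypothetical Service: name, labels, and port are assumptions for illustration.
SERVICE = {
    "metadata": {
        "name": "nvidia-dcgm-exporter",
        "namespace": "nvidia-gpu-operator",
        "labels": {"app": "nvidia-dcgm-exporter"},
    },
    "spec": {"ports": [{"name": "gpu-metrics", "port": 9400}]},
}


def selection_problems(monitor, service):
    """Return reasons why the monitor would not select the service."""
    problems = []
    spec = monitor["spec"]
    meta = service["metadata"]

    # Without a namespaceSelector, only the monitor's own namespace is searched.
    ns_selector = spec.get("namespaceSelector", {})
    if not ns_selector.get("any"):
        allowed = ns_selector.get("matchNames") or [monitor["metadata"]["namespace"]]
        if meta["namespace"] not in allowed:
            problems.append(f"namespace {meta['namespace']} is not selected")

    # Every matchLabels entry must be present on the Service with the same value.
    labels = meta.get("labels", {})
    for key, value in spec["selector"].get("matchLabels", {}).items():
        if labels.get(key) != value:
            problems.append(f"label {key}={value} is not set on the Service")

    # Each endpoint refers to a Service port by name.
    port_names = {p.get("name") for p in service["spec"].get("ports", [])}
    for endpoint in spec.get("endpoints", []):
        if endpoint.get("port") not in port_names:
            problems.append(f"no Service port named {endpoint.get('port')}")

    return problems


if __name__ == "__main__":
    for problem in selection_problems(SERVICE_MONITOR, SERVICE) or ["no problems found"]:
        print(problem)
```

Running the same check against the labels and port names of the actual exporter Service is a quick way to rule out a selector mismatch before looking at RBAC.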

paulczar avatar Dec 11 '24 21:12 paulczar

thx that solved my issue on ROKS in IBM Cloud with GPU nodes

gigderoma avatar Feb 12 '25 12:02 gigderoma

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Nov 04 '25 22:11 github-actions[bot]

Closing this as it looks like it was solved in this comment.

rajathagasthya avatar Nov 14 '25 16:11 rajathagasthya