
Per pod metrics not exposed with time-slicing enabled

Open ThisIsQasim opened this issue 10 months ago • 10 comments

What is the version?

3.3.5-3.4.1

What happened?

Metrics like DCGM_FI_PROF_GR_ENGINE_ACTIVE are only exposed for a single pod, even though multiple pods use the same GPU.

What did you expect to happen?

Metrics for all the pods using the GPU should be exposed.

What is the GPU model?

Tesla T4

What is the environment?

GKE

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

  • Enable time-slicing using device plugin
  • Deploy DCGM and dcgm-exporter
  • Deploy app that uses GPU
  • Check metrics (one way to do this is sketched below)
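
For reference, a minimal way to perform the last step, assuming dcgm-exporter is listening on its default port 9400 with Kubernetes pod labels enabled (the pod name is a placeholder):

kubectl port-forward <dcgm-exporter-pod> 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_PROF_GR_ENGINE_ACTIVE
# only one of the pods sharing the GPU shows up in the pod="..." label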

Anything else we need to know?

From the debug log

time="2024-04-05T13:49:04Z" level=debug msg="Device to pod mapping: map[nvidia0:{Name:gpu-pod-c69f6664f-vkkcb Namespace:default Container:extractor} nvidia0/vgpu0:{Name:gpu-pod-c69f6664f-vkkcb Namespace:default Container:extractor} nvidia0/vgpu1:{Name:gpu-pod-c69f6664f-2v922 Namespace:default Container:extractor} nvidia0/vgpu2:{Name:gpu-pod-c69f6664f-wrcxw Namespace:default Container:extractor} nvidia0/vgpu3:{Name:gpu-pod-c69f6664f-ffs8r Namespace:default Container:extractor}]"

ThisIsQasim avatar Apr 05 '24 13:04 ThisIsQasim

This appears to have been reported repeatedly: #151, #201, #222.

ThisIsQasim avatar Apr 05 '24 14:04 ThisIsQasim

Performance metrics require exclusive access to the GPU hardware on the Turing architecture. If another pod tries to read the performance metrics, the DCGM exporter cannot read them.

nvvfedorov avatar Apr 05 '24 16:04 nvvfedorov

There is only one pod per node trying to read the metrics, but there are multiple pods using the same GPU. The issue is that dcgm-exporter should report metrics for all the pods using the GPU.

ThisIsQasim avatar Apr 05 '24 17:04 ThisIsQasim

@ThisIsQasim, can you share how you request GPU resources for pods?

nvvfedorov avatar Apr 05 '24 18:04 nvvfedorov

Sure. A single GPU is advertised as multiple using the nvidia device plugin

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

and then GPUs are requested with the regular resource requests

resources:
  requests:
    cpu: 3600m
  limits:
    memory: 13000Mi
    nvidia.com/gpu: "1"

ThisIsQasim avatar Apr 05 '24 20:04 ThisIsQasim

@ThisIsQasim, and do you use the GPU Operator?

nvvfedorov avatar Apr 06 '24 01:04 nvvfedorov

I do not. It’s manually deployed.

ThisIsQasim avatar Apr 06 '24 08:04 ThisIsQasim

Unfortunately, there is a known dcgm-exporter limitation: DCGM-Exporter does not support associating metrics to containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin.

nvvfedorov avatar Apr 08 '24 14:04 nvvfedorov

Unfortunately, there is a known dcgm-exporter limitation: DCGM-Exporter does not support associating metrics to containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin.

Is there a known root-cause for this issue?


From what I've dug up:

Pods using time-sliced GPUs append a -<idx> to the end of their deviceIDs, like so:

&ContainerDevices{ResourceName:nvidia.com/gpu,DeviceIds:[GPU-51424525-5928-4e4c-2503-8ca3bca0b134-2],}

Thanks @larry-lu-lu (https://github.com/NVIDIA/dcgm-exporter/issues/201#issuecomment-1825284066).

Therefore when the deviceToPodMap is updated here, none of the pods using the GPU are associated with the base deviceID. Execution then reaches this loop and, because none of the pods in deviceToPod are associated with the baseID, dcgm-exporter totally skips the pod/namespace label and moves on.

Unfortunately there doesn't seem to be a quick fix. As far as I understand, the DCGM metrics we collect are associated with exactly one UUID. This is OK for MIGs because they will each have a unique UUID. But metrics on time-sliced GPUs will, if I'm not mistaken, have the UUID of the base device, without an index attached.
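
To make the failure mode concrete, here is a minimal sketch, not the actual dcgm-exporter code (PodInfo, normalizeDeviceID, and buildDeviceToPodMap are hypothetical names), of how the suffixed device IDs could be folded back onto the base UUID so that a metric keyed by the base UUID could be attributed to every pod sharing the GPU:

package main

import (
    "fmt"
    "regexp"
)

type PodInfo struct {
    Name, Namespace, Container string
}

// The device plugin appends a replica index ("-2") to the GPU UUID when
// time-slicing is enabled; DCGM reports metrics under the base UUID only.
var replicaSuffix = regexp.MustCompile(`-\d+$`)

func normalizeDeviceID(id string) string {
    return replicaSuffix.ReplaceAllString(id, "")
}

// buildDeviceToPodMap keys pods by the base UUID instead of the raw
// (suffixed) device ID, collecting every pod that shares the GPU.
func buildDeviceToPodMap(assignments map[string]PodInfo) map[string][]PodInfo {
    out := make(map[string][]PodInfo)
    for deviceID, pod := range assignments {
        base := normalizeDeviceID(deviceID)
        out[base] = append(out[base], pod)
    }
    return out
}

func main() {
    // Device IDs as reported for two pods sharing one time-sliced GPU.
    assignments := map[string]PodInfo{
        "GPU-51424525-5928-4e4c-2503-8ca3bca0b134-2": {"gpu-pod-c69f6664f-2v922", "default", "extractor"},
        "GPU-51424525-5928-4e4c-2503-8ca3bca0b134-3": {"gpu-pod-c69f6664f-wrcxw", "default", "extractor"},
    }
    // Both pods end up under the base UUID "GPU-51424525-...-8ca3bca0b134".
    fmt.Println(buildDeviceToPodMap(assignments))
}

Note that the trailing -\d+ heuristic is fragile (a UUID segment could in principle be purely numeric), so this is only meant to illustrate the mismatch, not to propose a patch.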

@nikkon-dev and others, forgive me for pinging, I would really like to know if my understanding is correct here.

svetly-todorov avatar May 17 '24 19:05 svetly-todorov

I understand that, for now, it will be the same for the new MPS support in the device plugin: per-pod metrics will not be shown, is that correct? Another question: if we have pods that are not requesting a GPU through the device plugin but are able to use the GPU due to some tricks (mounts, etc.), can they be reported to DCGM when they use the GPU?

ettelr avatar May 23 '24 05:05 ettelr