Vadym Fedorov
Performance metrics require exclusive access to the GPU hardware on the Turing architecture. If another pod is already reading them, the DCGM exporter cannot read the performance metrics.
@ThisIsQasim, can you share how you request GPU resources for your pods?
@ThisIsQasim, and are you using the GPU Operator?
Unfortunately, there is a known dcgm-exporter limitation: DCGM-Exporter does not support associating metrics with containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin.
Thank you for reporting the issue.
@krishh85, pod names and namespaces will be available when the pod runs a workload and uses the GPU. By default, the dcgm-exporter returns empty strings when it shows metrics read from the...
@krishh85, can you provide details on how you ran the tests and what the output was?
@krishh85, can you provide more details about your environment? What is your k8s-device-plugin (https://github.com/NVIDIA/k8s-device-plugin) configuration? I am especially interested in the MIG_STRATEGY setting: https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#configuration-option-details. Also, please run a shell...
@krishh85, thank you for the details. If you have access to the K8s node where you run the workload, can you try to build https://github.com/k8stopologyawareschedwg/podresourcesapi-tools/tree/main and run the client...
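For context, that client queries the kubelet PodResources API, which is also how the dcgm-exporter maps GPU devices to pods. Below is a minimal Go sketch of such a client, not the tool from the linked repository itself; it assumes the default kubelet socket path /var/lib/kubelet/pod-resources/kubelet.sock, which may differ on your nodes.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// Default kubelet pod-resources socket; adjust if your node uses a different path.
const socketPath = "/var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Connect to the kubelet PodResources gRPC endpoint over the unix socket.
	conn, err := grpc.DialContext(ctx, socketPath,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", addr)
		}),
	)
	if err != nil {
		log.Fatalf("cannot connect to kubelet pod-resources socket: %v", err)
	}
	defer conn.Close()

	// List all pods and the devices (e.g. nvidia.com/gpu) allocated to their containers.
	client := podresourcesapi.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("List call failed: %v", err)
	}

	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, dev := range container.GetDevices() {
				fmt.Printf("namespace=%s pod=%s container=%s resource=%s devices=%v\n",
					pod.GetNamespace(), pod.GetName(), container.GetName(),
					dev.GetResourceName(), dev.GetDeviceIds())
			}
		}
	}
}
```

If the dcgm-exporter shows empty pod and namespace labels, this kind of output helps confirm whether the kubelet is actually reporting the GPU devices allocated to the workload pod.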
@krishh85, thank you! It will take time, but we can reproduce it on our end.