Vadym Fedorov
Performance metrics require exclusive access to the GPU hardware on the Turing architecture. If another pod is already reading them, the DCGM exporter cannot read the performance metrics.
@ThisIsQasim, can you share how you request GPU resources for your pods?
@ThisIsQasim, and are you using the GPU Operator?
Unfortunately, there is a known dcgm-exporter limitation: DCGM-Exporter does not support associating metrics with containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin.
Thank you for reporting the issue.
@krishh85, pod names and namespaces will be available when the pod runs a workload and uses the GPU. By default, the dcgm-exporter returns empty strings when it shows metrics read from the...
@krishh85, can you provide details on how you ran the tests and what the output was?
@krishh85, can you provide more details about your environment? What is your k8s-device-plugin (https://github.com/NVIDIA/k8s-device-plugin) configuration? I am especially interested in the MIG_STRATEGY setting: https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#configuration-option-details. Also, please run a shell...
@krishh85, thank you for the details. If you have access to the K8s node where you run the workload, can you try to build https://github.com/k8stopologyawareschedwg/podresourcesapi-tools/tree/main and run the client...
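For context, that client queries the kubelet PodResources API, which is also how the dcgm-exporter maps GPU devices to pods. Below is a minimal Go sketch of such a client, not the tool from the linked repository itself; it assumes the default kubelet socket path /var/lib/kubelet/pod-resources/kubelet.sock, which may differ on your nodes.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// Default kubelet pod-resources socket; adjust if your node uses a different path.
const socketPath = "/var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Connect to the kubelet PodResources gRPC endpoint over the unix socket.
	conn, err := grpc.DialContext(ctx, socketPath,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", addr)
		}),
	)
	if err != nil {
		log.Fatalf("cannot connect to kubelet pod-resources socket: %v", err)
	}
	defer conn.Close()

	// List all pods and the devices (e.g. nvidia.com/gpu) allocated to their containers.
	client := podresourcesapi.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("List call failed: %v", err)
	}

	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, dev := range container.GetDevices() {
				fmt.Printf("namespace=%s pod=%s container=%s resource=%s devices=%v\n",
					pod.GetNamespace(), pod.GetName(), container.GetName(),
					dev.GetResourceName(), dev.GetDeviceIds())
			}
		}
	}
}
```

If the dcgm-exporter shows empty pod and namespace labels, this kind of output helps confirm whether the kubelet is actually reporting the GPU devices allocated to the workload pod.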
@krishh85, thank you! It will take time, but we can reproduce it on our end.