dcgm-exporter Metric about compute apps

Open onstring opened this issue 2 years ago • 2 comments

Is there an existing metric for the compute processes allocated to each GPU, or would it be worthwhile to add one? I mean something like the following nvidia-smi output:

> nvidia-smi --query-compute-apps=gpu_uuid,name --format=csv
gpu_uuid, process_name
GPU-d0180485-9584-433c-6782-c335d5df2cb3, vgpu
GPU-777ead31-954e-837f-590f-6c4974d8e571, vgpu
GPU-777ead31-954e-837f-590f-6c4974d8e571, vgpu

Aug 19 '22 01:08 onstring

Hi @onstring,

There are no such metrics as of today. DCGM does not have fields with such information, but there is an API to collect information about running PIDs.

In what form would you want to see this information, and what would it be used for? I can imagine a metric with the total number of processes occupying a GPU, but I do not see how the individual processes could be represented or used here. Could you elaborate?

Aug 19 '22 02:08 nikkon-dev

The scenario is our cloud platform: besides the instances that use GPUs, we also have many instances that use only ordinary compute (CPU) resources. So we would like statistics on how many GPUs are occupied.

For example, from the nvidia-smi output above, we would like to know the number of processes (and maybe the process names) for each GPU instance:

GPU-d0180485-9584-433c-6782-c335d5df2cb3, 1
GPU-777ead31-954e-837f-590f-6c4974d8e571, 2
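
Until such a metric exists, the counts above could be derived outside dcgm-exporter. Here is a minimal sketch in Python, assuming the CSV text comes from the nvidia-smi command shown earlier. The function names and the metric name `gpu_compute_apps` are made up for illustration and are not part of dcgm-exporter or DCGM.

```python
import csv
import io
from collections import Counter

# Sample text copied from the nvidia-smi output earlier in this thread.
SAMPLE = """gpu_uuid, process_name
GPU-d0180485-9584-433c-6782-c335d5df2cb3, vgpu
GPU-777ead31-954e-837f-590f-6c4974d8e571, vgpu
GPU-777ead31-954e-837f-590f-6c4974d8e571, vgpu
"""


def count_compute_apps(csv_text):
    """Return {gpu_uuid: number of compute processes} from the CSV text."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [row for row in reader if row]  # drop blank lines
    # rows[0] is the header line ("gpu_uuid, process_name"); skip it.
    return dict(Counter(row[0].strip() for row in rows[1:]))


def to_prometheus_lines(counts, metric="gpu_compute_apps"):
    """Render the counts as Prometheus-style sample lines.

    The metric name is a hypothetical placeholder, not an existing
    dcgm-exporter metric.
    """
    return [f'{metric}{{UUID="{uuid}"}} {n}' for uuid, n in sorted(counts.items())]


if __name__ == "__main__":
    for line in to_prometheus_lines(count_compute_apps(SAMPLE)):
        print(line)
```

On a real host the CSV text would come from running the nvidia-smi command itself (for example through `subprocess.run`) instead of from `SAMPLE`. A GPU with no compute processes does not appear in that output at all, so a collector would need the full GPU list from elsewhere to report zeros.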

Aug 19 '22 04:08 onstring
