dcgm-exporter

Metric about compute apps

Open onstring opened this issue 2 years ago • 2 comments

Is there an existing metric for the compute processes allocated to each GPU, or would it be worth adding one? Something like the following nvidia-smi output:

> nvidia-smi --query-compute-apps=gpu_uuid,name --format=csv
gpu_uuid, process_name
GPU-d0180485-9584-433c-6782-c335d5df2cb3, vgpu
GPU-777ead31-954e-837f-590f-6c4974d8e571, vgpu
GPU-777ead31-954e-837f-590f-6c4974d8e571, vgpu

onstring avatar Aug 19 '22 01:08 onstring

Hi @onstring,

There are no such metrics as of today. DCGM does not have fields with such information, but there is an API to collect information about running PIDs.

In what form would you want to see this information, and what would you use it for? I can imagine a metric with the total number of processes occupying a GPU, but I do not see how individual processes could be represented or used here. Could you elaborate?

nikkon-dev avatar Aug 19 '22 02:08 nikkon-dev

Our scenario is a cloud platform where, alongside the instances that use GPUs, many instances use only ordinary CPU compute resources. We would therefore like statistics on how many GPUs are occupied.

For example, from the nvidia-smi output above, we would like to know the number of processes (and perhaps the process names) on each GPU:

GPU-d0180485-9584-433c-6782-c335d5df2cb3, 1
GPU-777ead31-954e-837f-590f-6c4974d8e571, 2
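
Until such a metric exists, the per-GPU counts above could be derived outside of dcgm-exporter. The following is only a rough sketch, not anything dcgm-exporter provides: it parses the CSV printed by the nvidia-smi command shown earlier and counts rows per GPU UUID. The function names and the metric name `gpu_compute_process_count` are made up for illustration, and the Prometheus-style text output is an assumption about how one might want to consume the result.

```python
import csv
import io
import subprocess
from collections import Counter


def count_compute_apps(csv_text):
    """Count compute processes per GPU UUID from nvidia-smi CSV output."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [row for row in reader if row]  # drop blank lines
    # The first non-empty row is the header ("gpu_uuid, process_name").
    return Counter(row[0].strip() for row in rows[1:])


def to_prometheus(counts, metric="gpu_compute_process_count"):
    """Render the counts as Prometheus text lines (metric name is hypothetical)."""
    lines = [f"# TYPE {metric} gauge"]
    for gpu_uuid, n in sorted(counts.items()):
        lines.append(f'{metric}{{gpu_uuid="{gpu_uuid}"}} {n}')
    return "\n".join(lines)


def query_nvidia_smi():
    """Run the same nvidia-smi query as in the issue and return its stdout."""
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=gpu_uuid,name", "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Sample taken from the output quoted in this issue, so the sketch runs without a GPU.
SAMPLE = """gpu_uuid, process_name
GPU-d0180485-9584-433c-6782-c335d5df2cb3, vgpu
GPU-777ead31-954e-837f-590f-6c4974d8e571, vgpu
GPU-777ead31-954e-837f-590f-6c4974d8e571, vgpu
"""

counts = count_compute_apps(SAMPLE)
print(to_prometheus(counts))
```

On a real host, `count_compute_apps(query_nvidia_smi())` would replace the sample. Note that GPUs with no compute process do not appear in this nvidia-smi output at all, so idle GPUs would need to be enumerated separately if a zero value is wanted.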

onstring avatar Aug 19 '22 04:08 onstring