Zero values for MIG instances using dcgm-exporter.

Open Shadowphax opened this issue 3 years ago • 10 comments

Hi,

DCGM Version: 2.2.9
CUDA: 11.4
Driver: datacenter-gpu-manager-2.2.9-1.x86_64

We have recently purchased a Dell R750xa with 4x A100-40GB GPUs. I built the dcgm-exporter binary from source, and when running it I can obtain values for the parent GPU cards. However, all values are zero for the MIG instances, even though we have utilization on the MIG instances.

I have also noticed that not all the MIG profiles are listed.

Thank You !!

Shadowphax avatar Aug 23 '21 19:08 Shadowphax

Hi @Shadowphax,

Have you tried running dcgm-exporter with the --devices f or --devices i command-line argument? By default, dcgm-exporter will not monitor MIG instances. You need to either specify the devices explicitly or set "Flex" mode, in which dcgm-exporter monitors all MIG instances instead of all GPUs.

I made a pull request to improve the --devices documentation a bit: https://github.com/NVIDIA/dcgm-exporter/pull/4

In the current state, if you need both GPUs and MIG instances, you need to specify them explicitly in --devices g:0,g:1...,i:0,i:1,...

nikkon-dev avatar Aug 23 '21 23:08 nikkon-dev

Hi @nikkon-dev

Appreciate the feedback, thank you.

dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv --devices g:2,3,i:0
INFO[0000] Starting dcgm-exporter
FATA[0000] Invalid ranged device option 'g:2,3,i:0': there can only be one specified range

I am not sure whether I have the syntax correct. Also, MIG IDs (instance IDs) are required as input for "i". Are we meant to use the ID from nvidia-smi as the instance ID?

Shadowphax avatar Aug 24 '21 14:08 Shadowphax

@Shadowphax, you are right. In the current implementation, dcgm-exporter does not support mixing range types in one --devices argument, or more than one range in general. This limitation is on the dcgm-exporter side only, as DCGM itself supports such scenarios.

nikkon-dev avatar Aug 24 '21 21:08 nikkon-dev
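
Until that limitation is lifted, a small pre-flight check can catch a mixed argument before the exporter exits with the FATA message above. The following is only a sketch: the helper name check_devices_arg is hypothetical, and the grammar it accepts (a single type letter f, g, or i, optionally followed by a colon and a comma-separated list of IDs or ID ranges such as 0,2-4) is an assumption pieced together from this thread, not dcgm-exporter's actual parser.

```python
import re
from typing import Optional

# Assumed grammar (not taken from dcgm-exporter's source): "f", or "g"/"i"
# optionally followed by ":" and a comma-separated list of IDs or ID ranges.
_ID_LIST = r"\d+(?:-\d+)?(?:,\d+(?:-\d+)?)*"
_DEVICES_RE = re.compile(rf"^(?:f|[gi](?::{_ID_LIST})?)$")


def check_devices_arg(value: str) -> Optional[str]:
    """Return None if value looks acceptable, otherwise a short reason."""
    # Two or more colons mean two or more type prefixes, e.g. "g:2,3,i:0".
    if value.count(":") > 1:
        return "more than one type prefix; only one range is supported"
    if not _DEVICES_RE.match(value):
        return "not of the form f, g[:ids] or i[:ids]"
    return None


if __name__ == "__main__":
    for arg in ("g:2,3,i:0", "i:0", "g:0,1", "f"):
        print(arg, "->", check_devices_arg(arg) or "ok")
```

With this check, the argument from the failed run above (g:2,3,i:0) is rejected, while a single-type argument such as i:0 or g:0,1 passes.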

Hi @nikkon-dev

Could you provide guidance on how to monitor a single MIG instance with dcgm-exporter, especially to identify the correct MIG instance IDs? I've partitioned the cards appropriately. Is there something I am meant to do with dcgmi in order for dcgm-exporter to expose the metrics for MIG?

dcgmi group -l
+-------------------+----------------------------------------------------------+
| GROUPS                                                                       |
| 2 groups found.                                                              |
+===================+==========================================================+
| Groups            |                                                          |
| -> 2              |                                                          |
|    -> Group ID    | 2                                                        |
|    -> Group Name  | card0                                                    |
|    -> Entities    | GPU 0, GPU_I 3, GPU_I 2, GPU_I 0, GPU_I 1                |
| -> 3              |                                                          |
|    -> Group ID    | 3                                                        |
|    -> Group Name  | card1                                                    |
|    -> Entities    | GPU 1, GPU_I 10, GPU_I 7, GPU_I 8, GPU_I 9               |
+-------------------+----------------------------------------------------------+

So I have four MIG instances on each card. I would like to know where the ID is obtained per MIG for --devices i:(x) where x is either 0 or 1 as mentioned in --help

Cheers

Shadowphax avatar Aug 25 '21 09:08 Shadowphax

@Shadowphax,

Would you please check whether dcgmi discovery -c gives you the info you need?

WBR, Nik

nikkon-dev avatar Aug 25 '21 15:08 nikkon-dev

@nikkon-dev Hi, I got zero values for a MIG instance too.

CUDA: 11.4
Driver: datacenter-gpu-manager 1:2.3.1 amd64

1637292603(1)

dcgm-exporter --devices=i:2

1637292721(1)

1637292787(1)

Thank you!

xwhuang0923 avatar Nov 19 '21 03:11 xwhuang0923

@xwhuang0923,

Please take a look at the dcgmi discovery -c output. In the --devices=i:X argument, X is the entity ID from the discovery command output, not the MIG Dev from the nvidia-smi output.

WBR, Nik

nikkon-dev avatar Nov 19 '21 22:11 nikkon-dev
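
To illustrate the entity-ID point, here is a minimal sketch that turns an Entities row, like the ones in the dcgmi group -l listing earlier in this thread, into a single --devices value. It rests on two assumptions that the thread does not confirm outright: that the GPU_I numbers in that listing are the same entity IDs that dcgmi discovery -c reports, and that dcgm-exporter accepts a comma-separated ID list after i:. The function name mig_devices_arg is hypothetical.

```python
import re


def mig_devices_arg(entities: str) -> str:
    """Build an 'i:<ids>' string from a dcgmi Entities row.

    Assumes entries look like 'GPU_I <number>'; plain 'GPU <number>'
    entries (the parent cards) are ignored.
    """
    ids = sorted(int(n) for n in re.findall(r"GPU_I\s+(\d+)", entities))
    if not ids:
        raise ValueError("no GPU_I entities found")
    return "i:" + ",".join(str(n) for n in ids)


if __name__ == "__main__":
    # Entities row of group "card0" as posted earlier in this thread.
    card0 = "GPU 0, GPU_I 3, GPU_I 2, GPU_I 0, GPU_I 1"
    print(mig_devices_arg(card0))  # i:0,1,2,3
```

The result would be passed as a single argument, for example --devices i:0,1,2,3, which stays within the one-range limitation discussed above.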

@nikkon-dev Here is the output:

image

xwhuang0923 avatar Nov 20 '21 07:11 xwhuang0923

We have been making quite a few enhancements to DCGM and DCGM-Exporter with respect to MIG. Some of these fixes and enhancements are available in 2.3.2-2.6.3, and others are coming in the next version. Please try upgrading now or in a few weeks, and let us know if the latest versions do not fix the issue.

glowkey avatar Feb 18 '22 18:02 glowkey

@glowkey Is there an example of monitoring MIG instance stats, such as instance memory or instance utilization?

dixing0908 avatar Jun 10 '22 07:06 dixing0908