
Support for reporting FP8 and Transformer Engine usage on H100 GPUs

Open · hassanbabaie opened this issue 1 year ago • 4 comments

I'm wondering what the plan is for being able to break out and report FP8 and Transformer Engine usage on H100s via DCGM (and thus via DCGM Exporter).

DCGM supports FP64, FP32, and FP16, but it seems we're missing an update to be able to break out/detect usage of some of the new features.

I double-checked here and don't see an obvious field I could look at to detect this type of usage:

dcgmlib/dcgm_fields.h
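
For context, these are the per-precision pipe-activity fields I can find today; a quick sketch against the DCGM Python bindings (the constant names mirror dcgmlib/dcgm_fields.h, the list itself is just my illustration):

```python
# Per-precision pipe-activity fields DCGM exposes today (constants from the
# dcgm_fields Python bindings, mirroring dcgmlib/dcgm_fields.h).
import dcgm_fields

existing_pipe_fields = [
    dcgm_fields.DCGM_FI_PROF_PIPE_FP64_ACTIVE,    # FP64 pipe active ratio
    dcgm_fields.DCGM_FI_PROF_PIPE_FP32_ACTIVE,    # FP32 pipe active ratio
    dcgm_fields.DCGM_FI_PROF_PIPE_FP16_ACTIVE,    # FP16 pipe active ratio
    dcgm_fields.DCGM_FI_PROF_PIPE_TENSOR_ACTIVE,  # tensor pipe, all precisions combined
]
# I don't see an FP8- or Transformer-Engine-specific equivalent in the header.
```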

Any thoughts on this would be appreciated

Thanks

hassanbabaie avatar Jun 22 '23 16:06 hassanbabaie

Hello @hassanbabaie,

Unfortunately, it is currently not possible to break down pipelines in order to isolate FP8 utilization.

nikkon-dev avatar Jul 01 '23 06:07 nikkon-dev

A good recommendation is to review these fields:

  • DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE
  • DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE

IMMA covers INT8/FP8 tensor instructions; HMMA covers FP16/FP32 tensor instructions.

These would give you some correlation with Tensor Core usage; I recommend running some heuristics to see whether they correlate as one might expect.
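
As a starting point, something like the sketch below polls those two fields next to the aggregate tensor-active field so you can compare them against your FP8 workload. This is a minimal sketch using the DcgmReader helper from DCGM's Python bindings; it assumes the bindings (dcgm_fields, DcgmReader) are importable, and the exact reader arguments and return shape may differ between DCGM versions:

```python
# Minimal sketch: poll the IMMA/HMMA tensor-pipe fields next to the aggregate
# tensor-active field so their values can be compared against an FP8 workload.
# Assumes the DCGM Python bindings (dcgm_fields, DcgmReader) are importable,
# e.g. from /usr/local/dcgm/bindings/python3; details may vary by DCGM version.
import time

import dcgm_fields
from DcgmReader import DcgmReader

FIELDS = [
    dcgm_fields.DCGM_FI_PROF_PIPE_TENSOR_ACTIVE,       # all tensor instructions
    dcgm_fields.DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE,  # INT8/FP8 tensor instructions
    dcgm_fields.DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE,  # FP16/FP32 tensor instructions
]

# updateFrequency is in microseconds; sample roughly once per second.
reader = DcgmReader(fieldIds=FIELDS, updateFrequency=1000000)

while True:
    # Latest sample per GPU for the watched fields, keyed by field ID.
    data = reader.GetLatestGpuValuesAsFieldIdDict()
    for gpu_id, values in data.items():
        print(f"GPU {gpu_id}: "
              f"tensor={values.get(dcgm_fields.DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)} "
              f"imma={values.get(dcgm_fields.DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE)} "
              f"hmma={values.get(dcgm_fields.DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE)}")
    time.sleep(1)
```

You can do something similar on the command line with `dcgmi dmon`, passing the corresponding field IDs to `-e`.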

rnertney avatar Sep 26 '23 20:09 rnertney

Hi @rnertney, just a quick heads up: I'm not sure we're seeing this. We had an FP8 run and did not see the IMMA metric trigger.

[screenshot attached]

hassanbabaie avatar Nov 03 '23 21:11 hassanbabaie

@rnertney any luck on the above ^^ thanks again

hassanbabaie avatar Dec 19 '23 19:12 hassanbabaie