DCGM
Support for reporting FP8 and Transformer Engine usage on H100 GPUs
I'm wondering what the plan is for being able to break out and report FP8 and Transformer Engine usage on H100s via DCGM (so that we then get it via DCGM Exporter).
DCGM supports FP64, FP32, and FP16, but it seems like we're missing an update that would let us break out/detect usage of some of the new features.
I double-checked here and don't see an obvious field I could use to detect this type of usage.
Any thoughts on this would be appreciated.
Thanks
Hello @hassanbabaie,
Unfortunately, it is currently not possible to break down pipelines in order to isolate FP8 utilization.
A good starting point is to review these fields:
- DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE
- DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE
IMMA covers INT8/FP8 tensor instructions; HMMA covers FP16/FP32 tensor instructions.
This would give you some correlation with tensor-core usage; I recommend applying some heuristics to check whether the metrics correlate with your workloads as you would expect.
Hi @rnertney, just a quick heads up: I'm not sure we're seeing this. We had an FP8 run and did not see the IMMA metric trigger.
@rnertney any luck on the above? ^^ Thanks again.