
Metrics around capturing GPU FLOPS

Open krishh85 opened this issue 9 months ago • 4 comments

The metric "DCGM_FI_PROF_PIPE_FP64_ACTIVE" is defined as the "Ratio of cycles the fp64 pipe is active". I suppose the unit here is time. How do we equate this to a FLOPS count? For example, according to the article, the A100 seems to have a maximum of 19.5 TFLOPS (count).

Is it reasonably correct to convert the "DCGM_FI_PROF_PIPE_FP64_ACTIVE" percentage into an absolute FLOPS figure, i.e. does a value of 50% for DCGM_FI_PROF_PIPE_FP64_ACTIVE equal 0.5 * 19.5 = 9.75 TFLOPS? There also seem to be different pipes for 16/32-bit operations (DCGM_FI_PROF_PIPE_FP16_ACTIVE/DCGM_FI_PROF_PIPE_FP32_ACTIVE). How are these related to the "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE" metric? What would be the ideal way to get the FLOPS (count) currently being performed by the GPU?

krishh85 avatar May 01 '24 16:05 krishh85

@nikkon-dev Any pointers would be greatly appreciated. Thanks

krishh85 avatar May 01 '24 18:05 krishh85

@nikkon-dev @bmarchant, gentle ping on this question.

krishh85 avatar May 03 '24 21:05 krishh85

@krishh85,

There is no method to convert the utilization metrics and compare them to the theoretical FLOP numbers.

FP64_ACTIVE

The percentage of cycles in which the SM execution pipes are active in executing FP64 instructions. This does not include FP64 tensor instructions. Both GA100 and GH100 have high-speed FP64 pipes.

  • For GA100, DFMA, DADD, DMUL, and DSETP can be issued at 0.25 instructions/cycle per SM sub-partition to the fp64 pipe.
  • For GH100, DFMA, DADD, DMUL, and DSETP can be issued at 0.5 instructions/cycle per SM sub-partition to the fp64 pipe.
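As a rough illustration of what these issue rates imply, the sketch below computes a theoretical FP64 ceiling for the case where every issue slot carries a full-warp DFMA. The values of 4 sub-partitions per SM, 32 threads per warp, 108 SMs and a 1.41 GHz boost clock for an A100 are assumptions brought in for the example and are not stated in this thread; `peak_fp64_flops` is a hypothetical helper, not a DCGM API.

```python
# Rough peak FP64 rate implied by the per-sub-partition issue rate.
# Assumed (not from this thread): 4 sub-partitions per SM, 32 threads per warp,
# and 108 SMs at a 1.41 GHz boost clock for an A100.

def peak_fp64_flops(issue_rate, sm_count, clock_hz,
                    subpartitions_per_sm=4, threads_per_warp=32,
                    flops_per_thread=2):
    """Upper bound in FLOP/s if every issue slot carries a full-warp DFMA."""
    flops_per_sm_per_cycle = (issue_rate * subpartitions_per_sm
                              * threads_per_warp * flops_per_thread)
    return flops_per_sm_per_cycle * sm_count * clock_hz


# GA100 issue rate from the list above: 0.25 instructions/cycle/sub-partition.
a100_peak = peak_fp64_flops(issue_rate=0.25, sm_count=108, clock_hz=1.41e9)
print(f"{a100_peak / 1e12:.2f} TFLOP/s")
```

Under these assumptions the result lands close to the published A100 FP64 (non-tensor) peak of 9.7 TFLOPS. It is only a ceiling: real kernels mix DADD, DMUL and DSETP with DFMA and rarely keep all 32 threads active.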

The DFMA pipe and Tensor pipe share a dispatch port, so instructions cannot be issued to both pipes simultaneously.

There is no easy method to convert from pipe active to FP64 FLOPS. DFMA should have a weight of 2. DADD and DMUL should have a weight of 1. Many tools would not include DSETP (comparison) as an operation. Each instruction can execute on 0-32 threads (due to the thread active mask and predication mask). To provide an accurate FLOPs number, the per-instruction weight and the number of threads that actually execute each instruction (that is, write back a result) must be known.
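A minimal sketch of the weighting described above, assuming you already have a per-instruction trace with active-thread counts (which the {PIPE}_ACTIVE metrics do not provide). The `trace` data and the `fp64_flops` helper are hypothetical and only illustrate the arithmetic.

```python
# Weights from the comment above: DFMA counts as 2 FLOPs, DADD and DMUL as 1.
# DSETP (comparison) is given weight 0, as many tools would not count it.
FP64_WEIGHTS = {"DFMA": 2, "DADD": 1, "DMUL": 1, "DSETP": 0}


def fp64_flops(executed):
    """Sum FLOPs over (opcode, active_threads) pairs, active_threads in 0..32."""
    total = 0
    for opcode, active_threads in executed:
        if not 0 <= active_threads <= 32:
            raise ValueError(f"active_threads out of range: {active_threads}")
        total += FP64_WEIGHTS[opcode] * active_threads
    return total


# Hypothetical trace: the pipe is busy for five instructions, but the FLOP
# count depends on the opcode mix and on how many threads were active.
trace = [("DFMA", 32), ("DFMA", 8), ("DADD", 32), ("DSETP", 32), ("DMUL", 0)]
print(fp64_flops(trace))  # 2*32 + 2*8 + 1*32 + 0 + 0 = 112
```

Five full-warp DFMA instructions would give 320 FLOPs for the same pipe activity, which is why a pipe-active ratio alone cannot be turned into a FLOP count.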

The {PIPE}_ACTIVE metrics allow you to see what types of operations are performed and what level of pipe activity is achieved. If a time period has a mix of operation types, it is difficult to quantify the efficiency. For example, if TC_ACTIVE is at 100%, then FP32_ACTIVE or FP16_ACTIVE will likely not be able to achieve more than 50%.

There is no weighted sum across pipes, which makes sense. In Nsight Compute, the primary method for profiling a workload is to look at the unit throughput (or Speed of Light) and determine what units are heavily utilized and what units are not utilized. The monitoring metrics do not cover the L1 and L2 memory subsystems, so the only sign of a memory-limited kernel is DRAM_ACTIVE.

nikkon-dev avatar May 08 '24 23:05 nikkon-dev

@nikkon-dev A quick question before closing this. Are these metric values in the range 0-1 or 0-100%? The documentation does not state that. Should all metrics that capture a "Ratio of cycles" be considered to be in the 0-1 range?

krishh85 avatar May 23 '24 19:05 krishh85