[QUESTION] How to calculate MFU based on the FLOPs?
When I train Qwen2.5-32B with Megatron, the reported throughput was around 420 TFLOP/s per GPU using H200 x 2, with the partition tp=4, pp=2. So according to the MFU calculation, GPU utilization is 420/1979, which is very low. Why is that? Is the logic of num_floating_point_operations in training.py wrong? What is the approximate number when you train such a model?
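For context, the per-GPU TFLOP/s number is roughly "model FLOPs per iteration / iteration time / number of GPUs". Below is a minimal sketch using the common ~6·N·D approximation for transformer training FLOPs; it is not the exact logic of num_floating_point_operations (which also accounts for attention and vocabulary terms), and the function name and arguments here are made up for illustration.

```python
# Rough estimate of achieved TFLOP/s per GPU (not Megatron's exact formula).
def estimated_tflops_per_gpu(num_params, tokens_per_iter, iter_time_s, num_gpus):
    """Approximate achieved TFLOP/s per GPU.

    num_params      : total model parameters (e.g. ~32e9 for Qwen2.5-32B)
    tokens_per_iter : global_batch_size * sequence_length
    iter_time_s     : wall-clock time of one training iteration in seconds
    num_gpus        : total number of GPUs in the job
    """
    # ~6 FLOPs per parameter per token (forward + backward); attention adds
    # extra terms that a full accounting would include.
    total_flops = 6 * num_params * tokens_per_iter
    return total_flops / iter_time_s / num_gpus / 1e12
```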
The FP16/BF16 1979 TFLOPS figure in the H200 spec is with sparsity, so I think the actual MFU should be 420/(1979/2) = 42.45%.
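A quick worked check of that arithmetic (values taken from this thread; the dense peak is simply the sparse spec number halved):

```python
# MFU check: the 1979 TFLOPS H200 BF16 Tensor Core peak is quoted *with*
# 2:4 structured sparsity, so the dense peak relevant for MFU is half of it.
achieved_tflops_per_gpu = 420.0              # reported by Megatron in this run
peak_sparse_tflops = 1979.0                  # H200 spec sheet (with sparsity)
peak_dense_tflops = peak_sparse_tflops / 2   # ~989.5 TFLOPS dense

mfu = achieved_tflops_per_gpu / peak_dense_tflops
print(f"MFU = {mfu:.2%}")                    # -> MFU = 42.45%
```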
Could you explain what sparsity means?
Please refer to https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/
Thanks for the reference!