[QUESTION] How to calculate MFU based on the FLOPs?
When I train Qwen2.5-32B with Megatron, the reported throughput was around 420 TFLOP/s per GPU using H200 x 2, with the partition tp=4, pp=2. So according to the MFU calculation, GPU utilization is 420/1979, which is very low. Why is that? Is the logic of num_floating_point_operations in training.py wrong? What is the approximate number when you train such a model?
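For context, the per-GPU TFLOP/s number is roughly "model FLOPs per iteration / iteration time / number of GPUs". Below is a minimal sketch using the common ~6·N·D approximation for transformer training FLOPs; it is not the exact logic of num_floating_point_operations (which also accounts for attention and vocabulary terms), and the function name and arguments here are made up for illustration.

```python
# Rough estimate of achieved TFLOP/s per GPU (not Megatron's exact formula).
def estimated_tflops_per_gpu(num_params, tokens_per_iter, iter_time_s, num_gpus):
    """Approximate achieved TFLOP/s per GPU.

    num_params      : total model parameters (e.g. ~32e9 for Qwen2.5-32B)
    tokens_per_iter : global_batch_size * sequence_length
    iter_time_s     : wall-clock time of one training iteration in seconds
    num_gpus        : total number of GPUs in the job
    """
    # ~6 FLOPs per parameter per token (forward + backward); attention adds
    # extra terms that a full accounting would include.
    total_flops = 6 * num_params * tokens_per_iter
    return total_flops / iter_time_s / num_gpus / 1e12
```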
The FP16/BF16 1979 TFLOPS figure in the H200 spec is with sparsity, so I think the actual MFU should be 420/(1979/2) = 42.45%.
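A quick worked check of that arithmetic (values taken from this thread; the dense peak is simply the sparse spec number halved):

```python
# MFU check: the 1979 TFLOPS H200 BF16 Tensor Core peak is quoted *with*
# 2:4 structured sparsity, so the dense peak relevant for MFU is half of it.
achieved_tflops_per_gpu = 420.0              # reported by Megatron in this run
peak_sparse_tflops = 1979.0                  # H200 spec sheet (with sparsity)
peak_dense_tflops = peak_sparse_tflops / 2   # ~989.5 TFLOPS dense

mfu = achieved_tflops_per_gpu / peak_dense_tflops
print(f"MFU = {mfu:.2%}")                    # -> MFU = 42.45%
```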
Could you explain what sparsity means?
Please refer to https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/
Thanks for the reference!