min-LLM
Improve DeepSpeed Stage 3 Throughput
On 8 A100s with this DeepSpeed config, below is the measured TFLOPs:
```
deepspeed --num_gpus 8 train.py --batch_size_per_gpu 36
```

Estimate: 129.32 TFLOPs, avg iteration time: 8.01 s
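For reference, a per-GPU TFLOPs estimate like the one above is commonly derived from the 6·N·D approximation (forward + backward pass ≈ 6 FLOPs per parameter per token). The sketch below shows that arithmetic; the parameter count and sequence length are illustrative assumptions, not values confirmed for this run — only the batch size, GPU count, and iteration time come from the command above.

```python
def tflops_per_gpu(params, tokens_per_iter, iter_time_s, num_gpus):
    """Approximate achieved teraFLOP/s per GPU for one training iteration,
    using the standard 6 * params * tokens estimate of fwd+bwd FLOPs."""
    flops = 6 * params * tokens_per_iter
    return flops / (iter_time_s * num_gpus * 1e12)

# Batch size (36/GPU on 8 GPUs) and iteration time (8.01 s) match the run
# above; params and seq_len are hypothetical placeholders.
params = 2.5e9          # assumed parameter count
seq_len = 2048          # assumed sequence length
tokens = 36 * 8 * seq_len

print(f"{tflops_per_gpu(params, tokens, 8.01, 8):.2f} TFLOP/s per GPU")
```

Plugging in the real model size and sequence length should land near the reported 129.32 TFLOPs if the estimate and the measurement agree.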
The Megatron-LM paper reports throughput for their 175B model of 113 teraFLOP/s per GPU without fused operators vs. 135 teraFLOP/s per GPU with them. Considering we're still missing some fused kernels (#14), we might be getting close to comparable TFLOPs!
There is also the open question of why sparse attention isn't letting us push compute further, but that will remain a separate variable.
cc @tjruwase @jeffra