
Improve DeepSpeed Stage 3 Throughput

Open SeanNaren opened this issue 1 year ago • 0 comments

On 8 A100s with this DeepSpeed config, below are the measured TFLOPs:

deepspeed --num_gpus 8 train.py --batch_size_per_gpu 36
Estimates: 129.32 TFLOPs | Avg Iteration Time: 8.01s
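
For context, the kind of ZeRO Stage 3 config the command above would load might look like the sketch below; every value in it is an assumption for illustration, not the exact config behind this measurement:

    # Sketch of a ZeRO Stage 3 DeepSpeed config (values assumed).
    ds_config = {
        "train_micro_batch_size_per_gpu": 36,
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": True,           # overlap collectives with compute
            "contiguous_gradients": True,   # reduce memory fragmentation
        },
    }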

In the Megatron-LM paper they report the 175B model's per-GPU throughput going from 113 teraFLOP/s with non-fused operators to 135 teraFLOP/s with fused operators. Considering we're missing some fused kernels (#14) we might be getting close to comparable TFLOPs!
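
As a rough sanity check, a per-GPU TFLOPs estimate like the one above can be derived from iteration time with the common ~6 FLOPs per parameter per token approximation; the parameter count and sequence length below are assumptions for illustration, not min-LLM's actual configuration:

    # Rough per-GPU TFLOPs from iteration time (forward + backward,
    # no activation recomputation). Model numbers are assumptions.
    n_params = 1.3e9              # assumed parameter count
    seq_len = 2048                # assumed sequence length
    batch_per_gpu = 36            # from the command above
    n_gpus = 8
    iter_time_s = 8.01            # avg iteration time measured above

    tokens_per_iter = batch_per_gpu * n_gpus * seq_len
    flops_per_iter = 6 * n_params * tokens_per_iter
    tflops_per_gpu = flops_per_iter / iter_time_s / n_gpus / 1e12
    print(f"~{tflops_per_gpu:.1f} TFLOPs/GPU")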

There is also the open question of why sparse attention isn't letting us push compute further, but that will be kept as a separate variable; a quick sketch for isolating it is below.
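
One way to isolate that variable is an A/B run with and without a sparse attention block in the config; the layout values here are assumptions, using DeepSpeed's "fixed" sparsity mode purely as an example:

    # Sketch: A/B test for the sparse attention variable (values assumed).
    sparse_config = {
        "sparse_attention": {
            "mode": "fixed",
            "block": 16,
            "num_local_blocks": 4,
            "num_global_blocks": 1,
        }
    }
    # Merge into ds_config for the sparse run and drop it for the dense
    # baseline, holding batch size and every other setting constant.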

cc @tjruwase @jeffra

SeanNaren · Jul 11 '22 15:07