min-LLM
Improve DeepSpeed Stage 3 Throughput
On 8 A100s with this DeepSpeed config, below is the measured TFLOPs:
```
deepspeed --num_gpus 8 train.py --batch_size_per_gpu 36
```

Estimate: 129.32 TFLOPs, avg iteration time: 8.01 s
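For reference, a per-GPU TFLOPs estimate like the one above is commonly derived from the 6·N·D approximation (forward + backward pass ≈ 6 FLOPs per parameter per token). The sketch below shows that arithmetic; the parameter count and sequence length are illustrative assumptions, not values confirmed for this run — only the batch size, GPU count, and iteration time come from the command above.

```python
def tflops_per_gpu(params, tokens_per_iter, iter_time_s, num_gpus):
    """Approximate achieved teraFLOP/s per GPU for one training iteration,
    using the standard 6 * params * tokens estimate of fwd+bwd FLOPs."""
    flops = 6 * params * tokens_per_iter
    return flops / (iter_time_s * num_gpus * 1e12)

# Batch size (36/GPU on 8 GPUs) and iteration time (8.01 s) match the run
# above; params and seq_len are hypothetical placeholders.
params = 2.5e9          # assumed parameter count
seq_len = 2048          # assumed sequence length
tokens = 36 * 8 * seq_len

print(f"{tflops_per_gpu(params, tokens, 8.01, 8):.2f} TFLOP/s per GPU")
```

Plugging in the real model size and sequence length should land near the reported 129.32 TFLOPs if the estimate and the measurement agree.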
The Megatron-LM paper reports throughput for their 175B model of 113 teraFLOP/s per GPU without fused operators vs. 135 teraFLOP/s per GPU with them. Considering we're still missing some fused kernels (#14), we might be getting close to comparable TFLOPs!
There is also the open question of why sparse attention isn't letting us push compute further, but that will remain a separate variable.
cc @tjruwase @jeffra