Comparison of native AllReduce and compressed AllReduce in DeepSpeed

Open hubertlu-tw opened this issue 3 years ago • 3 comments

Hi, I tested the native AllReduce (deepspeed.comm.all_reduce) and the compressed AllReduce (backend.compressed_allreduce) in DeepSpeed with this test script. On a ROCm system, we observed a 414% performance improvement when switching from the compressed AllReduce to the native AllReduce.

I am wondering what the motivation for using the compressed AllReduce in 1-bit compression is, and what the possible limitations of switching back to the native AllReduce are.

hubertlu-tw avatar Sep 08 '22 22:09 hubertlu-tw

cc @sunway513 @rraminen @jithunnair-amd

jithunnair-amd avatar Sep 09 '22 16:09 jithunnair-amd

@jeffra @tjruwase

jithunnair-amd avatar Sep 09 '22 16:09 jithunnair-amd

Hi @hubertlu-tw @jithunnair-amd, @awan-10 and I worked on this 1-bit compression and on the 1-bit Adam, 0/1 Adam, and 1-bit LAMB optimizers.

First, a separate compressed AllReduce is necessary because native AllReduce cannot carry compressed communication; that requires a custom primitive (https://github.com/microsoft/DeepSpeed/blob/b2d550ab850948458abacb167577603bd7b3ab5f/deepspeed/runtime/comm/nccl.py#L51).

Second, regarding the worse performance you observed with compressed AllReduce, there are two caveats: (1) We never tested compressed AllReduce on ROCm/AMD hardware, and as pointed out above, compressed_allreduce is a complex primitive built on NCCL (there is also an MPI version, but we recommend the NCCL-based one), so we are not sure whether it was running correctly on ROCm in your test. (2) We would like to know your hardware environment (number of GPUs, number of nodes, network type (Ethernet or InfiniBand), and network bandwidth), because these affect the benefit of 1-bit compression: when the number of GPUs/nodes is very small and/or the network is very fast InfiniBand, 1-bit compression brings less benefit, or potentially even a negative one, because the saving from the smaller communication volume is relatively small in such environments while 1-bit compression adds compression computation overhead.

For more details on 1-bit compression, we recommend reading our papers https://proceedings.mlr.press/v139/tang21a.html, https://arxiv.org/abs/2104.06069, https://arxiv.org/abs/2202.06009. @awan-10 feel free to add more comments if I missed anything.

conglongli avatar Sep 09 '22 17:09 conglongli
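
For readers following the trade-off described in caveat (2), here is a rough back-of-envelope sketch, not DeepSpeed code. It assumes a ring-style all-reduce in which each GPU moves 2*(n-1)/n of the buffer, fp32 (4 bytes per element) for the native path, 1 bit per element for the compressed path, and a fixed compression overhead. The function names, the 20 ms overhead, and the bandwidth figures are all hypothetical values chosen for illustration; the real compressed_allreduce uses a different communication pattern, so only the qualitative trend should be read from this.

```python
def allreduce_time(num_elems, bytes_per_elem, num_gpus, bandwidth_gbps):
    """Estimated transfer time (seconds) for a ring-style all-reduce.

    Assumption: each GPU sends and receives 2*(n-1)/n of the buffer,
    and the link sustains `bandwidth_gbps` gigabits per second.
    """
    volume_bytes = 2 * (num_gpus - 1) / num_gpus * num_elems * bytes_per_elem
    return volume_bytes * 8 / (bandwidth_gbps * 1e9)


def compressed_allreduce_time(num_elems, num_gpus, bandwidth_gbps, overhead_s):
    """Same model with 1 bit (1/8 byte) per element plus a fixed
    compression/decompression overhead (hypothetical constant)."""
    return allreduce_time(num_elems, 1 / 8, num_gpus, bandwidth_gbps) + overhead_s


def speedup(num_elems, num_gpus, bandwidth_gbps, overhead_s):
    """Ratio native/compressed; values below 1 mean compression is slower."""
    native = allreduce_time(num_elems, 4, num_gpus, bandwidth_gbps)  # fp32
    compressed = compressed_allreduce_time(
        num_elems, num_gpus, bandwidth_gbps, overhead_s
    )
    return native / compressed


NUM_ELEMS = 100_000_000  # hypothetical buffer: 100M fp32 values
OVERHEAD_S = 0.02        # hypothetical compression cost: 20 ms

# Many GPUs on slow Ethernet: communication dominates, compression helps.
slow_network = speedup(NUM_ELEMS, num_gpus=16, bandwidth_gbps=10,
                       overhead_s=OVERHEAD_S)

# Two GPUs on fast InfiniBand: the overhead outweighs the saved volume.
fast_network = speedup(NUM_ELEMS, num_gpus=2, bandwidth_gbps=200,
                       overhead_s=OVERHEAD_S)

print(f"16 GPUs, 10 Gbps:  speedup = {slow_network:.2f}x")
print(f"2 GPUs, 200 Gbps: speedup = {fast_network:.2f}x")
```

Under these assumed numbers the model reproduces the pattern described above: compression pays off when bandwidth is the bottleneck and can become a net loss on a small, fast setup. Plugging in measured bandwidth and measured compression time for a given cluster would give a more meaningful estimate.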

@hubertlu-tw @jithunnair-amd, please re-open if you have remaining questions/concerns here.

jeffra avatar Sep 26 '22 22:09 jeffra