Comparison of native AllReduce and compressed AllReduce in DeepSpeed
Hi, I tested the native AllReduce (`deepspeed.comm.all_reduce`) and the compressed AllReduce (`backend.compressed_allreduce`) in DeepSpeed with this test script. On a ROCm system, we observed a 414% performance improvement when switching from the compressed AllReduce to the native AllReduce.
I am wondering what the motivation for using the compressed AllReduce in 1-bit compression is, and what the possible limitations of switching back to the native AllReduce would be.
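Roughly, the comparison I ran looks like the sketch below (simplified, not the original test script; the `compressed_allreduce` signature and the error-feedback buffer sizing are taken from `deepspeed/runtime/comm/nccl.py` and the 1-bit Adam implementation, so treat those details as assumptions):

```python
# Rough sketch of the comparison (simplified, not the original test script).
# Assumptions: launched with the `deepspeed` launcher so LOCAL_RANK is set,
# the tensor size is divisible by the world size, and compressed_allreduce
# takes (buffer, worker_error, server_error, local_rank) as in
# deepspeed/runtime/comm/nccl.py.
import os
import time

import torch
import deepspeed
import deepspeed.comm as dist
from deepspeed.runtime.comm.nccl import NcclBackend

deepspeed.init_distributed()
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

numel = 64 * 1024 * 1024            # 64M fp32 elements (~256 MB)
tensor = torch.randn(numel, device="cuda")
iters = 10

def timeit(fn):
    fn()                            # warm-up iteration
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / iters

# Native AllReduce
native_ms = timeit(lambda: dist.all_reduce(tensor)) * 1e3

# Compressed (1-bit) AllReduce with error-feedback buffers, sized as in the
# 1-bit Adam optimizer (worker error: full tensor, server error: per-rank chunk)
backend = NcclBackend()
worker_error = torch.zeros(numel, device="cuda")
server_error = torch.zeros(numel // dist.get_world_size(), device="cuda")
compressed_ms = timeit(
    lambda: backend.compressed_allreduce(tensor, worker_error, server_error, local_rank)) * 1e3

if dist.get_rank() == 0:
    print(f"native all_reduce: {native_ms:.2f} ms, "
          f"compressed_allreduce: {compressed_ms:.2f} ms")
```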
cc @sunway513 @rraminen @jithunnair-amd
@jeffra @tjruwase
Hi @hubertlu-tw @jithunnair-amd, @awan-10 and I worked on this 1-bit compression and the 1-bit Adam, 0/1 Adam, and 1-bit LAMB optimizers.

First, the compressed AllReduce is necessary because compressed communication cannot be done on top of the native AllReduce without a custom primitive (https://github.com/microsoft/DeepSpeed/blob/b2d550ab850948458abacb167577603bd7b3ab5f/deepspeed/runtime/comm/nccl.py#L51).

Second, there are two caveats regarding the worse performance you observed for the compressed AllReduce:
(1) We never tested the compressed AllReduce on ROCm/AMD hardware, and as pointed out above, `compressed_allreduce` is a complex primitive built on NCCL (there is also an MPI version, but we recommend the NCCL-based one). So we are not sure whether it was running correctly on ROCm in your test.
(2) We would like to know your hardware environment (number of GPUs, number of nodes, network type (Ethernet or InfiniBand), and network bandwidth), because these affect the benefit of 1-bit compression: when the number of GPUs/nodes is very small and/or the network is a very fast InfiniBand, 1-bit compression will have less or even potentially negative benefit, because the gain from the smaller communication volume is relatively small in such environments, while the compression itself adds computation overhead.

For more details on 1-bit compression, we recommend reading our papers: https://proceedings.mlr.press/v139/tang21a.html, https://arxiv.org/abs/2104.06069, https://arxiv.org/abs/2202.06009. @awan-10 feel free to add more comments if I missed anything.
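As a side note, the compressed AllReduce is normally exercised through the 1-bit optimizers rather than called directly. A minimal illustrative sketch of enabling 1-bit Adam in the DeepSpeed config is below (the parameter values and the stand-in model are placeholders, not a tuned recipe):

```python
# Illustrative sketch: the compressed AllReduce is driven by the 1-bit Adam
# optimizer selected in the DeepSpeed config, not by calling the primitive
# directly. All values below are placeholders.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)    # stand-in model for illustration

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 1e-3,
            "freeze_step": 400,            # warm-up steps with uncompressed Adam
            "cuda_aware": False,           # only relevant for the MPI backend
            "comm_backend_name": "nccl",   # NCCL-based compressed_allreduce
        },
    },
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
```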
@hubertlu-tw @jithunnair-amd, please re-open if you have remaining questions/concerns here.