[Disco] Add MSCCLPP initialization alongside NCCL
- Use CCL type traits to share common code between the NCCL and MSCCLPP API invocations in Disco (see the sketch below).
- Add a benchmark to validate results and compare the supported CCL approaches for CUDA.
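For context, the trait-based sharing looks roughly like the following. This is a minimal sketch, not the actual Disco code: `CCLTraits`, the backend tag types, and `LaunchAllReduceSimple`-style helpers are illustrative names, and error handling is omitted.

```cpp
// Sketch of the CCL type-traits idea (illustrative names only). Generic
// runtime code is written once against CCLTraits<Backend>; each backend
// supplies the concrete communicator type and collective call.
#include <cstddef>
#include <cuda_fp16.h>
#include <nccl.h>

struct NCCLBackend;
struct MSCCLPPBackend;

template <typename Backend>
struct CCLTraits;

template <>
struct CCLTraits<NCCLBackend> {
  using CommT = ncclComm_t;
  static void AllReduceFP16Sum(CommT comm, __half* buf, size_t count,
                               cudaStream_t stream) {
    // In-place fp16 sum all-reduce through the standard NCCL API.
    ncclAllReduce(buf, buf, count, ncclFloat16, ncclSum, comm, stream);
  }
};

template <>
struct CCLTraits<MSCCLPPBackend> {
  // MSCCLPP exposes peer-to-peer channels rather than a one-call
  // all-reduce, so this specialization would launch a custom kernel such
  // as the allreduce_simple<__half> kernel visible in the profile below.
  // The handle type and launch wrapper are placeholders.
  using CommT = void*;
  static void AllReduceFP16Sum(CommT comm, __half* buf, size_t count,
                               cudaStream_t stream);
};

// Generic call site: identical for both backends.
template <typename Backend>
void DiscoAllReduce(typename CCLTraits<Backend>::CommT comm, __half* buf,
                    size_t count, cudaStream_t stream) {
  CCLTraits<Backend>::AllReduceFP16Sum(comm, buf, count, stream);
}
```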
Aggregated profiling results over the sweep of transfer sizes exercised by the benchmark above (2**12 through 2**24):
| Time (%) | Total Time (ns) | Instances | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
| -------- | --------------- | --------- | -------- | -------- | -------- | -------- | ----------- | ---- |
| 33.2 | 277809887 | 4048 | 68628.9 | 45504.0 | 15104 | 25069360 | 545729.8 | `ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm *, unsigned long, ncclWork *)` |
| 24.3 | 203015590 | 4040 | 50251.4 | 48256.0 | 12992 | 315934 | 29470.2 | `void tensorrt_llm::twoShotAllReduceKernel<__half, (int)8>(tensorrt_llm::AllReduceParams)` |
| 20.9 | 174549284 | 4040 | 43205.3 | 39440.0 | 3711 | 1085846 | 52387.0 | `void tensorrt_llm::oneShotAllReduceKernel<__half, (int)8>(tensorrt_llm::AllReduceParams)` |
| 20.7 | 173275472 | 4040 | 42890.0 | 40112.0 | 5375 | 750653 | 42864.2 | `void tvm::runtime::allreduce_simple<__half>(mscclpp::SmChannelDeviceHandle *, const T1 *, T1 *, voi…` |
| 0.8 | 6985121 | 120 | 58209.3 | 55871.5 | 9695 | 158239 | 35854.8 | `ncclDevKernel_AllGather_RING_LL(ncclDevComm *, unsigned long, ncclWork *)` |
I noted significant variance between runs, so end-to-end measurements or the use of CUDA graph launches for synchronization could help give a clearer picture.
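For reference, capturing one benchmark iteration in a CUDA graph and replaying it would factor out most per-launch overhead. A hedged sketch, assuming an `EnqueueAllReduce` callable (hypothetical, standing in for whichever CCL call is under test):

```cpp
// Sketch: capture one iteration in a CUDA graph, then replay it, so the
// measured loop excludes most kernel-launch overhead.
#include <cuda_runtime.h>

void EnqueueAllReduce(cudaStream_t stream);  // hypothetical collective launch

void BenchWithGraph(cudaStream_t stream, int iters) {
  cudaGraph_t graph;
  cudaGraphExec_t exec;

  // Record the collective into a graph instead of executing it.
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  EnqueueAllReduce(stream);
  cudaStreamEndCapture(stream, &graph);

  // CUDA 12 signature; CUDA 11 toolkits use the five-argument overload.
  cudaGraphInstantiate(&exec, graph, 0);

  // Replay the captured work; time this loop end to end.
  for (int i = 0; i < iters; ++i) {
    cudaGraphLaunch(exec, stream);
  }
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
}
```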
@csullivan please fix the lint error