
[Disco] Add MSCCLPP initialization alongside NCCL

Open csullivan opened this issue 1 year ago • 1 comments

  • Use CCL type traits to share common code between NCCL and MSCCLPP API invocations in Disco.
  • Add a benchmark to validate results and compare the supported CCL approaches for CUDA.
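
As a rough illustration of the type-traits idea in the first bullet, here is a minimal sketch. It does not use the real NCCL or MSCCLPP headers: `FakeNcclBackend`, `FakeMscclppBackend`, `CclTraits`, and `InitAndDescribe` are hypothetical names invented for this example, and the "communicator" values are placeholders rather than anything either library returns.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the two backends (assumption: not the real libraries).
struct FakeNcclBackend {};
struct FakeMscclppBackend {};

// Primary template is left undefined so an unsupported backend fails to compile.
template <typename Backend>
struct CclTraits;

// Each specialization exposes the same interface with backend-specific types.
template <>
struct CclTraits<FakeNcclBackend> {
  using CommHandle = int;  // placeholder for a communicator type
  static const char* Name() { return "nccl"; }
  static CommHandle InitComm(int world_size, int rank) { return world_size * 100 + rank; }
};

template <>
struct CclTraits<FakeMscclppBackend> {
  using CommHandle = long;  // placeholder for a communicator type
  static const char* Name() { return "mscclpp"; }
  static CommHandle InitComm(int world_size, int rank) { return world_size * 1000L + rank; }
};

// Shared code is written once against the traits interface.
template <typename Backend>
std::string InitAndDescribe(int world_size, int rank) {
  assert(rank >= 0 && rank < world_size);
  using Traits = CclTraits<Backend>;
  typename Traits::CommHandle comm = Traits::InitComm(world_size, rank);
  return std::string(Traits::Name()) + ":" + std::to_string(comm);
}
```

The point of the pattern is that `InitAndDescribe` never names a backend directly; choosing `InitAndDescribe<FakeNcclBackend>` or `InitAndDescribe<FakeMscclppBackend>` at the call site selects the backend-specific types and calls at compile time.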

Aggregated profiling results over the sweep of transfer sizes used in the benchmark above, 2**(12 -> 24):

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     33.2        277809887       4048   68628.9   45504.0     15104  25069360     545729.8  ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
     24.3        203015590       4040   50251.4   48256.0     12992    315934      29470.2  void tensorrt_llm::twoShotAllReduceKernel<__half, (int)8>(tensorrt_llm::AllReduceParams)
     20.9        174549284       4040   43205.3   39440.0      3711   1085846      52387.0  void tensorrt_llm::oneShotAllReduceKernel<__half, (int)8>(tensorrt_llm::AllReduceParams)
     20.7        173275472       4040   42890.0   40112.0      5375    750653      42864.2  void tvm::runtime::allreduce_simple<__half>(mscclpp::SmChannelDeviceHandle *, const T1 *, T1 *, voi…
      0.8          6985121        120   58209.3   55871.5      9695    158239      35854.8  ncclDevKernel_AllGather_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
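
For reference, the 2**(12 -> 24) sweep behind the table above could be generated along these lines. This is a sketch under assumptions: it takes the sweep to be powers of two with both endpoints included, and it makes no claim about whether the sizes are in bytes or elements. `SweepSizes` is a hypothetical helper, not a function from the benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns 2**lo_exp, 2**(lo_exp + 1), ..., 2**hi_exp (endpoints inclusive, by assumption).
std::vector<std::size_t> SweepSizes(unsigned lo_exp, unsigned hi_exp) {
  assert(lo_exp <= hi_exp && hi_exp < 8 * sizeof(std::size_t));
  std::vector<std::size_t> sizes;
  for (unsigned e = lo_exp; e <= hi_exp; ++e) {
    sizes.push_back(std::size_t{1} << e);
  }
  return sizes;
}
```

Under those assumptions, `SweepSizes(12, 24)` yields 13 sizes, from 4096 up to 16777216.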

I noted significant variance between runs, so an end-to-end measurement or using CUDA graph launch for synchronization could help give a clearer picture.

csullivan · Apr 03 '24 17:04

@csullivan please fix the lint error

tqchen · Apr 09 '24 18:04