[Disco] Add MSCCLPP initialization alongside NCCL
- Use CCL type traits to share common code between the NCCL and MSCCLPP API invocations in Disco (see the sketch below).
- Add a benchmark to validate results and compare the supported CCL approaches for CUDA.
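For context, the trait-based sharing looks roughly like the following. This is a minimal sketch, not the actual Disco code: `CCLTraits`, the backend tag types, and `LaunchAllReduceSimple`-style helpers are illustrative names, and error handling is omitted.

```cpp
// Sketch of the CCL type-traits idea (illustrative names only). Generic
// runtime code is written once against CCLTraits<Backend>; each backend
// supplies the concrete communicator type and collective call.
#include <cstddef>
#include <cuda_fp16.h>
#include <nccl.h>

struct NCCLBackend;
struct MSCCLPPBackend;

template <typename Backend>
struct CCLTraits;

template <>
struct CCLTraits<NCCLBackend> {
  using CommT = ncclComm_t;
  static void AllReduceFP16Sum(CommT comm, __half* buf, size_t count,
                               cudaStream_t stream) {
    // In-place fp16 sum all-reduce through the standard NCCL API.
    ncclAllReduce(buf, buf, count, ncclFloat16, ncclSum, comm, stream);
  }
};

template <>
struct CCLTraits<MSCCLPPBackend> {
  // MSCCLPP exposes peer-to-peer channels rather than a one-call
  // all-reduce, so this specialization would launch a custom kernel such
  // as the allreduce_simple<__half> kernel visible in the profile below.
  // The handle type and launch wrapper are placeholders.
  using CommT = void*;
  static void AllReduceFP16Sum(CommT comm, __half* buf, size_t count,
                               cudaStream_t stream);
};

// Generic call site: identical for both backends.
template <typename Backend>
void DiscoAllReduce(typename CCLTraits<Backend>::CommT comm, __half* buf,
                    size_t count, cudaStream_t stream) {
  CCLTraits<Backend>::AllReduceFP16Sum(comm, buf, count, stream);
}
```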
Aggregated profiling results over the sweep of transfer sizes exercised by the benchmark above (2**12 through 2**24):
| Time (%) | Total Time (ns) | Instances | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
| -------- | --------------- | --------- | -------- | -------- | -------- | -------- | ----------- | ---- |
| 33.2 | 277809887 | 4048 | 68628.9 | 45504.0 | 15104 | 25069360 | 545729.8 | `ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm *, unsigned long, ncclWork *)` |
| 24.3 | 203015590 | 4040 | 50251.4 | 48256.0 | 12992 | 315934 | 29470.2 | `void tensorrt_llm::twoShotAllReduceKernel<__half, (int)8>(tensorrt_llm::AllReduceParams)` |
| 20.9 | 174549284 | 4040 | 43205.3 | 39440.0 | 3711 | 1085846 | 52387.0 | `void tensorrt_llm::oneShotAllReduceKernel<__half, (int)8>(tensorrt_llm::AllReduceParams)` |
| 20.7 | 173275472 | 4040 | 42890.0 | 40112.0 | 5375 | 750653 | 42864.2 | `void tvm::runtime::allreduce_simple<__half>(mscclpp::SmChannelDeviceHandle *, const T1 *, T1 *, voi…` |
| 0.8 | 6985121 | 120 | 58209.3 | 55871.5 | 9695 | 158239 | 35854.8 | `ncclDevKernel_AllGather_RING_LL(ncclDevComm *, unsigned long, ncclWork *)` |
I noted significant variance between runs, so end-to-end measurements or the use of CUDA graph launches for synchronization could help give a clearer picture.
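For reference, capturing one benchmark iteration in a CUDA graph and replaying it would factor out most per-launch overhead. A hedged sketch, assuming an `EnqueueAllReduce` callable (hypothetical, standing in for whichever CCL call is under test):

```cpp
// Sketch: capture one iteration in a CUDA graph, then replay it, so the
// measured loop excludes most kernel-launch overhead.
#include <cuda_runtime.h>

void EnqueueAllReduce(cudaStream_t stream);  // hypothetical collective launch

void BenchWithGraph(cudaStream_t stream, int iters) {
  cudaGraph_t graph;
  cudaGraphExec_t exec;

  // Record the collective into a graph instead of executing it.
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  EnqueueAllReduce(stream);
  cudaStreamEndCapture(stream, &graph);

  // CUDA 12 signature; CUDA 11 toolkits use the five-argument overload.
  cudaGraphInstantiate(&exec, graph, 0);

  // Replay the captured work; time this loop end to end.
  for (int i = 0; i < iters; ++i) {
    cudaGraphLaunch(exec, stream);
  }
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
}
```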
@csullivan please fix the lint error