Shixuan Zheng
Just change nproc_per_node in launch.sh: nproc_per_node=2
It turned out that this implementation leads to worse performance than no-fusion:

    ./launch.sh test/python/gemm_rs/test_gemm_rs.py 8192 12288 8192 --dtype=float16 --iters=10

    torch #0: gemm 0.557 ms, comm 1.009 ms, total 1.566 ms...
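
For comparing runs, it can help to pull the numbers out of these log lines with a small script. This is only a minimal sketch: it assumes every timing line has the same shape as the torch line quoted above, and `parse_timing` and the regex are my own helpers, not part of the repo.

```python
import re

# Pattern inferred from the single sample line quoted above (an assumption,
# not a documented log format).
LINE_RE = re.compile(
    r"(?P<name>\S+) #(?P<rank>\d+): gemm (?P<gemm>[\d.]+) ms, "
    r"comm (?P<comm>[\d.]+) ms, total (?P<total>[\d.]+) ms"
)


def parse_timing(line):
    """Return the timings found in one log line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "rank": int(m.group("rank")),
        "gemm_ms": float(m.group("gemm")),
        "comm_ms": float(m.group("comm")),
        "total_ms": float(m.group("total")),
    }


sample = "torch #0: gemm 0.557 ms, comm 1.009 ms, total 1.566 ms"
t = parse_timing(sample)

# Share of the non-fused total that is spent in communication.
comm_share = t["comm_ms"] / t["total_ms"]
print(f"{t['name']} rank {t['rank']}: comm is {comm_share:.1%} of total")
```

For the torch baseline the total is just gemm plus comm (0.557 + 1.009 = 1.566 ms), so communication dominates; a fused implementation only wins if its total comes in below that sum.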