Shixuan Zheng

2 comments by Shixuan Zheng

It turned out that this implementation performs worse than running without fusion:

    ./launch.sh test/python/gemm_rs/test_gemm_rs.py 8192 12288 8192 --dtype=float16 --iters=10

    torch #0: gemm 0.557 ms, comm 1.009 ms, total 1.566 ms...
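For comparing runs, here is a minimal sketch that parses a timing line of the form shown above and checks how much time is hidden by overlap. The helper name `parse_timings` is hypothetical and not part of the test script, and the line format is assumed from the single `torch #0` line quoted here. In that line, gemm plus comm equals total, which suggests the two phases ran back to back.

```python
import re

# Timing line copied from the output above (the elided remainder is left out).
LINE = "torch #0: gemm 0.557 ms, comm 1.009 ms, total 1.566 ms"


def parse_timings(line):
    """Hypothetical helper: return gemm/comm/total in milliseconds."""
    pairs = re.findall(r"(gemm|comm|total)\s+([0-9.]+)\s*ms", line)
    return {name: float(value) for name, value in pairs}


timings = parse_timings(LINE)

# Time hidden by overlapping compute and communication.
# Clamped at zero so floating-point rounding noise does not go negative.
hidden_ms = max(0.0, timings["gemm"] + timings["comm"] - timings["total"])

print(timings)
print(f"hidden by overlap: {hidden_ms:.3f} ms")
```

Running the same check on the fused implementation's line would show whether fusion actually hides any communication time behind the GEMM.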