flux [QUESTION] Is flux slightly slower than torch?

Hello, I have a question. I ran

./launch.sh test/python/gemm_rs/test_gemm_rs.py 16384 2048 16384 --dtype=bfloat16 --iters=10

and got:

torch #1: gemm 0.200 ms, comm 0.405 ms, total 0.605 ms
flux #1: gemm 0.194 ms, comm 0.609 ms, total 0.803 ms

I then ran

./launch.sh tools/tune_gemm_rs.py --output_dir output

which printed:

GemmMeta(dtype=GemmDTypeConfig(a=BF16,b=BF16,c=BF16,d=BF16,acc=FP32,blockscale=FP32),arch=Sm90,comm_op=ReduceScatter,gemm_layout=RCR,impl=GemmV3,impl_spec=GemmV3Meta(fast_accum=0,block_scale=0),comm_spec=ReduceScatterMeta(fuse_reduction=0,comm_kind=IntraNode)) RuntimeConfig(m=16384,n=2048,k=4096,comm_spec=ReduceScatterRuntimeConfig(world_size=8,nnodes=1))

TopK=1 (0.842 ms): GemmHParams(impl_spec=GemmV3HParams(cluster_shape=(2,1,1),kernel_schedule=Cooperative),comm_spec=None,tile_shape=(128,256,64),gemm_kind=GemmDefault,mainloop_stage=0,raster_order=RasterHeuristic)
TopK=2 (1.16 ms): GemmHParams(impl_spec=GemmV3HParams(cluster_shape=(1,2,1),kernel_schedule=Cooperative),comm_spec=None,tile_shape=(128,256,64),gemm_kind=GemmDefault,mainloop_stage=0,raster_order=RasterHeuristic)

Is flux simply slower than torch for this shape, and can it be tuned further?
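For reference, the two timing lines from the test can be broken down per component to see where the gap comes from. This is only a minimal sketch: it assumes the log format matches the two lines quoted above, and the regex and the `parse_timings` helper are illustrative names, not part of flux.

```python
import re

# The two timing lines quoted from the test_gemm_rs.py run above.
LOG = """\
torch #1: gemm 0.200 ms, comm 0.405 ms, total 0.605 ms
flux #1: gemm 0.194 ms, comm 0.609 ms, total 0.803 ms
"""

# Assumed line format: "<name> #<iter>: gemm X ms, comm Y ms, total Z ms".
PATTERN = re.compile(
    r"(\w+) #\d+: gemm ([\d.]+) ms, comm ([\d.]+) ms, total ([\d.]+) ms"
)


def parse_timings(log):
    """Return {backend: {"gemm": ms, "comm": ms, "total": ms}} from the log text."""
    timings = {}
    for name, gemm, comm, total in PATTERN.findall(log):
        timings[name] = {
            "gemm": float(gemm),
            "comm": float(comm),
            "total": float(total),
        }
    return timings


timings = parse_timings(LOG)

# Ratio > 1 means flux is slower than torch for that component.
ratios = {
    key: timings["flux"][key] / timings["torch"][key]
    for key in ("gemm", "comm", "total")
}

for key, ratio in ratios.items():
    print(f"{key}: flux/torch = {ratio:.2f}x")
```

With these numbers the GEMM time is roughly equal for both backends, and the difference in total time comes almost entirely from the comm portion.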

Mar 28 '25 03:03 jinchen89

Please provide more information, such as your hardware and software details.

That said, N is too small here, so flux may not perform well. We will fix this later.

Mar 31 '25 03:03 houqi

Which N is too small? My hardware/software information is as follows: torch 2.3.1, CUDA 12.3, NVIDIA H800 GPUs, driver version 535.161.08, GPU memory 81559 MiB, one node with 8 GPUs.

Mar 31 '25 03:03 jinchen89