jinchen89 issues

Results: 7 issues of jinchen89

[BUG] Cannot compile project

When I run ./build.sh --arch 90 --nvshmem, it reports: ![Image](https://github.com/user-attachments/assets/8819c5cb-ba85-4c49-9a61-a6b848909b90)

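[QUESTION] Hello, running ./launch.sh test/python/gemm_rs/test_gemm_rs.py 16384 4096 16384 --dtype=bfloat16 --iters=5 on an NVIDIA H20 reports this error, while the same image and command work fine on an H800. Is there something different about the H20?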

Hardware/software information is as follows: torch = 2.3.1, CUDA = 12.3, GPU: NVIDIA H20, driver version 535.161.08, GPU memory 97871 MiB, one node with 8 GPUs.

[QUESTION] flux/src/moe_ag_scatter/ths_op/gemm_grouped_v3_ag_scatter.cc: Hello, if I need to apply transpose(0,1) to input_buffer after the allgather in moe_ag, how should I modify the code here? Do I also need to change the code in flux/src/moe_ag_scatter/sm90_gemm_array_warpspecialized_cooperative.hpp?


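[QUESTION] Hello, how should I change flux/test/python/moe_ag_scatter/test_moe_ag.py to set tp size=2 when running on 8 GPUs?

I see that flux/test/python/moe_gather_rs/test_moe_gather_rs.py lets the tp size be set freely. My understanding is that the tp size does not have to equal the number of GPUs. I would appreciate any guidance, thanks!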
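
For context, a tp size smaller than the GPU count usually means the ranks are split into several tensor-parallel groups. The plain-Python sketch below only illustrates that partitioning under the assumption of consecutive rank groups; `tp_groups` is a hypothetical helper, not part of Flux, and whether test_moe_ag.py can run with such groups is exactly what the question asks.

```python
def tp_groups(world_size: int, tp_size: int) -> list[list[int]]:
    """Partition ranks 0..world_size-1 into consecutive groups of tp_size.

    Hypothetical helper for illustration only; not a Flux API.
    """
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    return [
        list(range(start, start + tp_size))
        for start in range(0, world_size, tp_size)
    ]


# 8 GPUs with tp size = 2 gives four independent TP groups.
groups = tp_groups(world_size=8, tp_size=2)
print(groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

In a PyTorch script, each rank list could then be passed to torch.distributed.new_group(ranks=...) to create the matching process group, assuming the test accepts a custom group.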

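[QUESTION] Hello, if I need to add a transpose(0,1) on ctx.inputs after the allgather communication in moe_ag, which part of the code should I modify?

My understanding is that after the allgather, the first dimension of ctx.inputs should be the tp size. Reshaping it directly so that the first dimension is the number of experts in the group (ctx.nexperts_ep) does not seem logical; shouldn't the experts dimension be transposed to the front first? In flux/src/moe_ag_scatter/ths_op/gemm_grouped_v3_ag_scatter.cc, can I simply transpose input_buffer at the end of the all_gather_all2all function? In my tests, inference crashes on the second run.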
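
The difference between reshaping and transposing that this question hinges on can be shown with a small plain-Python sketch. It assumes, as the question does, that the gathered buffer is laid out row-major as [tp_size, n_experts]; that layout is the asker's understanding, not something confirmed for Flux, and `reshape_2d` and `transpose_2d` are hypothetical helpers.

```python
def reshape_2d(flat, rows, cols):
    """View a row-major flat buffer as rows x cols without moving data."""
    assert len(flat) == rows * cols
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def transpose_2d(mat):
    """Swap the two dimensions, which reorders the elements."""
    return [list(col) for col in zip(*mat)]


tp_size, n_experts = 2, 3
# "r{rank}e{expert}" stands for the chunk a rank contributed for an expert.
# Assumed layout after allgather: [tp_size, n_experts], row-major.
flat = [f"r{r}e{e}" for r in range(tp_size) for e in range(n_experts)]

gathered = reshape_2d(flat, tp_size, n_experts)
reshaped = reshape_2d(flat, n_experts, tp_size)  # reinterprets the same memory
transposed = transpose_2d(gathered)              # moves experts to dim 0

print(reshaped[0])    # ['r0e0', 'r0e1']  mixes two experts in one row
print(transposed[0])  # ['r0e0', 'r1e0']  expert 0 from every rank
```

Under that assumed layout, only the transpose groups each expert's chunks together. Note also that in PyTorch, transpose(0, 1) returns a non-contiguous view, so code that reads the raw buffer pointer would need a contiguous copy first; whether that relates to the crash described here is not established.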

[QUESTION] Is Flux slightly slower than torch?

Hello, when I run ./launch.sh test/python/gemm_rs/test_gemm_rs.py 16384 2048 16384 --dtype=bfloat16 --iters=10, the output is: torch #1: gemm 0.200 ms, comm 0.405 ms, total 0.605 ms flux #1: gemm 0.194 ms, comm 0.609 ms, total 0.803 ms...
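
To make the gap concrete, the short sketch below computes the flux/torch ratio for each component using only the iteration #1 numbers quoted above. The rest of the output is truncated in the issue, so this covers a single sample and is not a benchmark conclusion.

```python
# Timings in milliseconds, copied from the quoted iteration #1 output.
torch_ms = {"gemm": 0.200, "comm": 0.405, "total": 0.605}
flux_ms = {"gemm": 0.194, "comm": 0.609, "total": 0.803}

# Ratio > 1 means flux took longer than torch for that component.
ratios = {name: flux_ms[name] / torch_ms[name] for name in torch_ms}

for name, ratio in ratios.items():
    print(f"{name}: flux/torch = {ratio:.2f}x")
# gemm: flux/torch = 0.97x
# comm: flux/torch = 1.50x
# total: flux/torch = 1.33x
```

In this sample the GEMM itself is marginally faster under Flux, and the slower total comes from the communication time.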

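[QUESTION] The stream write in flux/src/moe_ag_scatter/ths_op/gemm_grouped_v3_ag_scatter.cc is very time-consuming; is there a way to optimize it? CU_CHECK(CUStreamWriteValue(this->cp_stream_intra_node, (CUdeviceptr)(ptr_offset(barrier_block.get(), src_rank * sizeof(int))), 1, CU_STREAM_WRITE_VALUE_DEFAULT));

@wenlei-bao This triggers a copy from huge-page memory to GPU memory, and the resulting GPU-CPU synchronization blocks other kernels from being preloaded. Can this be solved? Is there an alternative approach, such as writing a flag, or using a callback to avoid the wait?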

enhancement