[QUESTION] The stream write in flux/src/moe_ag_scatter/ths_op/gemm_grouped_v3_ag_scatter.cc is very time-consuming. Is there a way to optimize it? CU_CHECK(CUStreamWriteValue(this->cp_stream_intra_node,(CUdeviceptr)(ptr_offset(barrier_block.get(), src_rank * sizeof(int))),1,CU_STREAM_WRITE_VALUE_DEFAULT));

Open jinchen89 opened this issue 6 months ago • 1 comments

@wenlei-bao

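This call triggers a copy from huge-page memory to GPU memory, and the resulting GPU-CPU synchronization prevents other kernels from being preloaded. Can this be solved? Is there an alternative, such as writing a flag, or using a callback function to avoid the wait?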
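One way to read the "write a flag" idea is to set the barrier entry from a one-thread kernel enqueued on the copy stream, so the store stays ordered after earlier work on that stream without going through the stream memory operation. The following is only a minimal sketch under stated assumptions: `set_flag_kernel`, `write_flag_and_read_back`, `world_size`, and the plain `cudaMalloc` buffer are hypothetical stand-ins, not Flux code, and whether this actually removes the stall described above has not been measured.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(expr)                                          \
  do {                                                            \
    cudaError_t err_ = (expr);                                    \
    if (err_ != cudaSuccess) {                                    \
      fprintf(stderr, "%s failed: %s\n", #expr,                   \
              cudaGetErrorString(err_));                          \
      exit(1);                                                    \
    }                                                             \
  } while (0)

// Hypothetical helper: a single thread stores the flag value.
// The fence makes the store visible system-wide (peer GPUs, host).
__global__ void set_flag_kernel(int *flag, int value) {
  *flag = value;
  __threadfence_system();
}

// Hypothetical driver: sets barrier[src_rank] = 1 in stream order and
// returns the value read back from device memory.
int write_flag_and_read_back(int world_size, int src_rank) {
  int *barrier = nullptr;  // stand-in for barrier_block in the issue
  cudaStream_t cp_stream;  // stand-in for cp_stream_intra_node
  const size_t bytes = world_size * sizeof(int);

  CUDA_CHECK(cudaMalloc(&barrier, bytes));
  CUDA_CHECK(cudaStreamCreateWithFlags(&cp_stream, cudaStreamNonBlocking));
  CUDA_CHECK(cudaMemsetAsync(barrier, 0, bytes, cp_stream));

  // Enqueued on the copy stream, so it runs after prior work on it.
  set_flag_kernel<<<1, 1, 0, cp_stream>>>(barrier + src_rank, 1);
  CUDA_CHECK(cudaGetLastError());

  std::vector<int> host(world_size);
  CUDA_CHECK(cudaMemcpyAsync(host.data(), barrier, bytes,
                             cudaMemcpyDeviceToHost, cp_stream));
  CUDA_CHECK(cudaStreamSynchronize(cp_stream));

  int value = host[src_rank];
  CUDA_CHECK(cudaStreamDestroy(cp_stream));
  CUDA_CHECK(cudaFree(barrier));
  return value;
}

int main() {
  int value = write_flag_and_read_back(8, 3);
  printf("barrier[3] = %d\n", value);
  return value == 1 ? 0 : 1;
}
```

The trade-off is that a kernel launch has its own launch overhead and occupies a slot on the GPU, so it would need to be profiled against the existing stream write before drawing any conclusion.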

Jun 05 '25 07:06 jinchen89

It can be solved, but it is not an easy fix. have to

For large batch sizes, the sync is not so bad, but we will keep track of this issue.

Jul 23 '25 22:07 houqi