
[QUESTION] The stream write in flux/src/moe_ag_scatter/ths_op/gemm_grouped_v3_ag_scatter.cc is very time-consuming. Is there a way to optimize it? CU_CHECK(CUStreamWriteValue(this->cp_stream_intra_node,(CUdeviceptr)(ptr_offset(barrier_block.get(), src_rank * sizeof(int))),1,CU_STREAM_WRITE_VALUE_DEFAULT));

Open jinchen89 opened this issue 6 months ago • 1 comments

@wenlei-bao


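This triggers a copy from huge-page memory to GPU memory, and the GPU-CPU synchronization blocks other kernels from being preloaded. Can this be fixed? Is there an alternative, such as writing a flag, or using a callback to avoid the wait?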

jinchen89 avatar Jun 05 '25 07:06 jinchen89

It can be solved, but it is not easy. We would have to:

  • put the shape information fully on the device side
  • call no CUDA runtime APIs and use kernels instead
  • make sure memory is pre-allocated

For large batch sizes, the sync is not so bad, but we will record this issue.
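
One possible reading of "use kernels" and of the "write a flag" idea from the question is to set the barrier slot from a tiny kernel enqueued on the same stream, rather than through a stream memory operation. The sketch below is an illustration under assumptions, not Flux's implementation: `write_flag_kernel` and `signal_rank_ready_demo` are hypothetical names, the barrier is a plain `cudaMalloc` buffer standing in for `barrier_block`, and whether this is actually faster than the stream write in this code path has not been measured here.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Hypothetical helper: a single thread publishes `value` into one barrier slot.
__global__ void write_flag_kernel(int* flag, int value) {
  // Conservative system-scope fence before publishing the flag.
  __threadfence_system();
  // Volatile store so the write goes to memory rather than staying in a register.
  *reinterpret_cast<volatile int*>(flag) = value;
}

// Hypothetical demo: allocates a barrier of `world_size` ints (pre-allocated
// once in real code), sets the slot for `src_rank` from a kernel, and reads
// the slot back. Returns the slot value, or -1 on a CUDA error.
int signal_rank_ready_demo(int world_size, int src_rank) {
  assert(src_rank >= 0 && src_rank < world_size);

  cudaStream_t stream = nullptr;
  int* barrier = nullptr;
  if (cudaStreamCreate(&stream) != cudaSuccess) return -1;
  if (cudaMalloc(&barrier, world_size * sizeof(int)) != cudaSuccess) return -1;
  if (cudaMemsetAsync(barrier, 0, world_size * sizeof(int), stream) != cudaSuccess) return -1;

  // Stream order guarantees earlier work on `stream` has finished before
  // this kernel runs, so the flag is set only after that work completes.
  write_flag_kernel<<<1, 1, 0, stream>>>(barrier + src_rank, 1);
  if (cudaGetLastError() != cudaSuccess) return -1;

  // The synchronize and copy below exist only to verify the demo; a real
  // overlap path would leave the flag on the device for consumer kernels.
  if (cudaStreamSynchronize(stream) != cudaSuccess) return -1;
  int host_value = 0;
  if (cudaMemcpy(&host_value, barrier + src_rank, sizeof(int),
                 cudaMemcpyDeviceToHost) != cudaSuccess) return -1;

  cudaFree(barrier);
  cudaStreamDestroy(stream);
  return host_value;
}
```

Calling `signal_rank_ready_demo(8, 3)` from a host `main` exercises the flag write for rank 3 of 8. A kernel launch is still a host-side call, so this only addresses the stream write itself. Removing the host from the loop entirely would also need the other two items in the list above (device-side shapes and pre-allocated memory).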

houqi avatar Jul 23 '25 22:07 houqi