flux
[QUESTION] The stream write in flux/src/moe_ag_scatter/ths_op/gemm_grouped_v3_ag_scatter.cc is very time-consuming. Is there a way to optimize it?
CU_CHECK(CUStreamWriteValue(this->cp_stream_intra_node,(CUdeviceptr)(ptr_offset(barrier_block.get(), src_rank * sizeof(int))),1,CU_STREAM_WRITE_VALUE_DEFAULT));
@wenlei-bao
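This call triggers a copy from huge-page memory to GPU memory, and the resulting GPU-CPU synchronization blocks the preloading of other kernels. Can this be solved? Is there an alternative, such as writing a flag, or using a callback function to avoid the wait?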
It can be solved, but not easily. We would have to:
- put the shape fully on the device side
- call no CUDA runtime APIs and use kernels instead
- make sure memory is pre-allocated
For large batch sizes, the sync is not so bad, but we will record this issue.
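As a rough illustration of the "use kernels instead" point, the flag write could be done by a one-thread kernel launched on the same stream, in place of the stream write-value call. This is only a sketch under assumptions, not flux's implementation: `set_flag_kernel` and `write_flag_and_read_back` are hypothetical names, the barrier here is an ordinary local device buffer (in flux it may be peer-mapped memory), and the buffer is allocated inside the helper only to keep the example self-contained. In real code it would be pre-allocated, as the list above requires. Whether this is actually faster than the stream write has not been measured here.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: a single thread publishes `value` into barrier[src_rank].
__global__ void set_flag_kernel(int* barrier, int src_rank, int value) {
  // Fence before the store; to my understanding the default stream write-value
  // mode also implies a fence before the write (an assumption, not verified).
  __threadfence_system();
  // volatile store so the compiler cannot elide or cache the write.
  reinterpret_cast<volatile int*>(barrier)[src_rank] = value;
}

// Hypothetical demo helper: sets the flag for `src_rank` with a kernel and
// reads it back. Returns the flag value, or -1 if any CUDA call fails.
int write_flag_and_read_back(int world_size, int src_rank) {
  int* barrier = nullptr;
  cudaStream_t stream = nullptr;
  int result = -1;
  const size_t bytes = static_cast<size_t>(world_size) * sizeof(int);

  // Allocation happens here only for the demo; pre-allocate in real code.
  if (cudaMalloc(&barrier, bytes) != cudaSuccess) return -1;

  if (cudaMemset(barrier, 0, bytes) == cudaSuccess &&
      cudaStreamCreate(&stream) == cudaSuccess) {
    // Enqueue the flag write; stream order places it after earlier work
    // submitted to the same stream.
    set_flag_kernel<<<1, 1, 0, stream>>>(barrier, src_rank, 1);

    std::vector<int> host(world_size, 0);
    // The synchronize and copy below exist only to verify the demo result.
    if (cudaStreamSynchronize(stream) == cudaSuccess &&
        cudaMemcpy(host.data(), barrier, bytes,
                   cudaMemcpyDeviceToHost) == cudaSuccess) {
      result = host[src_rank];
    }
    cudaStreamDestroy(stream);
  }
  cudaFree(barrier);
  return result;
}
```

Calling `write_flag_and_read_back(8, 3)` on a machine with a CUDA device should return 1. The synchronize and read-back are there only to check the result. In a real pipeline the consumer would be another kernel or another rank polling the flag.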