Chenggang Zhao
For more information, see #166.
> What changes will occur in the end-to-end latency of each RANK? Can it be estimated as max(Dispatch latency) + Expert Group Gemm latency + max(Combine latency)? In such an...
Could you contact me via WeChat (LyricZ_THU)?
> The tokens for each expert aren't contiguous? Yes, and it is by design. Some MoE models make one token select, on average, more than one expert on a GPU...
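As a rough illustration of that routing premise (not DeepEP's actual layout; the expert placement and routing below are made up), a single token can be routed to two experts hosted by the same GPU, so the per-expert token lists overlap and cannot all be one contiguous slice:

```python
import torch

experts_per_gpu = 2  # hypothetical: experts {0, 1} live on this GPU
# Hypothetical top-2 routing: token 0 selects experts 0 and 1, both local
topk_idx = torch.tensor([[0, 1], [0, 3], [1, 2], [2, 3]])

for expert in range(experts_per_gpu):
    # Tokens that route to this expert (in any of their top-k slots)
    tokens = (topk_idx == expert).any(dim=-1).nonzero().flatten()
    print(f'expert {expert} receives tokens {tokens.tolist()}')
# expert 0 receives tokens [0, 1]
# expert 1 receives tokens [0, 2]  <- token 0 is needed by both local experts
```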
> Is there such a possibility that without the modification of gil release, a dispatch timeout is going to happen because of deadlock? With the GIL taken into consideration, a timeout only occurs with...
Could you first rule out environment issues? Do you see a similar problem when running the tests with the Python scripts? Also, are you mixing calls to the normal kernels and the low-latency kernels? If so, the buffer needs to be cleared in between (there is a dedicated function for this). You could also check before every launch whether the `rdma_recv_count` locations are all zeroed (the way to check: barrier first, then check, then barrier again). It is even possible that some other kernel is writing out of bounds. The logic here is that the value starts at 0, and only after another rank sends data that makes it non-zero do we know the true value.
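A minimal sketch of the barrier-check-barrier idea, assuming you can view the `rdma_recv_count` region as a torch tensor; `rdma_recv_count_view` is a placeholder name here, not a DeepEP API:

```python
import torch
import torch.distributed as dist

def assert_recv_count_cleared(rdma_recv_count_view: torch.Tensor, group=None):
    # Barrier first, so no rank is still writing into the buffer
    dist.barrier(group)
    # Every counter should still be zero right before the next launch
    assert int(rdma_recv_count_view.abs().sum().item()) == 0, \
        'rdma_recv_count is not zeroed before launch'
    # Barrier again, so the check finishes before any rank launches the kernel
    dist.barrier(group)
```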
Both of these values can basically be changed freely; setting them wrong should not cause an illegal memory access. Generally, the number of groups is small (in the 2-4 range) and the number of warps within a group is larger (8-12); the product...
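As a quick arithmetic illustration of how the two knobs multiply into a block size (the values below just follow the suggested ranges; 32 threads per warp is a CUDA constant):

```python
num_warp_groups = 4        # suggested range: 2-4
num_warps_per_group = 8    # suggested range: 8-12
threads_per_block = num_warp_groups * num_warps_per_group * 32
print(threads_per_block)   # 1024, which happens to be the CUDA per-block thread limit
```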
You can use compute-sanitizer with PYTORCH_NO_CUDA_MEMORY_CACHING=1 to run memcheck and see whether any other kernel writes out of bounds; the core problem right now is that the buffer is being modified.
How many nodes and GPUs are you using, and what is the rank layout? Can you ensure that, for every k, the GPUs with ranks `[8k, 8k + 8)` are in the same NVLink domain? This...
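For reference, a tiny sketch of the expected layout, assuming 8 GPUs per node and one NVLink domain per node:

```python
def node_of(rank: int, gpus_per_node: int = 8) -> int:
    # Ranks [8k, 8k + 8) should all map to node k, i.e. one NVLink domain
    return rank // gpus_per_node

# e.g. ranks 8..15 should all sit on node 1 and share one NVLink domain
assert all(node_of(r) == 1 for r in range(8, 16))
```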
> But we get a crash when we try to initialize two Buffers with different modes, like this Oh, sorry. NVSHMEM cannot be initialized twice. If you have an engine...
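A minimal sketch of the workaround, assuming the `deep_ep.Buffer` constructor takes the process group, NVL/RDMA buffer sizes, and a `low_latency_mode` flag (treat the exact argument values here as assumptions): build one Buffer per process and reuse it everywhere, since the NVSHMEM state behind it can only be initialized once.

```python
import torch.distributed as dist
import deep_ep

group = dist.new_group(list(range(dist.get_world_size())))

# One Buffer per process; do NOT construct a second one in a different mode later
buffer = deep_ep.Buffer(group, 0, int(1e9), low_latency_mode=True)
# ... route every subsequent dispatch/combine in this process through `buffer` ...
```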