Chenggang Zhao
For more information, see #166.
> What changes will occur in the end-to-end latency of each RANK? Can it be estimated as max(Dispatch latency) + Expert Group Gemm latency + max(Combine latency)? In such an...
Could you contact me via WeChat (LyricZ_THU)?
> The tokens for each expert aren't contiguous? Yes, and it is by design. Some MoE models make one token select, on average, more than one expert on a GPU...
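As a rough illustration of that routing premise (not DeepEP's actual layout; the expert placement and routing below are made up), a single token can be routed to two experts hosted by the same GPU, so the per-expert token lists overlap and cannot all be one contiguous slice:

```python
import torch

experts_per_gpu = 2  # hypothetical: experts {0, 1} live on this GPU
# Hypothetical top-2 routing: token 0 selects experts 0 and 1, both local
topk_idx = torch.tensor([[0, 1], [0, 3], [1, 2], [2, 3]])

for expert in range(experts_per_gpu):
    # Tokens that route to this expert (in any of their top-k slots)
    tokens = (topk_idx == expert).any(dim=-1).nonzero().flatten()
    print(f'expert {expert} receives tokens {tokens.tolist()}')
# expert 0 receives tokens [0, 1]
# expert 1 receives tokens [0, 2]  <- token 0 is needed by both local experts
```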
> Is there such a possibility that without the modification of gil release, a dispatch timeout is going to happen because of deadlock? With the GIL taken into consideration, a timeout only occurs with...
Could you first rule out environment issues? Do you see a similar problem when running the tests with the Python scripts? Also, are you mixing calls to the normal kernels and the low-latency kernels? If so, the buffer needs to be cleared in between (there is a dedicated function for this). You could also check before every launch whether the `rdma_recv_count` locations are all zeroed (the way to check: barrier first, then check, then barrier again). It is even possible that some other kernel is writing out of bounds. The logic here is that the value starts at 0, and only after another rank sends data that makes it non-zero do we know the true value.
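A minimal sketch of the barrier-check-barrier idea, assuming you can view the `rdma_recv_count` region as a torch tensor; `rdma_recv_count_view` is a placeholder name here, not a DeepEP API:

```python
import torch
import torch.distributed as dist

def assert_recv_count_cleared(rdma_recv_count_view: torch.Tensor, group=None):
    # Barrier first, so no rank is still writing into the buffer
    dist.barrier(group)
    # Every counter should still be zero right before the next launch
    assert int(rdma_recv_count_view.abs().sum().item()) == 0, \
        'rdma_recv_count is not zeroed before launch'
    # Barrier again, so the check finishes before any rank launches the kernel
    dist.barrier(group)
```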
Both of these values can basically be changed freely; setting them wrong should not cause an illegal memory access. Generally, the number of groups is small (in the 2-4 range) and the number of warps within a group is larger (8-12); the product...
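As a quick arithmetic illustration of how the two knobs multiply into a block size (the values below just follow the suggested ranges; 32 threads per warp is a CUDA constant):

```python
num_warp_groups = 4        # suggested range: 2-4
num_warps_per_group = 8    # suggested range: 8-12
threads_per_block = num_warp_groups * num_warps_per_group * 32
print(threads_per_block)   # 1024, which happens to be the CUDA per-block thread limit
```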
You can use compute-sanitizer with PYTORCH_NO_CUDA_MEMORY_CACHING=1 to run memcheck and see whether any other kernel writes out of bounds; the core problem right now is that the buffer is being modified.
How many nodes and GPUs are you using, and what is the rank layout? Can you ensure that, for every k, the GPUs with ranks `[8k, 8k + 8)` are in the same NVLink domain? This...
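For reference, a tiny sketch of the expected layout, assuming 8 GPUs per node and one NVLink domain per node:

```python
def node_of(rank: int, gpus_per_node: int = 8) -> int:
    # Ranks [8k, 8k + 8) should all map to node k, i.e. one NVLink domain
    return rank // gpus_per_node

# e.g. ranks 8..15 should all sit on node 1 and share one NVLink domain
assert all(node_of(r) == 1 for r in range(8, 16))
```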
> But we get a crash when we try to initialize two Buffers with different modes, like this Oh, sorry. NVSHMEM cannot be initialized twice. If you have an engine...
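A minimal sketch of the workaround, assuming the `deep_ep.Buffer` constructor takes the process group, NVL/RDMA buffer sizes, and a `low_latency_mode` flag (treat the exact argument values here as assumptions): build one Buffer per process and reuse it everywhere, since the NVSHMEM state behind it can only be initialized once.

```python
import torch.distributed as dist
import deep_ep

group = dist.new_group(list(range(dist.get_world_size())))

# One Buffer per process; do NOT construct a second one in a different mode later
buffer = deep_ep.Buffer(group, 0, int(1e9), low_latency_mode=True)
# ... route every subsequent dispatch/combine in this process through `buffer` ...
```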