Chenggang Zhao
Yes, it is possible, and you don't have to do anything; it is fully automatic. `NVSHMEM_IB_ENABLE_IBGDA` only initializes the IBGDA configs at setup, but it has no effect for normal...
1. Yes, but only for the normal kernels; 2. Yes. If you want to drop tokens, you should do it at the gate (masking some `topk_idx` entries to `-1`), DeepEP supports ignoring...
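To illustrate the gate-side masking mentioned above, here is a minimal sketch in plain Python. Only the convention of writing `-1` into `topk_idx` comes from the comment; the helper name `drop_over_capacity`, the per-expert `capacity` policy, and the use of nested lists instead of a tensor are all assumptions made for illustration, not DeepEP's API.

```python
from collections import defaultdict


def drop_over_capacity(topk_idx, capacity):
    """Return a copy of topk_idx with over-capacity selections set to -1.

    Hypothetical policy: each expert keeps the first `capacity` tokens that
    select it (in token order); later selections are masked.
    """
    load = defaultdict(int)  # tokens accepted so far, per expert
    masked = []
    for row in topk_idx:
        new_row = []
        for expert in row:
            if expert >= 0 and load[expert] < capacity:
                load[expert] += 1
                new_row.append(expert)
            else:
                # Dropped (or already masked) selection
                new_row.append(-1)
        masked.append(new_row)
    return masked


# 4 tokens, top-2 routing; expert 0 is oversubscribed
topk_idx = [[0, 1], [0, 2], [0, 1], [0, 3]]
masked = drop_over_capacity(topk_idx, capacity=2)
print(masked)
```

In a real gate the same masking would be applied to the routing index tensor before it is handed to dispatch; the drop policy itself (capacity, score threshold, random) is up to the model.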
> will the utilization still be that fast when gating selection is imbalanced

The overall performance will be bounded by the imbalanced rank. In terms of the imbalanced rank...
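A rough way to see how imbalanced a given routing is: count how many tokens each rank receives and compare the busiest rank to the average. This is a sketch under stated assumptions, not a DeepEP utility: the function names are hypothetical, experts are assumed to be laid out contiguously across ranks, and a token is assumed to be sent to a rank once even if several of its selected experts live there.

```python
def tokens_per_rank(topk_idx, num_experts, num_ranks):
    """Count tokens received by each rank (assumes contiguous expert layout)."""
    experts_per_rank = num_experts // num_ranks
    counts = [0] * num_ranks
    for row in topk_idx:
        # Deduplicate ranks per token; -1 entries are dropped selections
        for rank in {e // experts_per_rank for e in row if e >= 0}:
            counts[rank] += 1
    return counts


def imbalance_ratio(counts):
    """Busiest rank's load divided by the mean load (1.0 means balanced)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean > 0 else 1.0


# 4 tokens, top-2 routing, 4 experts over 2 ranks (experts 0-1 on rank 0)
topk_idx = [[0, 1], [0, 2], [1, 3], [0, 1]]
counts = tokens_per_rank(topk_idx, num_experts=4, num_ranks=2)
ratio = imbalance_ratio(counts)
print(counts, ratio)
```

Since every rank has to wait for the slowest one, a ratio well above 1.0 suggests the busiest rank sets the end-to-end time.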
Thank you for your thoughtful feedback and for taking the time to study DeepEP's codebase so thoroughly! I really appreciate your kind words about the engineering quality, and your suggestions...
BTW, we are also planning a full refactor (better performance, fewer SMs, better readability), maybe several months later :)
You can share a Chinese version in the issues as well (a new issue is also OK), or a forked repo link or blog link 👍🏻
Assume the message size (at most ~KB level) is much smaller than the page size (i.e. `NVSHMEM_CUMEM_GRANULARITY`, normally very large, >100 MB). Then the worst case of getting the local/remote key is,...
You can ignore that note, as the while loop can process more than 3 chunks. We tried some code simplification and optimizations here for the theoretical maximum, but it...
Has anyone replied to this? I do think it's a serious bug: setting `BLOCK_SIZE_K=256` made the FP8 training loss curve much worse than non-FP8-fast-accum.
Can you please set `test_ll_compatibility = False`? Testing normal and low-latency kernels separately may solve this deconstruction issue on your platform.