Chenggang Zhao comments

Results: 84 comments of Chenggang Zhao

Could normal kernel and ll kernel be executed in same process?

Yes, it is possible, and you don't have to do anything; it is fully automatic. `NVSHMEM_IB_ENABLE_IBGDA` only initializes the IBGDA configs at setup, but it has no effect for normal...

Can DeepEP run correctly in cudaGraph mode?

1. Yes, but only for the normal kernels; 2. Yes. If you want to drop tokens, you should do so at the gate (masking some `topk_idx` entries to `-1`); DeepEP supports ignoring...
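As an illustration of masking at the gate, here is a minimal sketch in plain Python. The helper name `mask_over_capacity` and the per-expert capacity policy are assumptions made for this example, not part of DeepEP; the only thing taken from the comment is that dropped selections are marked by setting their `topk_idx` entry to `-1`.

```python
from typing import List


def mask_over_capacity(
    topk_idx: List[List[int]], num_experts: int, capacity: int
) -> List[List[int]]:
    """Return a copy of topk_idx with over-capacity selections set to -1.

    Hypothetical drop policy: tokens are visited in order, and once an
    expert has received `capacity` selections, later ones are masked.
    """
    load = [0] * num_experts  # selections accepted so far, per expert
    masked = []
    for row in topk_idx:
        new_row = []
        for expert in row:
            if expert >= 0 and load[expert] < capacity:
                load[expert] += 1
                new_row.append(expert)
            else:
                # -1 marks a dropped (or already masked) selection
                new_row.append(-1)
        masked.append(new_row)
    return masked


# Four tokens, top-2 routing, four experts, at most two selections per expert.
topk_idx = [[0, 1], [0, 2], [0, 1], [2, 3]]
masked_topk_idx = mask_over_capacity(topk_idx, num_experts=4, capacity=2)
print(masked_topk_idx)  # [[0, 1], [0, 2], [-1, 1], [2, 3]]
```

In a real pipeline the same masking would be applied to the gate's index tensor before it is handed to dispatch; the drop policy itself (capacity, priority, randomness) is up to the caller.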

Can DeepEP run correctly in cudaGraph mode?

> will the utilization still be that fast when gating selection is imbalanced

The overall performance will be bound by the imbalanced rank. In terms of the imbalanced rank...

Suggestions for Improving the Readability of DeepEP Code

Thank you for your thoughtful feedback and for taking the time to study DeepEP's codebase so thoroughly! I really appreciate your kind words about the engineering quality, and your suggestions...

Suggestions for Improving the Readability of DeepEP Code

BTW, we are also planning a full refactor (better performance, fewer SMs, better readability), perhaps in several months :)

Suggestions for Improving the Readability of DeepEP Code

You can also share a Chinese version in the issues (a new issue is OK too), or as a forked repo link or blog link 👍🏻

About the number of messages chunked in IBGDA

Assume the message size (at most ~KB level) is much smaller than the page size (i.e. `NVSHMEM_CUMEM_GRANULARITY`, normally very large, >100 MB). So the worst case of getting the local/remote key is,...
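To picture why the page size matters here, the sketch below assumes that a key lookup is only valid within a single page, so a transfer must be cut wherever either the local or the remote buffer crosses a page boundary. The function `split_at_page_boundaries` and the tiny 1024-byte page are hypothetical and chosen for readability; this is not DeepEP or NVSHMEM code.

```python
from typing import List, Tuple


def split_at_page_boundaries(
    local_addr: int, remote_addr: int, size: int, page_size: int
) -> List[Tuple[int, int, int]]:
    """Split a transfer into (local_addr, remote_addr, nbytes) chunks.

    Each chunk stays inside one local page and one remote page.
    """
    chunks = []
    offset = 0
    while offset < size:
        local = local_addr + offset
        remote = remote_addr + offset
        # Bytes left before the next page boundary on each side.
        local_left = page_size - local % page_size
        remote_left = page_size - remote % page_size
        nbytes = min(size - offset, local_left, remote_left)
        chunks.append((local, remote, nbytes))
        offset += nbytes
    return chunks


# A 64-byte message whose local and remote buffers each cross one boundary,
# at different offsets, is split into three chunks.
chunks = split_at_page_boundaries(
    local_addr=1000, remote_addr=2040, size=64, page_size=1024
)
print(chunks)  # [(1000, 2040, 8), (1008, 2048, 16), (1024, 2064, 40)]
```

With a message much smaller than the page, each side can cross at most one boundary, which is why the chunk count stays small in this model.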

About the number of messages chunked in IBGDA

You can ignore that note, as the while loop can process more than 3 chunks. We tried some code simplification and optimizations here for the theoretical maximum, but it...

Fix MMA promotion interval assertions

Has anyone replied to this? I do think it's a serious bug: setting `BLOCK_SIZE_K=256` made the FP8 training loss curve much worse than non-FP8-fast-accum.

error when testing test_internode.sh deep_ep.cpp:83 'an illegal memory access was encountered'

Can you please set `test_ll_compatibility = False`? Testing normal and low-latency kernels separately may solve this deconstruction issue on your platform.