Fix race condition with large batch sizes
In internode_ll.cu, the logic is:
```cuda
lane_id == 0 ? atomic_add_release_global(atomic_finish_counter_per_expert + dst_expert_idx, 1) : 0;
...
atomic_add_release_global(atomic_finish_counter_per_expert + i, FINISHED_SUM_TAG);
...
atomic_add_release_global(atomic_finish_counter_per_expert + i, FINISHED_SUM_TAG - sum);
...
while (ld_acquire_global(atomic_finish_counter_per_expert + responsible_expert_idx) != FINISHED_SUM_TAG * 2);
```
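If I read it correctly, the counter for expert i is expected to converge to FINISHED_SUM_TAG * 2: the tag is added once up front, each of the `sum` dispatched tokens adds 1, and the final add of FINISHED_SUM_TAG - sum cancels the per-token increments. A minimal host-side sketch of that arithmetic (my own illustration, not DeepEP code; only FINISHED_SUM_TAG and `sum` come from the snippet above, and the value of `sum` is assumed):

```cpp
#include <cassert>

int main() {
    const int FINISHED_SUM_TAG = 1024;
    const int sum = 800;               // tokens destined for this expert (assumed here, <= 1024)

    int counter = 0;                   // stands in for atomic_finish_counter_per_expert[i]

    counter += FINISHED_SUM_TAG;       // up-front tag for expert i
    counter += sum;                    // each dispatched token adds 1, `sum` in total
    counter += FINISHED_SUM_TAG - sum; // correction added once all sends are issued

    // The waiting warp treats 2 * FINISHED_SUM_TAG as "all tokens arrived".
    assert(counter == FINISHED_SUM_TAG * 2);
    return 0;
}
```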
In my naive understanding, if the batch size is greater than 1024 (which is possible, since an extreme case can have a batch size of 1400), then after the first 1024 tokens have been sent to an expert, the counter can already satisfy the last while condition (since FINISHED_SUM_TAG = 1024), so the signal is sent prematurely, causing bugs.
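To make the interleaving concrete, here is the same arithmetic with a batch size above 1024 (again only an illustration of my reading, not DeepEP code; the 1400-token, single-expert routing is hypothetical):

```cpp
#include <cassert>

int main() {
    const int FINISHED_SUM_TAG = 1024;
    const int sum = 1400;              // hypothetical: batch size 1400, all routed to one expert

    int counter = 0;                   // stands in for atomic_finish_counter_per_expert[i]

    counter += FINISHED_SUM_TAG;       // up-front tag
    counter += 1024;                   // only the first 1024 per-token increments have landed

    // The waiting warp already observes "completion", even though 376 tokens and the
    // (FINISHED_SUM_TAG - sum) correction are still in flight.
    assert(counter == FINISHED_SUM_TAG * 2);

    counter += sum - 1024;             // remaining 376 tokens arrive afterwards
    counter += FINISHED_SUM_TAG - sum; // correction arrives afterwards (negative: -376)
    assert(counter == FINISHED_SUM_TAG * 2);  // same final value, but the signal already fired
    return 0;
}
```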
This fix has been merged into the hybrid-ep branch via https://github.com/deepseek-ai/DeepEP/pull/501.