Fix race condition with large batch sizes
In internode_ll.cu, the logic is:
```cuda
lane_id == 0 ? atomic_add_release_global(atomic_finish_counter_per_expert + dst_expert_idx, 1) : 0;
...
atomic_add_release_global(atomic_finish_counter_per_expert + i, FINISHED_SUM_TAG);
...
atomic_add_release_global(atomic_finish_counter_per_expert + i, FINISHED_SUM_TAG - sum);
...
while (ld_acquire_global(atomic_finish_counter_per_expert + responsible_expert_idx) != FINISHED_SUM_TAG * 2);
```
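If I read it correctly, the counter for expert i is expected to converge to FINISHED_SUM_TAG * 2: the tag is added once up front, each of the `sum` dispatched tokens adds 1, and the final add of FINISHED_SUM_TAG - sum cancels the per-token increments. A minimal host-side sketch of that arithmetic (my own illustration, not DeepEP code; only FINISHED_SUM_TAG and `sum` come from the snippet above, and the value of `sum` is assumed):

```cpp
#include <cassert>

int main() {
    const int FINISHED_SUM_TAG = 1024;
    const int sum = 800;               // tokens destined for this expert (assumed here, <= 1024)

    int counter = 0;                   // stands in for atomic_finish_counter_per_expert[i]

    counter += FINISHED_SUM_TAG;       // up-front tag for expert i
    counter += sum;                    // each dispatched token adds 1, `sum` in total
    counter += FINISHED_SUM_TAG - sum; // correction added once all sends are issued

    // The waiting warp treats 2 * FINISHED_SUM_TAG as "all tokens arrived".
    assert(counter == FINISHED_SUM_TAG * 2);
    return 0;
}
```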
In my naive understanding, if the batch size is greater than 1024 (which is possible, since an extreme case can have a batch size of 1400), then after the first 1024 tokens have been sent to an expert, the counter can already satisfy the last while condition (since FINISHED_SUM_TAG = 1024), so the signal is sent prematurely, causing bugs.
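To make the interleaving concrete, here is the same arithmetic with a batch size above 1024 (again only an illustration of my reading, not DeepEP code; the 1400-token, single-expert routing is hypothetical):

```cpp
#include <cassert>

int main() {
    const int FINISHED_SUM_TAG = 1024;
    const int sum = 1400;              // hypothetical: batch size 1400, all routed to one expert

    int counter = 0;                   // stands in for atomic_finish_counter_per_expert[i]

    counter += FINISHED_SUM_TAG;       // up-front tag
    counter += 1024;                   // only the first 1024 per-token increments have landed

    // The waiting warp already observes "completion", even though 376 tokens and the
    // (FINISHED_SUM_TAG - sum) correction are still in flight.
    assert(counter == FINISHED_SUM_TAG * 2);

    counter += sum - 1024;             // remaining 376 tokens arrive afterwards
    counter += FINISHED_SUM_TAG - sum; // correction arrives afterwards (negative: -376)
    assert(counter == FINISHED_SUM_TAG * 2);  // same final value, but the signal already fired
    return 0;
}
```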
This fix has been merged into the hybrid-ep branch via https://github.com/deepseek-ai/DeepEP/pull/501.