
4-node prefill hang

Open kzlxd opened this issue 5 months ago • 4 comments

Normal combine parameters: num_sms = 24, nvl_chunk_size = 1, nvl_buffer_size = 512, rdma_chunk_size = 8, rdma_buffer_size = 128

My experiment setup: four nodes, each with 8 H20 GPUs and 8 NICs, connected over InfiniBand (IB).

The logs:

DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, src RDMA: 1, tail: 576, waiting: 4057, expect: 580
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, src RDMA: 1, tail: 576, waiting: 4054, expect: 577
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, src RDMA: 1, tail: 576, waiting: 4059, expect: 582
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, src RDMA: 1, tail: 576, waiting: 4053, expect: 576
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, src RDMA: 1, tail: 576, waiting: 4055, expect: 578
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, src RDMA: 1, tail: 576, waiting: 4058, expect: 581
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, src RDMA: 1, tail: 576, waiting: 4056, expect: 579
DeepEP combine forwarder (RDMA check) timeout, channel: 5, RDMA: 2, nvl: 0, dst RDMA: 1, head: 384, tail: 512, chunked: 8
DeepEP combine forwarder (RDMA check) timeout, channel: 8, RDMA: 2, nvl: 0, dst RDMA: 1, head: 416, tail: 544, chunked: 8
DeepEP combine forwarder (RDMA check) timeout, channel: 6, RDMA: 2, nvl: 0, dst RDMA: 1, head: 416, tail: 544, chunked: 8
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 5, src RDMA: 1, tail: 624, waiting: 4097, expect: 633
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 5, src RDMA: 1, tail: 624, waiting: 4094, expect: 630
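For anyone triaging similar output, here is a minimal Python sketch that splits the timeout messages above into structured records so they can be grouped by channel and RDMA rank. It is not part of DeepEP; the function name `parse_timeouts` and the `SAMPLE` string are hypothetical, and the field names are simply taken from the log text as printed. It also computes `tail - head` for the forwarder entries. In the logs above that difference is 128 on every forwarder line, which equals the posted `rdma_buffer_size`; reading that as "the RDMA buffer is full" is an assumption, not something the logs state.

```python
import re

# Two entries copied from the logs in this issue, joined the way they were pasted.
SAMPLE = (
    "DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, "
    "src RDMA: 1, tail: 576, waiting: 4057, expect: 580 "
    "DeepEP combine forwarder (RDMA check) timeout, channel: 5, RDMA: 2, nvl: 0, "
    "dst RDMA: 1, head: 384, tail: 512, chunked: 8"
)


def parse_timeouts(text):
    """Split concatenated timeout messages into dicts (hypothetical helper)."""
    records = []
    # Zero-width split: each chunk starts at a "DeepEP combine " prefix.
    for chunk in re.split(r"(?=DeepEP combine )", text):
        m = re.match(r"DeepEP combine (.+?) timeout, (.*)", chunk.strip(), re.S)
        if not m:
            continue  # skip empty or unrelated text
        # Every field in these messages has the form "name: integer".
        fields = {
            key.strip(): int(value)
            for key, value in re.findall(r"([A-Za-z ]+): (\d+)", m.group(2))
        }
        records.append({"role": m.group(1), **fields})
    return records


records = parse_timeouts(SAMPLE)

for rec in records:
    if "head" in rec:
        # Forwarder entries report both head and tail; show the gap between them.
        print(rec["role"], "channel", rec["channel"], "tail - head =", rec["tail"] - rec["head"])
    else:
        print(rec["role"], "channel", rec["channel"], "tail", rec["tail"], "expect", rec["expect"])
```

Passing the full log text instead of `SAMPLE` gives one record per message, which makes it easier to see which channels and source/destination RDMA ranks are involved.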

Could you analyze this issue? Thanks a lot!

kzlxd avatar Jul 25 '25 02:07 kzlxd

Did you encounter this issue when running test_internode.py?

sphish avatar Jul 29 '25 08:07 sphish

@sphish Yes.

Image

kzlxd avatar Aug 01 '25 05:08 kzlxd

@sphish Can I have your WeChat?

kzlxd avatar Aug 01 '25 05:08 kzlxd

@kzlxd Okay, my WeChat ID is: Sphizzz.

sphish avatar Aug 05 '25 03:08 sphish