4-node prefill hang
Normal combine parameters: `num_sms = 24`, `nvl_chunk_size = 1`, `nvl_buffer_size = 512`, `rdma_chunk_size = 8`, `rdma_buffer_size = 128`.
My experiment setup: four nodes, each with 8 H20 GPUs and 8 NICs, connected over InfiniBand.
The logs:

```
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, src RDMA: 1, tail: 576, waiting: 4057, expect: 580
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, src RDMA: 1, tail: 576, waiting: 4054, expect: 577
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, src RDMA: 1, tail: 576, waiting: 4059, expect: 582
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, src RDMA: 1, tail: 576, waiting: 4053, expect: 576
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, src RDMA: 1, tail: 576, waiting: 4055, expect: 578
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, src RDMA: 1, tail: 576, waiting: 4058, expect: 581
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, src RDMA: 1, tail: 576, waiting: 4056, expect: 579
DeepEP combine forwarder (RDMA check) timeout, channel: 5, RDMA: 2, nvl: 0, dst RDMA: 1, head: 384, tail: 512, chunked: 8
DeepEP combine forwarder (RDMA check) timeout, channel: 8, RDMA: 2, nvl: 0, dst RDMA: 1, head: 416, tail: 544, chunked: 8
DeepEP combine forwarder (RDMA check) timeout, channel: 6, RDMA: 2, nvl: 0, dst RDMA: 1, head: 416, tail: 544, chunked: 8
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 5, src RDMA: 1, tail: 624, waiting: 4097, expect: 633
DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 5, src RDMA: 1, tail: 624, waiting: 4094, expect: 630
```
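When many ranks print these timeouts at once, it helps to aggregate them to see which channel/rank pairs are actually stuck (here, channel 5 waiting on src RDMA rank 1). A minimal sketch that parses the log format above and counts timeouts per (kind, channel, RDMA rank) — the helper names are mine, not part of DeepEP:

```python
import re
from collections import Counter

# Matches the DeepEP combine timeout lines shown in this issue.
LOG_RE = re.compile(
    r"DeepEP combine (?P<kind>RDMA receiver|forwarder \(RDMA check\)) timeout, "
    r"channel: (?P<channel>\d+), RDMA: (?P<rdma>\d+), nvl: (?P<nvl>\d+)"
)

def summarize(lines):
    """Count timeouts per (kind, channel, RDMA rank) to locate the stuck path."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if m:
            counts[(m["kind"], int(m["channel"]), int(m["rdma"]))] += 1
    return counts

logs = [
    "DeepEP combine RDMA receiver timeout, channel: 5, RDMA: 2, nvl: 7, "
    "src RDMA: 1, tail: 576, waiting: 4057, expect: 580",
    "DeepEP combine forwarder (RDMA check) timeout, channel: 5, RDMA: 2, "
    "nvl: 0, dst RDMA: 1, head: 384, tail: 512, chunked: 8",
]
for key, n in summarize(logs).items():
    print(key, n)
```

If nearly all timeouts cluster on one channel and one source RDMA rank, that points at a single sender or NIC stalling rather than a global configuration problem.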
Could you analyze this issue? Thanks a lot!
Did you encounter this issue when running test_internode.py?
@sphish Yes.
Can I have your WeChat? @sphish
@kzlxd Okay, my WeChat ID is: Sphizzz.