
Can deep_ep run on environments with more than 160 ranks?

Open ZhenguoYao1 opened this issue 2 months ago • 3 comments

I noticed a restriction in `/csrc/deep_ep.cpp`:

`EP_HOST_ASSERT(0 <= rank && rank < num_ranks && (num_ranks <= NUM_MAX_NVL_PEERS * NUM_MAX_RDMA_PEERS || low_latency_mode));`

where `NUM_MAX_NVL_PEERS = 8` and `NUM_MAX_RDMA_PEERS = 20`. This implies the rank count cannot exceed 160 (8 * 20).

I tested this on a 24-node cluster, and the assertion was triggered. So my question is: does deep_ep actually support training on clusters with more than 20 nodes, or am I misunderstanding the restriction?

ZhenguoYao1 avatar Oct 21 '25 09:10 ZhenguoYao1

Typically, a training job may run on more than 20 nodes, but the EP (expert-parallel) group within that job usually does not exceed 160 ranks.

sphish avatar Oct 23 '25 01:10 sphish

@ZhenguoYao1 Using gb200 support from https://github.com/fzyzcjy/DeepEP/tree/feat/dev_20250914, you can scale beyond 160.

goelayu avatar Oct 30 '25 20:10 goelayu

@goelayu Did you modify this https://github.com/fzyzcjy/DeepEP/blob/483f00af8490b0cc378823c6adecf9ea67602071/csrc/kernels/launch.cuh#L54 to scale up the ranks?

elvircrn avatar Nov 11 '25 21:11 elvircrn