
Why balance according to batch size in a technical report?

Open nannaer opened this issue 6 months ago • 0 comments

[Figure: Dispatch/Combine latency comparison for balanced vs. unbalanced per-rank batch sizes]

Thank you very much for your contributions to the RTP-LLM inference engine! I have a question about the load balancing strategy in the technical report.

In the DeepEP framework, Dispatch and Combine can each be split into a SEND stage and a RECV stage. SEND is submitted asynchronously, so only the completion of RECV needs to be waited on, which means the latency of Dispatch and Combine is roughly proportional to the number of RECV tokens. I tried to estimate how the per-rank RECV token counts change when the batch size is not balanced across ranks. Consider an unbalanced scenario: each of the 8 GPUs on machine 1 has a per-rank batch size of 384, while each of the 8 GPUs on machine 2 has a per-rank batch size of 64. As the figure shows, the performance loss in the unbalanced case is only (220 − 210) / 210 ≈ 4.8%, which is very small. Therefore, I don't think it is appropriate to balance the load according to batch size.
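To make the estimate above concrete, here is a minimal sketch of the calculation I have in mind. This is not RTP-LLM or DeepEP code; the expert count, top-k value, uniform-routing assumption, and the assumption that experts are placed evenly across ranks are all illustrative, and it ignores per-rank token deduplication and intra-/inter-node locality effects.

```python
# Minimal sketch (not RTP-LLM / DeepEP code): estimate per-rank RECV token
# counts for an expert-parallel Dispatch, assuming uniform top-k routing,
# i.e. every token is equally likely to select any expert.
# All parameter values below are illustrative assumptions, not measurements.

NUM_RANKS = 16          # 2 machines x 8 GPUs
EXPERTS_PER_RANK = 16   # assumed uniform expert placement across ranks
TOP_K = 8               # assumed number of experts selected per token

# Unbalanced case from the question: machine 1 has 384 tokens per rank,
# machine 2 has 64 tokens per rank.
batch_sizes = [384] * 8 + [64] * 8

total_experts = NUM_RANKS * EXPERTS_PER_RANK
total_tokens = sum(batch_sizes)

# Under uniform routing, each of the TOP_K expert choices of every token lands
# on a given rank with probability EXPERTS_PER_RANK / total_experts, so the
# expected RECV count is the same for every rank regardless of how the SEND
# side (the local batch size) is distributed.
expected_recv_per_rank = total_tokens * TOP_K * EXPERTS_PER_RANK / total_experts

print(f"total tokens          : {total_tokens}")
print(f"expected RECV per rank: {expected_recv_per_rank:.1f}")
# -> every rank expects the same RECV count (~1792 here), which is why the
#    Dispatch/Combine latency gap between balanced and unbalanced batch sizes
#    stays small in this simplified model.
```

Under these assumptions the expected RECV count per rank does not depend on the local batch sizes at all, so the RECV-bound Dispatch/Combine latency should only diverge slightly between the balanced and unbalanced cases, consistent with the ~4.8% gap in the figure.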

I would also like to ask how much performance improvement you observe in a real deployment when balancing by batch size versus balancing by KV Cache usage. @feifei14119 @draganmladjenovic @valarLip

nannaer · Jun 23, 2025