
[Call for contributions] The development plan of large-scale EP support in TensorRT-LLM

Open · juney-nvidia opened this issue 6 months ago

Big thanks to the DeepSeek team for their awesome work! Recently, large-scale fine-grained MoE models have been gaining popularity, but they also bring new optimization challenges (and opportunities) for LLM inference systems. One key technique to make models like DeepSeek V3/R1 run efficiently is large-scale EP (Expert Parallelism) – it not only leverages aggregated memory bandwidth to reduce latency but also helps maximize compute utilization.
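
To make the dispatch problem concrete, here is a minimal, purely illustrative Python sketch (not TensorRT-LLM code; names such as `expert_to_rank` and `dispatch_plan` are assumptions made for this example) of how the experts of one MoE layer can be partitioned across EP ranks, and why every MoE layer then needs an all-to-all dispatch:

```python
# Minimal, self-contained sketch (plain Python, no real communication) of
# expert parallelism: the experts of one MoE layer are partitioned across GPUs,
# so routed tokens must be exchanged via an all-to-all before the expert FFNs.
import random

num_experts = 256   # fine-grained MoE layer, DeepSeek-V3/R1-like scale
ep_size     = 32    # GPUs participating in expert parallelism
top_k       = 8     # experts selected per token by the router

# Static, even partition: each EP rank hosts num_experts / ep_size experts,
# so each GPU only keeps its slice of the expert weights in HBM.
experts_per_rank = num_experts // ep_size
expert_to_rank = {e: e // experts_per_rank for e in range(num_experts)}

def dispatch_plan(token_topk_experts):
    """Count, per destination rank, how many (token, expert) activations must
    be sent there. In a real system this send-count matrix is what drives the
    MoE all-to-all (A2A) kernels; here we only compute the plan."""
    send_counts = [0] * ep_size
    for experts in token_topk_experts:
        for e in experts:
            send_counts[expert_to_rank[e]] += 1
    return send_counts

# Fake a batch of routed tokens and inspect the resulting per-rank traffic.
batch = [random.sample(range(num_experts), top_k) for _ in range(1024)]
print(dispatch_plan(batch))
```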

Getting large-scale EP to work well isn't easy, and we really appreciate the DeepSeek team sharing their insights and optimization tricks through both their tech report and open-source code (DeepEP and EPLB). Shoutout to the SGLang team too, who recently did great work implementing large-scale EP using DeepSeek's components plus their own innovations!

On the TensorRT-LLM side, we've been working on large-scale EP support for a while. Our approach might differ slightly from other solutions – we're particularly focused on supporting NVIDIA's latest hardware (like GB200) as well as other architectures (B200, Hopper, etc.).

We're also putting extra effort into designing an end-to-end system that handles both large-scale EP execution and dynamic workload balancing to adapt to real-time traffic changes, making deployment smoother for users. To be clear, we don't think these ideas are unique to TensorRT-LLM – in fact, we're pretty sure teams like DeepSeek have already implemented similar approaches in their internal systems (judging from their published tech report). We've learned a ton from DeepSeek's paper and code, and we're grateful they've shared their work with the community!

Motivated by DeepSeek's work, and also to make TensorRT-LLM's technical execution more transparent and give the community a channel to engage with TensorRT-LLM core development at an early stage, we are now sharing our concrete plan for supporting large-scale EP in TensorRT-LLM to gather early feedback. Your comments, suggestions, and contributions are highly appreciated:

  • Communication component
    • Customized MoE A2A communication kernels for large-scale EP
      • [Done] GB200 support @dongxuy04
      • [Ongoing] B200/Hopper support @Tailing Yuan @jhaotingc @Meng Wang
        • Being investigated now. For this specific area there is great work from DeepSeek (DeepEP) and Perplexity (PPLX), and based on our current, limited understanding each has its pros and cons, so rather than rushing the integration we are doing more technical due diligence to figure out a reasonable technical solution.
  • EP balancer component (most of the work for this component can be applied to multiple GPU architectures)
    • [Ongoing] Statistics and Routing kernels @dongxuy04
    • [Ongoing] Remapping - synchronization logics @dongxuy04
    • [Ongoing] Replication and placement logics @dongxuy04 (see the illustrative sketch after this list)
    • [Ongoing] FusedMoE module changes @wm2012011492
    • [Ongoing] Experts loading and sharing @dongxuy04
  • E2E workflow integration
  • Performance tuning/analysis/optimization
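
To illustrate the kind of logic the "Replication and placement" item above refers to, here is a toy, self-contained Python sketch of a load-based heuristic in the spirit of DeepSeek's open-source EPLB. It is not the TensorRT-LLM implementation, and every function and parameter name is an assumption made for the example:

```python
# Toy sketch of a load-based replication + placement heuristic for EP balancing.
def place_experts(expert_loads, ep_size, num_slots_per_rank):
    """Greedy heuristic: replicate the hottest experts into spare slots, then
    assign each replica to the least-loaded rank that still has a free slot.

    expert_loads: observed token count per logical expert (what the statistics
                  kernels would produce in a real system).
    Returns a list of (expert_id, rank) placements.
    """
    num_experts = len(expert_loads)
    total_slots = ep_size * num_slots_per_rank
    assert total_slots >= num_experts, "need at least one slot per expert"

    # Give the spare slots to the hottest experts as extra replicas.
    replicas = {e: 1 for e in range(num_experts)}
    spare = total_slots - num_experts
    for e in sorted(range(num_experts), key=lambda i: -expert_loads[i])[:spare]:
        replicas[e] += 1

    rank_load = [0.0] * ep_size
    slots_used = [0] * ep_size
    placements = []
    # Place heavier (per-replica) experts first, balancing estimated load.
    for e in sorted(replicas, key=lambda i: -expert_loads[i] / replicas[i]):
        share = expert_loads[e] / replicas[e]
        for _ in range(replicas[e]):
            free = [r for r in range(ep_size) if slots_used[r] < num_slots_per_rank]
            r = min(free, key=lambda i: rank_load[i])
            rank_load[r] += share
            slots_used[r] += 1
            placements.append((e, r))
    return placements

# Example: 8 logical experts with skewed load, 4 EP ranks, 3 slots per rank.
print(place_experts([900, 500, 120, 80, 60, 40, 30, 10],
                    ep_size=4, num_slots_per_rank=3))
```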

To make it easier for the community to understand what we are doing now and what we plan to do, here is the high-level design overview by @dongxuy04 (thanks to Dongxu for the great technical work behind the current design):

[High-level design overview diagram]

We are also considering initiating a detailed design review and discussion with the community if there is enough interest, to help the community understand the current plan and to encourage engagement.

Thanks

The TensorRT-LLM Engineering Team

juney-nvidia · May 07 '25 14:05

Great to hear this! @juney-nvidia, do we have a plan to set up analytic models for EP partitioning?

It is generally believed that EP should be evenly distributed across nodes and GPUs. However, the cost of the all-to-all operation will differ from team to team due to bandwidth budget limitations.

For example, with the common configuration of GPU:NIC = 1:1, the bandwidth required by the EP operation increases only slightly with the number of EP ranks. This may not hold for other configurations, where the bandwidth becomes a bottleneck at some EP size.

This suggests that we need to set up an analytic model or an empirical data sheet for this purpose, and I guess it will be easy for teams familiar with setting up pretraining infrastructure.

I guess the 800 Gb NVL system can be fully studied by the NVIDIA team, so the community may be interested in:

  • 400 Gb RoCE vs InfiniBand (IB)
  • 3.2 Tb aggregation network performance with EP operations, and an estimate of the best EP size
  • a study of general NIC cards and models of their impact

I would like to help work on the relevant issues (with intra-node tests).
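
As a starting point for that discussion, here is a deliberately simplified, self-contained Python sketch of such an analytic model. Every formula and constant in it is a simplifying assumption for illustration (uniform routing, no kernel latency terms, no compute/communication overlap), not measured data or an official model:

```python
# First-cut analytic model of MoE all-to-all (dispatch) time per layer.
def a2a_time_per_layer_us(tokens_per_gpu, hidden_size, bytes_per_elem, top_k,
                          ep_size, gpus_per_node, intra_gbps, inter_gbps):
    """Estimate one-direction dispatch time in microseconds for one MoE layer.

    With uniform routing, roughly (ep_size - gpus_per_node) / ep_size of the
    duplicated (top_k) token traffic leaves the node over the NIC and the rest
    stays on NVLink; the slower of the two paths bounds the layer.
    """
    bytes_out = tokens_per_gpu * hidden_size * bytes_per_elem * top_k
    inter_frac = max(ep_size - gpus_per_node, 0) / ep_size
    t_inter = bytes_out * inter_frac * 8 / (inter_gbps * 1e9) * 1e6        # us
    t_intra = bytes_out * (1 - inter_frac) * 8 / (intra_gbps * 1e9) * 1e6  # us
    return max(t_inter, t_intra)

# Example: 128 tokens/GPU in decode, hidden size 7168, FP8 activations, top-8
# routing, 8 GPUs per node, 400 Gb/s NIC per GPU vs 900 GB/s of NVLink.
for ep in (8, 16, 32, 64, 128):
    t = a2a_time_per_layer_us(128, 7168, 1, 8, ep, 8,
                              intra_gbps=900 * 8, inter_gbps=400)
    print(f"EP={ep:4d}: ~{t:7.1f} us per layer (dispatch only)")
```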

@yiakwy-xpu-ml-framework-team

Sorry for the delayed response.

Let me confirm my understanding: you are interested in building a cost model or analytical model to guide the best EP partition strategy, given a concrete model architecture (DeepSeek R1 here) and the network HW bandwidth (both intra- and inter-node), correct?

June

juney-nvidia · May 20 '25 14:05

Is there any plan to support two-batch overlap, or any other solution to hide MoE communication time, especially for hardware platforms with only RDMA-based node-to-node communication?
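
For clarity, here is a conceptual, self-contained Python sketch of what two-batch overlap means in this context; the stage names and the schedule are assumptions for illustration, not an actual TensorRT-LLM feature or schedule:

```python
# Conceptual sketch (plain Python, no GPU code) of two-batch overlap: split
# each iteration into two micro-batches and interleave them so that the MoE
# all-to-all of one micro-batch runs while the other micro-batch computes.
def two_batch_overlap_schedule(num_layers):
    """Return (step, description) pairs showing which micro-batch is computing
    and which is communicating at each step of one forward pass."""
    # One MoE layer alternates compute and communication stages.
    stages = ["attn + router (compute)", "A2A dispatch (comm)",
              "expert FFN (compute)", "A2A combine (comm)"]
    total = num_layers * len(stages)
    schedule = []
    for t in range(total + 1):
        # Micro-batch 1 runs one stage behind micro-batch 0, so whenever one
        # of them is in a communication stage the other is in a compute stage.
        mb0 = stages[t % 4] if t < total else "idle (drain)"
        mb1 = stages[(t - 1) % 4] if 0 < t <= total else "idle (warm-up/drain)"
        schedule.append((t, f"mb0: {mb0:24s} | mb1: {mb1}"))
    return schedule

for step, desc in two_batch_overlap_schedule(num_layers=2):
    print(f"step {step}: {desc}")
```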

chongxing · Jul 07 '25 12:07