[Feature] Expert parallelism support
Checklist
- [x] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 2. Please use English, otherwise it will be closed.
Motivation
Hi team, first of all, thanks so much for such a great project. Is there a plan to support Expert Parallelism (EP) for MoE models?
Related resources
https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html
https://github.com/sgl-project/sglang/blob/441c22db8cbcb005b5f005b991e8aa1a65d79bb6/python/sglang/srt/models/mixtral_quant.py#L86-L150
this is an early example
@merrymercy Hi, has any progress been made on this issue? The example you provided previously used a plain MLP rather than FusedMOE. How can we enable expert parallelism with the current Mixtral/DeepSeek-v2 models now that they use FusedMOE? Do you have a modified example?
related #1970
@merrymercy I see that this issue is mainly related to TP and DP. I noticed that the SGLang Q4 roadmap #1487 mentioned supporting this feature.
@liangzelang DP has already been merged (only for DeepSeek right now) in https://github.com/sgl-project/sglang/pull/1970 and EP will be supported soon. cc @ispobock
@zhyncs Is MoE-EP supported yet? I have implemented MoE-EP.
@xiaobochen123 We are going to implement it with a DP + EP approach for throughput gains. Currently, DP attention is implemented. Before we start on EP, the MoE codebase needs some updates.
I am interested in what kind of MoE-EP you implemented and what codebase you used. How large are the performance gains compared to TP?
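For readers new to the term, here is a minimal sketch of the idea behind expert parallelism: each rank holds a subset of the experts, and routed tokens are sent to whichever rank owns their expert. This is not SGLang's implementation or the code from any PR in this thread. The function names (`expert_owner`, `dispatch`), the contiguous expert-to-rank layout, and the routing decisions are all assumptions made up for illustration, and the ranks are simulated in a single process.

```python
from collections import defaultdict


def expert_owner(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Return the rank that holds an expert, assuming contiguous partitioning."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size in this sketch")
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank


def dispatch(token_to_experts, num_experts: int, ep_size: int):
    """Group (token_id, expert_id) pairs by owning rank.

    In a real system this grouping would drive an all-to-all exchange;
    here it only builds the per-rank work lists.
    """
    per_rank = defaultdict(list)
    for token_id, experts in enumerate(token_to_experts):
        for expert_id in experts:
            rank = expert_owner(expert_id, num_experts, ep_size)
            per_rank[rank].append((token_id, expert_id))
    return dict(per_rank)


# Hypothetical top-2 routing for 3 tokens, 8 experts spread over 4 ranks.
routing = [[0, 5], [1, 6], [7, 2]]
plan = dispatch(routing, num_experts=8, ep_size=4)
for rank in sorted(plan):
    print(rank, plan[rank])
```

With TP, by contrast, every rank would hold a slice of every expert, so no token dispatch is needed but each expert's weights are split across ranks.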
Done in https://github.com/sgl-project/sglang/pull/2203