How to perform inference on a MoE model with expert parallelism
Hello, I want to run inference on the Hugging Face MoE model Qwen1.5-MoE-A2.7B with expert parallelism using DeepSpeed in a multi-GPU environment. However, the official tutorials are not comprehensive, and despite reviewing the documentation I still don't know how to proceed.
Could anyone help me with this?
Hello,
I have the same question. I came across this link, https://www.deepspeed.ai/tutorials/mixture-of-experts-inference/#initializing-for-inference, which has this code snippet. However, it is not clear where get_model comes from.
import deepspeed
import torch
import torch.distributed as dist

# Set expert-parallel size
world_size = dist.get_world_size()
expert_parallel_size = min(world_size, args.num_experts)

# Create the MoE model
moe_model = get_model(model, ep_size=expert_parallel_size)
...

# Initialize the DeepSpeed-Inference engine
ds_engine = deepspeed.init_inference(moe_model,
                                     mp_size=tensor_slicing_size,
                                     dtype=torch.half,
                                     moe_experts=args.num_experts,
                                     checkpoint=args.checkpoint_path,
                                     replace_with_kernel_inject=True)
model = ds_engine.module
output = model('Input String')
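As far as I can tell, get_model is not a DeepSpeed API; it seems to stand for a user-supplied function that builds the model and wraps its FFN layers with deepspeed.moe.layer.MoE, passing ep_size through. Below is a minimal sketch of what such a function might look like; the HF-style config access and the model.layers / layer.mlp attribute names are placeholders I made up, not the tutorial's actual code.

from deepspeed.moe.layer import MoE

def get_model(model, ep_size):
    # Hypothetical reconstruction: swap each feed-forward block for a DeepSpeed
    # MoE layer whose experts are sharded across `ep_size` ranks.
    hidden_size = model.config.hidden_size    # assumption: HF-style config
    num_experts = args.num_experts            # same `args` as in the snippet above
    for layer in model.layers:                # assumption: model exposes its blocks this way
        layer.mlp = MoE(hidden_size=hidden_size,
                        expert=layer.mlp,     # existing FFN reused as the expert template
                        num_experts=num_experts,
                        ep_size=ep_size,      # expert-parallel group size
                        k=2)                  # DeepSpeed's TopKGate supports k=1 or k=2
    # Note: MoE.forward returns (hidden_states, l_aux, exp_counts), so the
    # surrounding transformer block has to be adapted to unpack that tuple.
    return model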
Hi, do you have a solution? I want to do the same thing: run a Hugging Face model with expert parallelism (EP). Here is my experience:
It's very complicated. The MoE module in DeepSpeed only implements top-1 and top-2 gating. After reviewing the code, I think you need to modify these classes manually:
- TopKGate (from deepspeed.moe.layer import TopKGate): modify it to enable top-4 gating; note that it follows the GShard implementation.
- MoE (from deepspeed.moe.layer import MoE): modify this part so that your MoE layer takes the same arguments and behaves the same as the Qwen MoE block, especially the shared-expert part (see the sketch below).
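This is a minimal sketch of the kind of replacement block you would end up writing, assuming TopKGate has already been patched to accept k=4. The module layout and the sigmoid shared-expert gate are loose placeholders modeled on Qwen1.5-MoE, not actual Qwen or DeepSpeed code.

import torch
import torch.nn as nn
from deepspeed.moe.layer import MoE

class QwenLikeMoEBlock(nn.Module):
    # Sketch of a replacement block: routed experts via DeepSpeed's MoE layer
    # plus an always-on shared expert, loosely mirroring Qwen's MoE block.
    def __init__(self, hidden_size, num_experts, ep_size, expert_mlp, shared_expert_mlp):
        super().__init__()
        self.routed = MoE(hidden_size=hidden_size,
                          expert=expert_mlp,        # one expert module, replicated internally
                          num_experts=num_experts,  # taken from the model config
                          ep_size=ep_size,
                          k=4)                      # requires the TopKGate top-4 patch
        self.shared_expert = shared_expert_mlp
        self.shared_gate = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states):
        routed_out, _aux_loss, _expert_counts = self.routed(hidden_states)
        shared_out = torch.sigmoid(self.shared_gate(hidden_states)) * self.shared_expert(hidden_states)
        return routed_out + shared_out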
So, you have to do a lot of things. Here are my suggestions:
- DeepSeek-V3 implements EP manually; I have seen the code at https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py. The core pattern is sketched after this list.
- Maybe FastMoE can help: https://github.com/laekov/fastmoe.
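To make the first suggestion concrete: the core pattern behind a manual EP implementation is an all-to-all exchange: each rank routes its tokens, sends them to the ranks that own the selected experts, runs its local expert, and sends the results back. The sketch below shows only that general idea (one expert per rank, equal token counts per expert, no capacity handling); it is not the DeepSeek-V3 code.

import torch
import torch.distributed as dist

def expert_parallel_forward(tokens, router, local_expert):
    # Simplified EP dispatch: world_size ranks, one expert per rank, and (for
    # brevity) an equal number of tokens sent to every expert. Real
    # implementations handle variable counts, capacity, and top-k weighting.
    world_size = dist.get_world_size()

    # 1. Route: pick one expert (= one rank) per token.
    expert_ids = router(tokens).argmax(dim=-1)   # [num_tokens], values in [0, world_size)

    # 2. Sort tokens by destination rank so each rank's chunk is contiguous.
    order = torch.argsort(expert_ids)
    sent = tokens[order]

    # 3. All-to-all: every rank sends one equal-sized chunk to every other rank
    #    and receives the tokens its local expert should process.
    received = torch.empty_like(sent)
    dist.all_to_all_single(received, sent)

    # 4. Run the local expert on the tokens received from all ranks.
    processed = local_expert(received)

    # 5. Reverse all-to-all to return results to the ranks that own the tokens.
    returned = torch.empty_like(processed)
    dist.all_to_all_single(returned, processed)

    # 6. Undo the sort so outputs line up with the original token order.
    output = torch.empty_like(returned)
    output[order] = returned
    return output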
I have the same question. Did you solve it? Thanks.
Luckily, I have found a repo that supports MoE EP+TP+PP training: ColossalAI. It now supports DeepSeek (V1 and V3) and Mixtral 8x7B.
- Advantages: it can convert a Hugging Face MoE model into EP+TP+PP mode.
- Disadvantages: the conversion script/logic is very complicated, and you may need some time to dive into it if you want to apply it to a new MoE model. But it supports DeepSeek and Mixtral, so you can follow the examples; a rough sketch of the booster-style setup is below.
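This is only a sketch of the booster-style setup I mean, assuming ColossalAI's MoeHybridParallelPlugin. The plugin and argument names are from memory and the launch signature differs between ColossalAI versions, so check the official Mixtral/DeepSeek example scripts rather than trusting this.

import colossalai
import torch
from colossalai.booster import Booster
from colossalai.booster.plugin import MoeHybridParallelPlugin
from transformers import AutoModelForCausalLM

# Assumption: launch_from_torch() with no arguments; older versions required a
# config dict, so adjust to your installed version.
colossalai.launch_from_torch()

# Assumption: these argument names mirror the MoE examples; verify them there.
plugin = MoeHybridParallelPlugin(
    tp_size=1,        # tensor parallelism degree
    pp_size=1,        # pipeline parallelism degree
    ep_size=4,        # expert parallelism degree
    precision="bf16",
)
booster = Booster(plugin=plugin)

# Mixtral is one of the models the repo's examples cover; a new MoE model such
# as Qwen1.5-MoE would need the conversion work described above.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1",
                                             torch_dtype=torch.bfloat16)
model, *_ = booster.boost(model)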