How to perform inference on a MoE model with expert parallelism
Hello, I want to run inference on the Hugging Face MoE model Qwen1.5-MoE-A2.7B with expert parallelism using DeepSpeed in a multi-GPU environment. However, the official tutorials are not comprehensive, and despite reviewing the documentation I still don't know how to proceed.
Could anyone help me with this?
Hello,
I have the same question. I came across this link, https://www.deepspeed.ai/tutorials/mixture-of-experts-inference/#initializing-for-inference, which has this code snippet. However, it is not clear where get_model comes from.
import deepspeed
import torch
import torch.distributed as dist

# Set expert-parallel size
world_size = dist.get_world_size()
expert_parallel_size = min(world_size, args.num_experts)

# Create the MoE model
moe_model = get_model(model, ep_size=expert_parallel_size)
...

# Initialize the DeepSpeed-Inference engine
ds_engine = deepspeed.init_inference(moe_model,
                                     mp_size=tensor_slicing_size,
                                     dtype=torch.half,
                                     moe_experts=args.num_experts,
                                     checkpoint=args.checkpoint_path,
                                     replace_with_kernel_inject=True)
model = ds_engine.module
output = model('Input String')
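As far as I can tell, get_model is not a DeepSpeed API; it seems to stand for a user-supplied function that builds the model and wraps its FFN layers with deepspeed.moe.layer.MoE, passing ep_size through. Below is a minimal sketch of what such a function might look like; the HF-style config access and the model.layers / layer.mlp attribute names are placeholders I made up, not the tutorial's actual code.

from deepspeed.moe.layer import MoE

def get_model(model, ep_size):
    # Hypothetical reconstruction: swap each feed-forward block for a DeepSpeed
    # MoE layer whose experts are sharded across `ep_size` ranks.
    hidden_size = model.config.hidden_size    # assumption: HF-style config
    num_experts = args.num_experts            # same `args` as in the snippet above
    for layer in model.layers:                # assumption: model exposes its blocks this way
        layer.mlp = MoE(hidden_size=hidden_size,
                        expert=layer.mlp,     # existing FFN reused as the expert template
                        num_experts=num_experts,
                        ep_size=ep_size,      # expert-parallel group size
                        k=2)                  # DeepSpeed's TopKGate supports k=1 or k=2
    # Note: MoE.forward returns (hidden_states, l_aux, exp_counts), so the
    # surrounding transformer block has to be adapted to unpack that tuple.
    return model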
Hi, do you have a solution? I want to do the same thing: run a Hugging Face model with expert parallelism (EP). Here is my experience:
It's very complicated. The MoE module in DeepSpeed only implements top-1 and top-2 gating. After reviewing the code, I think you need to modify these classes manually:
- TopKGate (from deepspeed.moe.layer import TopKGate): modify it to enable top-4 gating; note that it follows the GShard implementation.
- MoE (from deepspeed.moe.layer import MoE): modify this part so that your MoE layer takes the same arguments and behaves the same as the Qwen MoE block, especially the shared-expert part (see the sketch below).
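This is a minimal sketch of the kind of replacement block you would end up writing, assuming TopKGate has already been patched to accept k=4. The module layout and the sigmoid shared-expert gate are loose placeholders modeled on Qwen1.5-MoE, not actual Qwen or DeepSpeed code.

import torch
import torch.nn as nn
from deepspeed.moe.layer import MoE

class QwenLikeMoEBlock(nn.Module):
    # Sketch of a replacement block: routed experts via DeepSpeed's MoE layer
    # plus an always-on shared expert, loosely mirroring Qwen's MoE block.
    def __init__(self, hidden_size, num_experts, ep_size, expert_mlp, shared_expert_mlp):
        super().__init__()
        self.routed = MoE(hidden_size=hidden_size,
                          expert=expert_mlp,        # one expert module, replicated internally
                          num_experts=num_experts,  # taken from the model config
                          ep_size=ep_size,
                          k=4)                      # requires the TopKGate top-4 patch
        self.shared_expert = shared_expert_mlp
        self.shared_gate = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states):
        routed_out, _aux_loss, _expert_counts = self.routed(hidden_states)
        shared_out = torch.sigmoid(self.shared_gate(hidden_states)) * self.shared_expert(hidden_states)
        return routed_out + shared_out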
So, you have to do a lot of things. Here are my suggestions:
- DeepSeek-V3 implements EP manually; I have seen the code at https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py. The core pattern is sketched after this list.
- Maybe FastMoE can help: https://github.com/laekov/fastmoe.
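To make the first suggestion concrete: the core pattern behind a manual EP implementation is an all-to-all exchange: each rank routes its tokens, sends them to the ranks that own the selected experts, runs its local expert, and sends the results back. The sketch below shows only that general idea (one expert per rank, equal token counts per expert, no capacity handling); it is not the DeepSeek-V3 code.

import torch
import torch.distributed as dist

def expert_parallel_forward(tokens, router, local_expert):
    # Simplified EP dispatch: world_size ranks, one expert per rank, and (for
    # brevity) an equal number of tokens sent to every expert. Real
    # implementations handle variable counts, capacity, and top-k weighting.
    world_size = dist.get_world_size()

    # 1. Route: pick one expert (= one rank) per token.
    expert_ids = router(tokens).argmax(dim=-1)   # [num_tokens], values in [0, world_size)

    # 2. Sort tokens by destination rank so each rank's chunk is contiguous.
    order = torch.argsort(expert_ids)
    sent = tokens[order]

    # 3. All-to-all: every rank sends one equal-sized chunk to every other rank
    #    and receives the tokens its local expert should process.
    received = torch.empty_like(sent)
    dist.all_to_all_single(received, sent)

    # 4. Run the local expert on the tokens received from all ranks.
    processed = local_expert(received)

    # 5. Reverse all-to-all to return results to the ranks that own the tokens.
    returned = torch.empty_like(processed)
    dist.all_to_all_single(returned, processed)

    # 6. Undo the sort so outputs line up with the original token order.
    output = torch.empty_like(returned)
    output[order] = returned
    return output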
I have the same question. Did you solve it? Thanks.
Luckily, I have found a repo that supports MoE EP+TP+PP training: ColossalAI. It now supports DeepSeek (V1 and V3) and Mixtral 8x7B.
- Advantages: it can convert a Hugging Face MoE model into EP+TP+PP mode.
- Disadvantages: the conversion script/logic is very complicated, and you may need some time to dive into it if you want to apply it to a new MoE model. But it supports DeepSeek and Mixtral, so you can follow the examples; a rough sketch of the booster-style setup is below.
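This is only a sketch of the booster-style setup I mean, assuming ColossalAI's MoeHybridParallelPlugin. The plugin and argument names are from memory and the launch signature differs between ColossalAI versions, so check the official Mixtral/DeepSeek example scripts rather than trusting this.

import colossalai
import torch
from colossalai.booster import Booster
from colossalai.booster.plugin import MoeHybridParallelPlugin
from transformers import AutoModelForCausalLM

# Assumption: launch_from_torch() with no arguments; older versions required a
# config dict, so adjust to your installed version.
colossalai.launch_from_torch()

# Assumption: these argument names mirror the MoE examples; verify them there.
plugin = MoeHybridParallelPlugin(
    tp_size=1,        # tensor parallelism degree
    pp_size=1,        # pipeline parallelism degree
    ep_size=4,        # expert parallelism degree
    precision="bf16",
)
booster = Booster(plugin=plugin)

# Mixtral is one of the models the repo's examples cover; a new MoE model such
# as Qwen1.5-MoE would need the conversion work described above.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1",
                                             torch_dtype=torch.bfloat16)
model, *_ = booster.boost(model)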