[GPU] Qwen3 MoE support
Support running the Qwen3 MoE model with the GPU plugin.
Details:
- Fuse the MoE subgraph into a single moe_expert op to reduce the total number of ops and improve compile_model and inference performance.
- moe_expert primitive execution stages:
  - First token: uses the oneDNN GEMM kernel pipeline plus optimized OpenCL kernels (gather, scatter) for MoE execution; experts are executed serially.
  - Second token: uses optimized OpenCL kernels (mlp_gate_up, mlp_down, softmax_topk, reduce) to execute multiple experts in parallel.
- The MoE weights of each layer are allocated in a single USM memory buffer, and a sub-memory is created from it for each expert's weight/scale/zp memory, which helps the second-token expert kernels run in parallel.
- Optimize the key_cache and value_cache inputs.
- MoE support is limited to u4 weights, f16 scales, u4 zero points (zp), and group_size=128, which is what the Qwen3 MoE 30B model requires.
- Only systolic GPUs (A770/B580/ARL/LNL) are supported; MTL is not supported because the first token needs to call oneDNN GEMM kernels.
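To make the quantization and memory-layout bullets above concrete, here is a minimal pure-Python sketch. It is not the plugin's code. It assumes the usual asymmetric scheme `w = (q - zp) * scale` with one scale and one zero point per group of 128 weights. The nibble order in `unpack_u4` and the equal-sized per-expert slices in `expert_views` are also assumptions, and all function names are hypothetical.

```python
# Illustrative sketch only: names, nibble order and buffer layout are assumptions,
# not the GPU plugin's actual implementation.
GROUP_SIZE = 128  # group size stated in the PR description


def unpack_u4(packed: bytes) -> list[int]:
    """Split each byte into two 4-bit values (low nibble first, by assumption)."""
    out = []
    for b in packed:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out


def dequantize_row(q: list[int], scales: list[float], zps: list[int]) -> list[float]:
    """Dequantize one weight row with one scale/zero point per group of GROUP_SIZE."""
    assert len(q) % GROUP_SIZE == 0
    assert len(scales) == len(zps) == len(q) // GROUP_SIZE
    return [(v - zps[i // GROUP_SIZE]) * scales[i // GROUP_SIZE]
            for i, v in enumerate(q)]


def expert_views(buffer: bytearray, num_experts: int) -> list[memoryview]:
    """Carve one shared allocation into equally sized, zero-copy per-expert views."""
    assert len(buffer) % num_experts == 0
    step = len(buffer) // num_experts
    view = memoryview(buffer)
    return [view[e * step:(e + 1) * step] for e in range(num_experts)]
```

The `memoryview` slices play the role of the sub-memories: every expert addresses its own region of one allocation, so no per-expert copies are needed and independent kernels could read their regions concurrently.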
MoE fusion result
Original MoE (containing 128 experts) exec graph: (image omitted)
With this PR, it becomes a single moe_expert op: (image omitted)
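For orientation, the computation that a fused MoE subgraph of this kind has to reproduce can be sketched in plain Python: router softmax with top-k selection, a gated MLP per selected expert, and a weighted reduce. This is a functional reference under stated assumptions, not the moe_expert kernel. The SiLU-gated MLP, the renormalization of the top-k weights, and every function name here are assumptions for illustration.

```python
import math


def softmax_topk(logits, k):
    """Return (expert_index, weight) pairs for the k largest softmax probabilities.

    Weights are renormalized to sum to 1 (an assumption about the routing scheme).
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def expert_mlp(x, w_gate, w_up, w_down):
    """Gated MLP: down(silu(gate(x)) * up(x)); the activation is an assumption."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def moe_forward(x, w_router, experts, k):
    """Route one token to k experts and sum their outputs, weighted by the router."""
    routing = softmax_topk(matvec(w_router, x), k)
    out = [0.0] * len(x)
    for idx, weight in routing:  # serial loop over the selected experts
        y = expert_mlp(x, *experts[idx])
        out = [o + weight * v for o, v in zip(out, y)]
    return out
```

The loop in `moe_forward` visits the selected experts one after another. Because each expert's contribution is independent until the final sum, the same work can also be spread across experts in parallel and combined by a reduce step.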
TODO:
- [ ] Support more MoE patterns; currently only the Qwen3 MoE pattern is verified and supported.
- [ ] Integrate the optimized CM kernel for second-token MoE.
- [ ] Align the CM kernel to use the same scale/zp layout as the OpenCL kernel.
- [ ] Support more MoE data types: u8 weights.
- [ ] Support other subgroup sizes: 32, 64, 256...
Tickets:
- CVS-166011, CVS-168901, CVS-169299
@praasz @mitruska do we plan to introduce some kind of "internal" specification for OV internal ops? It may not need to be as formal as for the official opset. PagedAttention is covered by some presentation; for RoPE we only have some comments in the code (as far as I know).
@itikhono Directory for internal ops specifications exists here: https://github.com/openvinotoolkit/openvino/tree/master/docs/articles_en/documentation/openvino-ir-format/operation-sets/operation-specs/internal Feel free to contribute.
Added.
@CuriousPanCake could you take a look?
Hi, @riverlijunjie could you tell us who told you to choose exactly this way?
- You are changing the common approach with Constants to do... what?
- You are trying to reduce the complexity of a graph by fusing a set of subgraphs into a single node, like a function in other frameworks.
- Did you try to fuse it with Loop + If + Slice/Gather? What were the results?
- Maybe we just need to introduce a new operation, and that will solve the issue?
Maybe if you do it in plugin logic, without serialization, it might be accepted, but right now it raises a lot of questions.
@gkrivor I just sent an email, let's discuss them in the email.
This PR will be closed in a week because of 2 weeks of no activity.
This PR was closed because it has been stalled for 2 weeks with no activity.