[WIP] Qwen3 MoE support
This PR adds support for Qwen3 MoE (30B-A3B and 235B-A22B) models. Loss looked reasonable from a simple test with 30B-A3B on the Alpaca dataset.
TODO:
- [ ] Tensor/Expert parallel
- [x] Test 235B model
- [x] Verify loss curves against HF implementation
- [x] LoRA support
- [ ] Documentation
- [ ] Tests
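For context on what the new layers do, here is a minimal, simplified sketch of token-choice top-k expert routing of the kind used in Qwen3-30B-A3B style models. This is not the PR's implementation: real Qwen3 experts use gated SwiGLU MLPs and a grouped expert forward, and all module and parameter names below are made up for illustration.

```python
# Minimal sketch (not the PR's code) of top-k MoE routing: each token is routed
# to top_k of num_experts expert MLPs, so only a fraction of parameters is active.
import torch
import torch.nn as nn


class TopKMoE(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, dim] -> flatten tokens for routing
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.router(tokens)                           # [tokens, num_experts]
        weights, indices = torch.topk(logits.softmax(dim=-1), self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            mask = indices == expert_id                        # [tokens, top_k]
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape_as(x)


# Example: a "30B total / ~3B active" layout comes from activating only top_k experts per token.
layer = TopKMoE(dim=64, hidden_dim=128, num_experts=8, top_k=2)
print(layer(torch.randn(2, 4, 64)).shape)  # torch.Size([2, 4, 64])
```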
@intervitens Hi, does this support training with fp8-compatible checkpoints?
Hello, during training I see the prompt: "Saving Qwen3 MoE adapter weights to PEFT format is not supported, saving to torchtune format instead."
How can I obtain a Hugging Face (HF) checkpoint from this? Is there any code example for reference?
Additionally, I only set lora_attn_modules: ['q_proj', 'v_proj', 'output_proj'], apply_lora_to_mlp: False, and apply_lora_to_output: False. Can the tune_to_peft_adapter_weights logic be used in this case? Thank you.
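Not an official answer, but when LoRA is limited to the attention projections, the PEFT export is mostly a key rename. Below is a rough, hypothetical sketch of that idea; the torchtune-side and PEFT-side key patterns are illustrative assumptions, not the actual tune_to_peft_adapter_weights code.

```python
# Hedged sketch (NOT torchtune's converter): rename attention-only LoRA adapter keys
# from an assumed torchtune-style layout to an assumed PEFT-style layout.
import re
import torch

# Assumed projection-name mapping (illustrative only).
TT_TO_HF_PROJ = {"q_proj": "q_proj", "v_proj": "v_proj", "output_proj": "o_proj"}


def remap_adapter_keys(tt_adapter: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    peft_sd = {}
    for key, value in tt_adapter.items():
        # e.g. "layers.0.attn.q_proj.lora_a.weight" (assumed torchtune naming)
        m = re.match(r"layers\.(\d+)\.attn\.(\w+)\.lora_(a|b)\.weight", key)
        if m is None:
            continue  # MLP/output adapter weights would need model-specific handling
        layer, proj, ab = m.groups()
        peft_key = (
            f"base_model.model.model.layers.{layer}.self_attn.{TT_TO_HF_PROJ[proj]}."
            f"lora_{'A' if ab == 'a' else 'B'}.weight"
        )
        peft_sd[peft_key] = value
    return peft_sd
```

A real exporter would also need to emit a matching adapter_config.json (rank, alpha, target_modules) so PEFT can load the renamed weights.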
Why not merge this in?
Has anyone had success compiling the MoE here and training?
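Not a confirmed result, just a way to probe it: the usual blocker is the data-dependent expert dispatch, which tends to cause graph breaks or recompiles under torch.compile. A toy, self-contained sketch (the module below is made up and only mimics the routing pattern, not the PR's MoE):

```python
# Speculative probe, not from the PR: check how torch.compile handles
# data-dependent expert dispatch with variable token counts per expert.
import torch
import torch.nn as nn


class ToyRoutedBlock(nn.Module):
    def __init__(self, dim: int = 64, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x):
        top1 = self.router(x).argmax(dim=-1)   # [tokens], data-dependent routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = top1 == i
            out[sel] = expert(x[sel])          # variable token count per expert
        return out


print(torch._dynamo.explain(ToyRoutedBlock())(torch.randn(8, 64)))  # inspect graph breaks
block = torch.compile(ToyRoutedBlock(), dynamic=True)               # dynamic shapes help here
print(block(torch.randn(8, 64)).shape)
```

If the full model does not compile cleanly, compiling only the dense parts (attention and the per-expert MLPs) is a common fallback.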