[WIP] Qwen3 MoE support
This PR adds support for Qwen3 MoE (30B-A3B and 235B-A22B) models. Loss looked reasonable from a simple test with 30B-A3B on the Alpaca dataset.
TODO:
- [ ] Tensor/Expert parallel
- [x] Test 235B model
- [x] Verify loss curves against HF implementation
- [x] LoRA support
- [ ] Documentation
- [ ] Tests
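For context on what the new layers do, here is a minimal, simplified sketch of token-choice top-k expert routing of the kind used in Qwen3-30B-A3B style models. This is not the PR's implementation: real Qwen3 experts use gated SwiGLU MLPs and a grouped expert forward, and all module and parameter names below are made up for illustration.

```python
# Minimal sketch (not the PR's code) of top-k MoE routing: each token is routed
# to top_k of num_experts expert MLPs, so only a fraction of parameters is active.
import torch
import torch.nn as nn


class TopKMoE(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, dim] -> flatten tokens for routing
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.router(tokens)                           # [tokens, num_experts]
        weights, indices = torch.topk(logits.softmax(dim=-1), self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            mask = indices == expert_id                        # [tokens, top_k]
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape_as(x)


# Example: a "30B total / ~3B active" layout comes from activating only top_k experts per token.
layer = TopKMoE(dim=64, hidden_dim=128, num_experts=8, top_k=2)
print(layer(torch.randn(2, 4, 64)).shape)  # torch.Size([2, 4, 64])
```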
@intervitens Hi, does this support training with fp8-compatible checkpoints?
Hello, during training I see the prompt: "Saving Qwen3 MoE adapter weights to PEFT format is not supported, saving to torchtune format instead."
How can I obtain a Hugging Face (HF) checkpoint from this? Is there any code example for reference?
Additionally, I only set lora_attn_modules: ['q_proj', 'v_proj', 'output_proj'], apply_lora_to_mlp: False, and apply_lora_to_output: False. Can the tune_to_peft_adapter_weights logic be used in this case? Thank you.
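Not an official answer, but when LoRA is limited to the attention projections, the PEFT export is mostly a key rename. Below is a rough, hypothetical sketch of that idea; the torchtune-side and PEFT-side key patterns are illustrative assumptions, not the actual tune_to_peft_adapter_weights code.

```python
# Hedged sketch (NOT torchtune's converter): rename attention-only LoRA adapter keys
# from an assumed torchtune-style layout to an assumed PEFT-style layout.
import re
import torch

# Assumed projection-name mapping (illustrative only).
TT_TO_HF_PROJ = {"q_proj": "q_proj", "v_proj": "v_proj", "output_proj": "o_proj"}


def remap_adapter_keys(tt_adapter: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    peft_sd = {}
    for key, value in tt_adapter.items():
        # e.g. "layers.0.attn.q_proj.lora_a.weight" (assumed torchtune naming)
        m = re.match(r"layers\.(\d+)\.attn\.(\w+)\.lora_(a|b)\.weight", key)
        if m is None:
            continue  # MLP/output adapter weights would need model-specific handling
        layer, proj, ab = m.groups()
        peft_key = (
            f"base_model.model.model.layers.{layer}.self_attn.{TT_TO_HF_PROJ[proj]}."
            f"lora_{'A' if ab == 'a' else 'B'}.weight"
        )
        peft_sd[peft_key] = value
    return peft_sd
```

A real exporter would also need to emit a matching adapter_config.json (rank, alpha, target_modules) so PEFT can load the renamed weights.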
Why not merge this in?
Has anyone had success compiling the MoE here and training?
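Not a confirmed result, just a way to probe it: the usual blocker is the data-dependent expert dispatch, which tends to cause graph breaks or recompiles under torch.compile. A toy, self-contained sketch (the module below is made up and only mimics the routing pattern, not the PR's MoE):

```python
# Speculative probe, not from the PR: check how torch.compile handles
# data-dependent expert dispatch with variable token counts per expert.
import torch
import torch.nn as nn


class ToyRoutedBlock(nn.Module):
    def __init__(self, dim: int = 64, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x):
        top1 = self.router(x).argmax(dim=-1)   # [tokens], data-dependent routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = top1 == i
            out[sel] = expert(x[sel])          # variable token count per expert
        return out


print(torch._dynamo.explain(ToyRoutedBlock())(torch.randn(8, 64)))  # inspect graph breaks
block = torch.compile(ToyRoutedBlock(), dynamic=True)               # dynamic shapes help here
print(block(torch.randn(8, 64)).shape)
```

If the full model does not compile cleanly, compiling only the dense parts (attention and the per-expert MLPs) is a common fallback.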