
Tutel as an MoE backend in Nanotron for Qwen3-MoE 15B (128 experts, top-k=8)

Open hahahaahaa opened this issue 3 months ago • 3 comments

Hello :) I’d like to use Tutel as the MoE layer implementation in Nanotron to train a Qwen3-MoE 15B model from scratch with 128 experts and top-k = 8.

Cluster with SLURM: up to 256 nodes

GPUs: 4× A100 64 GB per node.

Goal: scale across 32–1,024 GPUs with EP/TP/DP/PP.

  1. Is a similar configuration (perhaps for Qwen3-30B-A3B) supported out-of-the-box, or are patches required to enable a Tutel backend (e.g., a moe_config.backend: tutel switch)? I've sketched below what I imagine such a backend would construct per layer.

  2. What parallelism layout (EP/TP/PP/DP) would you recommend for 32–1,024 GPUs with 128 experts and k = 8? Any guidance on expert placement to minimize cross-node all-to-all traffic? My rough layout arithmetic is below as well.
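
To make question 1 concrete, here is roughly what I'd expect a `moe_config.backend: tutel` switch to build per MoE layer, going by the `moe_layer` API in the Tutel README. The `model_dim` / `hidden_size_per_expert` values and `ep_world_size` are placeholders for illustration, not the real Qwen3-MoE 15B dimensions, so please correct me if the construction should look different:

```python
import torch
from tutel import moe as tutel_moe

num_global_experts = 128   # Qwen3-MoE style: 128 routed experts, top-k = 8
ep_world_size = 16         # placeholder: size of the expert-parallel group
model_dim = 2048           # placeholder hidden size, not the real model value

moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 8},                      # top-8 routing
    model_dim=model_dim,
    experts={
        'type': 'ffn',
        'count_per_node': num_global_experts // ep_world_size,  # experts hosted per rank
        'hidden_size_per_expert': 768,                           # placeholder FFN width
        'activation_fn': lambda x: torch.nn.functional.silu(x),
    },
    # keep expert parameters out of the data-parallel all-reduce
    scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
)

device = torch.device('cuda')
moe = moe.to(device)
x = torch.randn(4, 512, model_dim, device=device)
y = moe(x)            # output has the same shape as x
aux_loss = moe.l_aux  # auxiliary load-balancing loss to add to the training loss
```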
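
And for question 2, this is the back-of-the-envelope arithmetic I'm trying to sanity-check. It assumes 4 GPUs per node (as on our cluster), experts sharded evenly over EP ranks, and EP ranks packed onto consecutive GPUs; if Tutel lays out the expert-parallel group differently, the node-span estimate changes:

```python
def describe_layout(total_gpus, ep, tp, pp, num_experts=128, gpus_per_node=4):
    """Derive DP degree, experts per GPU, and how many nodes one
    expert-parallel all-to-all spans for a candidate (EP, TP, PP) layout."""
    assert total_gpus % (ep * tp * pp) == 0, "EP*TP*PP must divide the GPU count"
    assert num_experts % ep == 0, "experts must shard evenly across EP ranks"
    dp = total_gpus // (ep * tp * pp)
    experts_per_gpu = num_experts // ep
    # assumes the EP group occupies consecutive GPUs; if it is wider than a node,
    # every all-to-all crosses the node boundary
    nodes_spanned = max(1, ep // gpus_per_node)
    return dict(DP=dp, experts_per_gpu=experts_per_gpu,
                nodes_spanned_by_all_to_all=nodes_spanned)

# 1,024 GPUs (256 nodes x 4): EP=32, TP=1, PP=4 -> 4 experts/GPU, a2a spans 8 nodes
print(describe_layout(1024, ep=32, tp=1, pp=4))
# 32 GPUs (8 nodes x 4): EP=8, TP=1, PP=1 -> 16 experts/GPU, a2a spans 2 nodes
print(describe_layout(32, ep=8, tp=1, pp=1))
```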

Many thanks!

hahahaahaa avatar Sep 17 '25 14:09 hahahaahaa

May I know whether Nanotron is still actively maintained? I tried deploying it for Tutel integration, but Nanotron fails even in a uv environment. Is there a Docker environment that is compatible with running it?

ghostplant avatar Sep 18 '25 00:09 ghostplant

Nanotron was running smoothly for training dense Qwen3 models (in various sizes), but I'm running into issues with the MoE version. Would you recommend switching to plain PyTorch with Tutel? Thanks :)

hahahaahaa avatar Sep 18 '25 07:09 hahahaahaa

I'm also interested in this topic.

imthebilliejoe avatar Sep 22 '25 10:09 imthebilliejoe