Tutel as an MoE backend in Nanotron for Qwen3-MoE 15B (128 experts, top-k=8)
Hello :) I’d like to use Tutel as the MoE layer implementation in Nanotron to train a Qwen3-MoE 15B model from scratch with 128 experts and top-k = 8.
Cluster with SLURM: up to 256 nodes
GPUs: 4× A100 64 GB per node.
Goal: scale across 32–1,024 GPUs with EP/TP/DP/PP.
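For concreteness, here is roughly what I have in mind for the MoE layer itself. This is only a minimal, untested sketch based on Tutel's documented moe_layer API; the model_dim / hidden_size_per_expert values and the expert-parallel world size are placeholders I picked, not the real Qwen3-MoE 15B dimensions, and Tutel's built-in 'ffn' expert is a plain two-layer MLP rather than Qwen3's gated SwiGLU FFN.

```python
# Minimal sketch, assuming Tutel's documented moe_layer API.
# Launch with e.g.: torchrun --nproc_per_node=1 tutel_sketch.py
import torch
import torch.distributed as dist
import torch.nn.functional as F
from tutel import moe as tutel_moe

dist.init_process_group(backend="nccl")     # a training launcher would normally do this
device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
torch.cuda.set_device(device)

model_dim = 2048           # placeholder hidden size (not the real 15B dims)
expert_hidden = 768        # placeholder per-expert FFN size
num_global_experts = 128
ep_world_size = dist.get_world_size()       # in a real run: the expert-parallel group size

moe = tutel_moe.moe_layer(
    gate_type={"type": "top", "k": 8},      # top-k = 8 routing
    model_dim=model_dim,
    experts={
        "type": "ffn",
        "count_per_node": num_global_experts // ep_world_size,  # local experts per rank
        "hidden_size_per_expert": expert_hidden,
        # plain activation; Qwen3 uses a gated SwiGLU FFN, so this is only an approximation
        "activation_fn": lambda x: F.silu(x),
    },
).to(device)

x = torch.randn(4, 1024, model_dim, device=device)   # (batch, seq, hidden)
y = moe(x)                                            # routed output, same shape as x
aux_loss = moe.l_aux                                  # load-balancing auxiliary loss from the gate
print(y.shape, float(aux_loss))
```

With that in mind, my two concrete questions: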
- Is a similar configuration (maybe for Qwen3-30B-A3B) supported out of the box, or are patches required to enable a Tutel backend (e.g., a moe_config.backend: tutel switch)?
- What is the recommended parallelism layout (EP/TP/PP/DP) for 32–1,024 GPUs with 128 experts and k = 8? Any guidance on expert placement to minimize cross-node all-to-all?
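To make the second question concrete, here is the kind of sizing arithmetic I've been doing. It's just my own helper script (not a Nanotron or Tutel API): it enumerates EP/TP/PP/DP factorizations of a given world size, checks that EP divides the 128 experts, and flags whether the expert-parallel all-to-all would stay inside a 4-GPU node under the (assumed) placement where TP is innermost and EP wraps it.

```python
# Rough sizing helper (my own script, not part of Nanotron or Tutel).
NUM_EXPERTS = 128
GPUS_PER_NODE = 4

def layouts(world_size, tp_choices=(1, 2, 4), pp_choices=(1, 2, 4, 8)):
    """Yield (ep, tp, pp, dp, experts_per_rank, intra_node_a2a) with ep*tp*pp*dp == world_size."""
    for tp in tp_choices:
        for pp in pp_choices:
            if world_size % (tp * pp) != 0:
                continue
            rest = world_size // (tp * pp)
            for ep in (e for e in range(1, rest + 1) if rest % e == 0 and NUM_EXPERTS % e == 0):
                dp = rest // ep
                experts_per_rank = NUM_EXPERTS // ep
                # assumption: TP is innermost and EP wraps it, so the all-to-all stays
                # intra-node only if an EP*TP block fits inside one node
                intra_node_a2a = ep * tp <= GPUS_PER_NODE
                yield ep, tp, pp, dp, experts_per_rank, intra_node_a2a

if __name__ == "__main__":
    for ep, tp, pp, dp, per_rank, intra in layouts(world_size=128):
        print(f"EP={ep:4d} TP={tp} PP={pp} DP={dp:4d} -> {per_rank:3d} experts/rank, "
              f"{'intra-node' if intra else 'cross-node'} all-to-all")
```

With only 4 GPUs per node, keeping EP*TP <= 4 makes the all-to-all intra-node but leaves 32+ experts per GPU, which is exactly the trade-off I'd like advice on at the 256–1,024 GPU scale.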
Many thanks!
May I know if Nanotron is still actively maintained? I tried deploying it for Tutel integration, but Nanotron fails even in a uv environment. Is there a Docker environment that is compatible with running it?
Nanotron ran smoothly for training a dense Qwen3 model (in various sizes), but I'm running into issues with the MoE version. Would you recommend switching to plain PyTorch with Tutel? Thanks :)
I’m interested in this topic too.