Tutel as an MoE backend in Nanotron for Qwen3-MoE 15B (128 experts, top-k=8)
Hello :) I’d like to use Tutel as the MoE layer implementation in Nanotron to train a Qwen3-MoE 15B model from scratch with 128 experts and top-k = 8.
Cluster with SLURM: up to 256 nodes
GPUs: 4× A100 64 GB per node.
Goal: scale across 32–1,024 GPUs with EP/TP/DP/PP.
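For concreteness, here is roughly what I have in mind for the MoE layer itself. This is only a minimal, untested sketch based on Tutel's documented moe_layer API; the model_dim / hidden_size_per_expert values and the expert-parallel world size are placeholders I picked, not the real Qwen3-MoE 15B dimensions, and Tutel's built-in 'ffn' expert is a plain two-layer MLP rather than Qwen3's gated SwiGLU FFN.

```python
# Minimal sketch, assuming Tutel's documented moe_layer API.
# Launch with e.g.: torchrun --nproc_per_node=1 tutel_sketch.py
import torch
import torch.distributed as dist
import torch.nn.functional as F
from tutel import moe as tutel_moe

dist.init_process_group(backend="nccl")     # a training launcher would normally do this
device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
torch.cuda.set_device(device)

model_dim = 2048           # placeholder hidden size (not the real 15B dims)
expert_hidden = 768        # placeholder per-expert FFN size
num_global_experts = 128
ep_world_size = dist.get_world_size()       # in a real run: the expert-parallel group size

moe = tutel_moe.moe_layer(
    gate_type={"type": "top", "k": 8},      # top-k = 8 routing
    model_dim=model_dim,
    experts={
        "type": "ffn",
        "count_per_node": num_global_experts // ep_world_size,  # local experts per rank
        "hidden_size_per_expert": expert_hidden,
        # plain activation; Qwen3 uses a gated SwiGLU FFN, so this is only an approximation
        "activation_fn": lambda x: F.silu(x),
    },
).to(device)

x = torch.randn(4, 1024, model_dim, device=device)   # (batch, seq, hidden)
y = moe(x)                                            # routed output, same shape as x
aux_loss = moe.l_aux                                  # load-balancing auxiliary loss from the gate
print(y.shape, float(aux_loss))
```

With that in mind, my two concrete questions: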
- Is a similar configuration (maybe for Qwen3-30B-A3B) supported out of the box, or are patches required to enable a Tutel backend (e.g., a moe_config.backend: tutel switch)?
- What is the recommended parallelism layout (EP/TP/PP/DP) for 32–1,024 GPUs with 128 experts and k = 8? Any guidance on expert placement to minimize cross-node all-to-all?
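To make the second question concrete, here is the kind of sizing arithmetic I've been doing. It's just my own helper script (not a Nanotron or Tutel API): it enumerates EP/TP/PP/DP factorizations of a given world size, checks that EP divides the 128 experts, and flags whether the expert-parallel all-to-all would stay inside a 4-GPU node under the (assumed) placement where TP is innermost and EP wraps it.

```python
# Rough sizing helper (my own script, not part of Nanotron or Tutel).
NUM_EXPERTS = 128
GPUS_PER_NODE = 4

def layouts(world_size, tp_choices=(1, 2, 4), pp_choices=(1, 2, 4, 8)):
    """Yield (ep, tp, pp, dp, experts_per_rank, intra_node_a2a) with ep*tp*pp*dp == world_size."""
    for tp in tp_choices:
        for pp in pp_choices:
            if world_size % (tp * pp) != 0:
                continue
            rest = world_size // (tp * pp)
            for ep in (e for e in range(1, rest + 1) if rest % e == 0 and NUM_EXPERTS % e == 0):
                dp = rest // ep
                experts_per_rank = NUM_EXPERTS // ep
                # assumption: TP is innermost and EP wraps it, so the all-to-all stays
                # intra-node only if an EP*TP block fits inside one node
                intra_node_a2a = ep * tp <= GPUS_PER_NODE
                yield ep, tp, pp, dp, experts_per_rank, intra_node_a2a

if __name__ == "__main__":
    for ep, tp, pp, dp, per_rank, intra in layouts(world_size=128):
        print(f"EP={ep:4d} TP={tp} PP={pp} DP={dp:4d} -> {per_rank:3d} experts/rank, "
              f"{'intra-node' if intra else 'cross-node'} all-to-all")
```

With only 4 GPUs per node, keeping EP*TP <= 4 makes the all-to-all intra-node but leaves 32+ experts per GPU, which is exactly the trade-off I'd like advice on at the 256–1,024 GPU scale.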
Many thanks!
May I know if Nanotron is still actively maintained? I tried deploying it for Tutel integration, but Nanotron fails even in a uv environment. Is there a Docker environment that is compatible with running it?
Nanotron ran smoothly for training a dense Qwen3 model (in various sizes), but I'm running into issues with the MoE version. Would you recommend switching to plain PyTorch with Tutel? Thanks :)
I’m interested in this topic too.