
Are Qwen3 pretraining architectural features fully supported now?

Open tjoymeed opened this issue 6 months ago • 14 comments


Hi Team,

Thanks a lot for your excellent work!

Are Qwen3 pretraining architectural features fully supported now?

Could you please provide an architectural feature list with support status?

Thanks again!

tjoymeed avatar Jun 10 '25 20:06 tjoymeed

Hi, we support general pretraining (without reasoning or long context extension), as well as full and parameter-efficient finetuning.

cuichenx avatar Jun 10 '25 23:06 cuichenx

What about MoE pretraining?

tjoymeed avatar Jun 11 '25 00:06 tjoymeed

All Qwen 3 variants are supported, including 6 dense models and 2 MoE models.

cuichenx avatar Jun 11 '25 08:06 cuichenx

The Qwen 3 pretraining actually has 3 stages:

  1. pretraining for 4096 context length;
  2. pretraining for enhancing reasoning;
  3. pretraining for extending to long context.

So NeMo currently supports stage 1 (with all architectural features), but not stages 2 or 3?

Thanks!

tjoymeed avatar Jun 11 '25 09:06 tjoymeed

Yes, what you described is correct.

cuichenx avatar Jun 11 '25 11:06 cuichenx

Okay, when will stage 3 pretraining be fully supported? Thanks!

tjoymeed avatar Jun 11 '25 21:06 tjoymeed

All Qwen 3 variants are supported, including 6 dense models and 2 MoE models.

Could you please tell me where to find the recipe for Qwen3 MoE pretraining?

tjoymeed avatar Jun 11 '25 21:06 tjoymeed

We're working on better long context training support right now, but I don't have any near term ETA to share with you at this time.

Qwen3 MoE recipes can be found here: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/qwen3_30b_a3b.py#L55 https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/qwen3_235b_a22b.py#L55
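For illustration, a minimal sketch of how one of the linked recipes could be pulled in and launched with NeMo-Run; the argument names (dir, name, num_nodes, num_gpus_per_node) and the executor settings are assumptions based on the usual NeMo 2.0 recipe pattern, so check the linked files for the exact API in your container version.

```python
# Minimal sketch (not from this thread): launching Qwen3-30B-A3B pretraining
# from the linked recipe with NeMo-Run. Argument names and executor settings
# are assumptions based on the common NeMo 2.0 recipe pattern; verify against
# the linked recipe file for your container version.
import nemo_run as run
from nemo.collections.llm.recipes import qwen3_30b_a3b

recipe = qwen3_30b_a3b.pretrain_recipe(
    dir="/checkpoints/qwen3_30b_a3b",   # checkpoint/log directory (illustrative path)
    name="qwen3_30b_a3b_pretrain",
    num_nodes=1,
    num_gpus_per_node=8,
)

if __name__ == "__main__":
    # Single-node example; swap in a Slurm executor for multi-node runs.
    run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))
```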

cuichenx avatar Jun 11 '25 21:06 cuichenx

Great! Thanks a lot! Could you please enlighten me a bit on this: how would I modify the qwen3_30b_a3b recipe to pretrain a qwen3_12b_a1b from scratch? Thanks again!

tjoymeed avatar Jun 12 '25 18:06 tjoymeed

Okay, when will stage 3 pretraining be fully supported? Thanks!

Is it coming out soon? Anxiously awaiting it...

Pai-Megatron-Patch has stage 3. Maybe there is a way to integrate the two?

tjoymeed avatar Jun 12 '25 19:06 tjoymeed

Great! Thanks a lot! Could you please enlighten me a bit on this: how would I modify the qwen3_30b_a3b recipe to pretrain a qwen3_12b_a1b from scratch? Thanks again!

There is no definitive answer to this, but you can check out the difference between 235b_a22b and 30b_a3b -- these parameters are downsized: num_layers, hidden_size, num_attention_heads, moe_ffn_hidden_size.

Of course it's not guaranteed that downsizing these further to create 12b-a1b would create a model that still converges. Deep learning is an art :)
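To make that concrete, here is a hypothetical sketch of overriding those fields on the 30b_a3b recipe. It assumes the model config is reachable at recipe.model.config, and the values for a made-up 12b_a1b target are illustrative only, with no convergence guarantee.

```python
# Hypothetical sketch: shrinking the qwen3_30b_a3b recipe toward a smaller
# "12b_a1b" model. Field names mirror the parameters mentioned above; the
# values are illustrative guesses, not a validated configuration.
from nemo.collections.llm.recipes import qwen3_30b_a3b

recipe = qwen3_30b_a3b.pretrain_recipe(name="qwen3_12b_a1b", num_nodes=1, num_gpus_per_node=8)

cfg = recipe.model.config        # assumed location of the Qwen3 MoE model config
cfg.num_layers = 32              # fewer transformer layers
cfg.hidden_size = 1536           # narrower hidden dimension
cfg.num_attention_heads = 16     # scaled down with hidden_size
cfg.moe_ffn_hidden_size = 512    # smaller per-expert FFN, so fewer active params
```

Whether such a configuration trains stably would still have to be verified empirically, as noted above.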

Okay, when will stage 3 pretraining be fully supported? Thanks!

Is it coming out soon? Anxiously awaiting it...

Pai-Megatron-Patch has stage 3. Maybe there is a way to integrate the two?

Sorry, I don't have any intel on when it will come out. I'll let you know once I know more.

cuichenx avatar Jun 12 '25 21:06 cuichenx

Hi, we are targeting support for YaRN and other long-context features in the next release, NeMo 25.09. Currently they are not supported.

BoxiangW avatar Jun 12 '25 23:06 BoxiangW

Thanks! That means Sept of 2025? If so, that's too late. Could you please give some pointers on integrating Pai-Megatron-Patch's stage 3 pretraining feature into NeMo-Megatron, since Pai-Megatron-Patch already has stage 3? Thanks again!

tjoymeed avatar Jun 12 '25 23:06 tjoymeed

Yes, Sept of 2025. We're not familiar with Pai-Megatron-Patch. You can ask in that repo for pointers.

cuichenx avatar Jun 12 '25 23:06 cuichenx

We're working on better long context training support right now, but I don't have any near term ETA to share with you at this time.

Qwen3 MoE recipes can be found here: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/qwen3_30b_a3b.py#L55 https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/qwen3_235b_a22b.py#L55

Thanks a lot!

Are the features of this architecture fully supported in NeMo-Megatron now, e.g. the global load-balancing routing?

Moreover, do you have token-efficiency and throughput statistics for pretraining the Qwen3-30B-A3B architecture from scratch? Do you have test results? Especially with an active expert count of k=8, are the compute requirements 8-fold or not?

Thanks again!

tjoymeed avatar Jun 17 '25 06:06 tjoymeed

I used:

enroot import docker://nvcr.io/nvidia/nemo:25.04
enroot import docker://nvcr.io/nvidia/nemo:dev

Neither of them has the Qwen3 recipes inside.

I then tried:

enroot import docker://nvcr.io/nvidia/nemo:nightly
enroot import docker://nvcr.io/nvidia/nemo:latest

But got 404 errors:

[INFO] Querying registry for permission grant
[INFO] Authenticating with user: $oauthtoken
[INFO] Using credentials from file: /home/mp/.config/enroot/.credentials
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[ERROR] URL https://nvcr.io/v2/nvidia/nemo/manifests/nightly returned error code: 404 Not Found

What's wrong?

tjoymeed avatar Jun 30 '25 18:06 tjoymeed

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jul 31 '25 02:07 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Aug 07 '25 02:08 github-actions[bot]