
Are Qwen3 pretraining architectural features fully supported now?

Open tjoymeed opened this issue 6 months ago • 14 comments


Hi Team,

Thanks a lot for your excellent work!

Are Qwen3 pretraining architectural features fully supported now?

Could you please provide an architectural feature list with support status?

Thanks again!

tjoymeed avatar Jun 10 '25 20:06 tjoymeed

Hi, we support general pretraining (without reasoning or long context extension), as well as full and parameter-efficient finetuning.

cuichenx avatar Jun 10 '25 23:06 cuichenx

What about MoE pretraining?

tjoymeed avatar Jun 11 '25 00:06 tjoymeed

All Qwen 3 variants are supported, including 6 dense models and 2 MoE models.

cuichenx avatar Jun 11 '25 08:06 cuichenx

The Qwen 3 pretraining actually has 3 stages:

  1. pretraining for 4096 context length;
  2. pretraining for enhancing reasoning;
  3. pretraining for extending to long context.

So NeMo currently supports stage 1 (with all architectural features), but not stages 2 or 3?

Thanks!

tjoymeed avatar Jun 11 '25 09:06 tjoymeed

Yes, what you described is correct.

cuichenx avatar Jun 11 '25 11:06 cuichenx

Okay, when will stage 3 pretraining be fully supported? Thanks!

tjoymeed avatar Jun 11 '25 21:06 tjoymeed

All Qwen 3 variants are supported, including 6 dense models and 2 MoE models.

Could you please tell me where to find the recipe for Qwen3 MoE pretraining?

tjoymeed avatar Jun 11 '25 21:06 tjoymeed

We're working on better long context training support right now, but I don't have any near term ETA to share with you at this time.

Qwen3 MoE recipes can be found here: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/qwen3_30b_a3b.py#L55 https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/qwen3_235b_a22b.py#L55
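For illustration, a minimal sketch of how one of the linked recipes could be pulled in and launched with NeMo-Run; the argument names (dir, name, num_nodes, num_gpus_per_node) and the executor settings are assumptions based on the usual NeMo 2.0 recipe pattern, so check the linked files for the exact API in your container version.

```python
# Minimal sketch (not from this thread): launching Qwen3-30B-A3B pretraining
# from the linked recipe with NeMo-Run. Argument names and executor settings
# are assumptions based on the common NeMo 2.0 recipe pattern; verify against
# the linked recipe file for your container version.
import nemo_run as run
from nemo.collections.llm.recipes import qwen3_30b_a3b

recipe = qwen3_30b_a3b.pretrain_recipe(
    dir="/checkpoints/qwen3_30b_a3b",   # checkpoint/log directory (illustrative path)
    name="qwen3_30b_a3b_pretrain",
    num_nodes=1,
    num_gpus_per_node=8,
)

if __name__ == "__main__":
    # Single-node example; swap in a Slurm executor for multi-node runs.
    run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))
```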

cuichenx avatar Jun 11 '25 21:06 cuichenx

Great! Thanks a lot! Could you please enlighten me a bit on this: how would I modify the qwen3_30b_a3b recipe to pretrain a qwen3_12b_a1b from scratch? Thanks again!

tjoymeed avatar Jun 12 '25 18:06 tjoymeed

Okay, when will stage 3 pretraining be fully supported? Thanks!

Is it coming out soon? Anxiously awaiting it...

Pai-Megatron-Patch has stage 3. Maybe there is a way to integrate the two?

tjoymeed avatar Jun 12 '25 19:06 tjoymeed

Great! Thanks a lot! Could you please enlighten me a bit on this: how would I modify the qwen3_30b_a3b recipe to pretrain a qwen3_12b_a1b from scratch? Thanks again!

There is no definitive answer to this, but you can check out the difference between 235b_a22b and 30b_a3b -- these parameters are downsized: num_layers, hidden_size, num_attention_heads, moe_ffn_hidden_size.

Of course it's not guaranteed that downsizing these further to create 12b-a1b would create a model that still converges. Deep learning is an art :)
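To make that concrete, here is a hypothetical sketch of overriding those fields on the 30b_a3b recipe. It assumes the model config is reachable at recipe.model.config, and the values for a made-up 12b_a1b target are illustrative only, with no convergence guarantee.

```python
# Hypothetical sketch: shrinking the qwen3_30b_a3b recipe toward a smaller
# "12b_a1b" model. Field names mirror the parameters mentioned above; the
# values are illustrative guesses, not a validated configuration.
from nemo.collections.llm.recipes import qwen3_30b_a3b

recipe = qwen3_30b_a3b.pretrain_recipe(name="qwen3_12b_a1b", num_nodes=1, num_gpus_per_node=8)

cfg = recipe.model.config        # assumed location of the Qwen3 MoE model config
cfg.num_layers = 32              # fewer transformer layers
cfg.hidden_size = 1536           # narrower hidden dimension
cfg.num_attention_heads = 16     # scaled down with hidden_size
cfg.moe_ffn_hidden_size = 512    # smaller per-expert FFN, so fewer active params
```

Whether such a configuration trains stably would still have to be verified empirically, as noted above.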

Okay, when will stage 3 pretraining be fully supported? Thanks!

Is it coming out soon? Anxiously awaiting it...

Pai-Megatron-Patch has stage 3. Maybe there is a way to integrate the two?

Sorry, I don't have any intel on when it will come out. I'll let you know once I know more.

cuichenx avatar Jun 12 '25 21:06 cuichenx

Hi, we are targeting support for YaRN and other long-context features in the next release, NeMo 25.09. Currently they are not supported.

BoxiangW avatar Jun 12 '25 23:06 BoxiangW

Thanks! That means Sept of 2025? If so, that's too late. Could you please give some pointers on integrating Pai-Megatron-Patch's stage 3 pretraining feature into NeMo-Megatron, since Pai-Megatron-Patch already has stage 3? Thanks again!

tjoymeed avatar Jun 12 '25 23:06 tjoymeed

Yes, Sept of 2025. We're not familiar with Pai-Megatron-Patch. You can ask in that repo for pointers.

cuichenx avatar Jun 12 '25 23:06 cuichenx

We're working on better long context training support right now, but I don't have any near term ETA to share with you at this time.

Qwen3 MoE recipes can be found here: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/qwen3_30b_a3b.py#L55 https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/qwen3_235b_a22b.py#L55

Thanks a lot!

Are the features of this architecture fully supported in NeMo-Megatron now, e.g. the global load-balancing routing?

Moreover, do you have token-efficiency and throughput statistics for pretraining the Qwen3-30B-A3B architecture from scratch? Do you have test results? Especially with an active expert count of k=8, are the compute requirements 8-fold or not?

Thanks again!

tjoymeed avatar Jun 17 '25 06:06 tjoymeed

I used:

enroot import docker://nvcr.io/nvidia/nemo:25.04
enroot import docker://nvcr.io/nvidia/nemo:dev

Neither of them has the Qwen3 recipes inside.

I then tried:

enroot import docker://nvcr.io/nvidia/nemo:nightly
enroot import docker://nvcr.io/nvidia/nemo:latest

But got 404 errors:

[INFO] Querying registry for permission grant
[INFO] Authenticating with user: $oauthtoken
[INFO] Using credentials from file: /home/mp/.config/enroot/.credentials
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[ERROR] URL https://nvcr.io/v2/nvidia/nemo/manifests/nightly returned error code: 404 Not Found

What's wrong?

tjoymeed avatar Jun 30 '25 18:06 tjoymeed

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jul 31 '25 02:07 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Aug 07 '25 02:08 github-actions[bot]