Chen Cui
No, changing the wrapper alone would not work. You would need the underlying implementation to support sequence parallelism.
Hi, we support general pretraining (without reasoning or long context extension), as well as full and parameter-efficient finetuning.
All Qwen3 variants are supported, including 6 dense models and 2 MoE models.
Yes, what you described is correct.
We're working on better long-context training support right now, but I don't have a near-term ETA to share at this time. Qwen3 MoE recipes can be...
> Great! Thanks a lot! Could you please enlighten me a bit on this: how do I modify the qwen3_30b_a3b recipe to pretrain a qwen3_12b-a1b from scratch? Thanks again!

There...
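Sketching the sizing math may help here: in a Qwen3-style MoE, total parameters are driven mainly by layer count, expert count, and per-expert FFN width, while active parameters per token are driven by the router top-k (plus the always-on attention and embedding weights). The snippet below is a back-of-the-envelope sizing aid only; the field names are illustrative, not the actual recipe/config API, the 30B-A3B values are approximate, and any "12B-A1B" shape derived this way would still need to be validated.

```python
from dataclasses import dataclass

# Illustrative only: these field names are NOT the real recipe/config API.
# They mirror the hyperparameters that set total vs. active parameter count
# in a Qwen3-style MoE transformer.
@dataclass
class MoESizeSketch:
    num_layers: int
    hidden_size: int
    moe_ffn_hidden_size: int   # per-expert FFN width
    num_moe_experts: int       # total experts (drives total params)
    moe_router_topk: int       # experts activated per token (drives active params)
    vocab_size: int = 151_936  # Qwen3 tokenizer size

def rough_params(cfg: MoESizeSketch) -> tuple[float, float]:
    """Back-of-the-envelope (total, active) parameter counts in billions.

    Ignores norms, biases, and GQA, so expect roughly 10-20% error."""
    embed = 2 * cfg.vocab_size * cfg.hidden_size               # input + output embeddings
    attn = cfg.num_layers * 4 * cfg.hidden_size ** 2           # q, k, v, o projections
    expert = 3 * cfg.hidden_size * cfg.moe_ffn_hidden_size     # SwiGLU expert (gate/up/down)
    total = embed + attn + cfg.num_layers * cfg.num_moe_experts * expert
    active = embed + attn + cfg.num_layers * cfg.moe_router_topk * expert
    return total / 1e9, active / 1e9

# Approximately the published Qwen3-30B-A3B shape.
base = MoESizeSketch(num_layers=48, hidden_size=2048,
                     moe_ffn_hidden_size=768, num_moe_experts=128, moe_router_topk=8)

# One possible (unvalidated) smaller shape: fewer layers and experts, narrower
# experts, lower top-k. Pushing the active count all the way to ~1B would also
# require shrinking hidden_size / embeddings.
small = MoESizeSketch(num_layers=40, hidden_size=2048,
                      moe_ffn_hidden_size=512, num_moe_experts=80, moe_router_topk=4)

print(rough_params(base))   # roughly (30.4, 3.2)
print(rough_params(small))  # roughly (11.4, 1.8)
```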
Yes, Sept of 2025. We're not familiar with PAI-Megatron-Patch. You can ask in that repo for pointers.
on hold, awaiting update on #12960
I think adding `(batch*seq, 1, hidden)` and `(batch*seq, hidden)` won't just OOM -- it will not add at all, right? I'm not sure how you got there. When I tested...
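One cheap way to settle the shape question is to try the add with small stand-ins for `batch*seq` and `hidden` and inspect the outcome. The sketch below is standalone PyTorch, not the code under discussion:

```python
import torch

# Small stand-ins for batch*seq and hidden so the experiment is cheap.
N, H = 8, 16

a = torch.randn(N, 1, H)   # (batch*seq, 1, hidden)
b = torch.randn(N, H)      # (batch*seq, hidden)

try:
    out = a + b
    # If broadcasting applies, this prints the broadcast result shape;
    # note how it scales with N, which is what matters at real batch*seq.
    print("added, result shape:", tuple(out.shape))
except RuntimeError as err:
    # If the shapes are not broadcastable, the add fails outright.
    print("add failed:", err)
```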
Hi @adithya-s-k, Llama Nemotron VL was meant to be supported only in the special container, as noted in the documentation. The PR will not be merged. We have also...