Chen Cui

30 comments from Chen Cui

No, changing the wrapper alone would not work. You would need the underlying implementation to support sequence parallelism.
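
For context, a minimal sketch of why a wrapper-level change isn't enough under Megatron-style sequence parallelism (plain PyTorch in a single process; `world_size` and the shapes are purely illustrative): each rank only holds a slice of the sequence dimension, so any op that mixes information across the sequence needs explicit communication inside the underlying implementation.

```python
import torch

world_size = 4                      # illustrative number of ranks
seq, batch, hidden = 16, 2, 8       # illustrative shapes
x = torch.randn(seq, batch, hidden)

# Shard activations along the sequence dimension, as sequence parallelism does.
shards = torch.chunk(x, world_size, dim=0)
local = shards[0]                   # what one rank sees: (seq/world_size, batch, hidden)

# Elementwise ops (LayerNorm, dropout, residual adds) can run on the local shard.
local_out = torch.nn.functional.layer_norm(local, (hidden,))

# Anything that mixes information across the sequence (e.g. attention) needs the
# full sequence back; a real implementation must all-gather / reduce-scatter here,
# which is exactly the part a wrapper on top cannot add by itself.
full = torch.cat(shards, dim=0)     # single-process stand-in for that all-gather
assert full.shape == x.shape
```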

Hi, we support general pretraining (without reasoning or long context extension), as well as full and parameter-efficient finetuning.

All Qwen3 variants are supported, including 6 dense models and 2 MoE models.

We're working on better long-context training support right now, but I don't have a near-term ETA to share at this time. Qwen3 MoE recipes can be...

> Great! Thanks a lot! Could you please enlighten me a bit on this: how do I modify the qwen3_30b_a3b recipe to pretrain a qwen3_12b-a1b from scratch? Thanks again! There...

Yes, September 2025. We're not familiar with PAI-Megatron-Patch. You can ask in that repo for pointers.

I think adding `(batch*seq, 1, hidden)` and `(batch*seq, hidden)` won't just OOM -- it will not add at all, right? I'm not sure how you got there. When I tested...
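
For reference, a minimal way to check what broadcasting actually does with these two shapes (assuming plain PyTorch tensors; `bs` and `hidden` are small stand-ins for `batch*seq` and `hidden`), covering both the out-of-place and in-place cases without presupposing the outcome:

```python
import torch

bs, hidden = 8, 4
a = torch.randn(bs, 1, hidden)   # stands in for (batch*seq, 1, hidden)
b = torch.randn(bs, hidden)      # stands in for (batch*seq, hidden)

try:
    out = a + b                  # broadcasting rules decide whether and how this adds
    print("out-of-place add produced shape:", tuple(out.shape))
except RuntimeError as e:
    print("out-of-place add failed:", e)

try:
    a.add_(b)                    # in-place: the broadcast result must fit a's existing shape
    print("in-place add succeeded with shape:", tuple(a.shape))
except RuntimeError as e:
    print("in-place add failed:", e)
```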

Hi @adithya-s-k, Llama Nemotron VL was only meant to be supported in the special container, as noted in the documentation. The PR will not be merged. We have also...