Chen Cui
No, changing the wrapper alone would not work. You would need the underlying implementation to support sequence parallelism.
Hi, we support general pretraining (without reasoning or long context extension), as well as full and parameter-efficient finetuning.
All Qwen3 variants are supported, including 6 dense models and 2 MoE models.
Yes, what you described is correct.
We're working on better long-context training support right now, but I don't have a near-term ETA to share at this time. Qwen3 MoE recipes can be...
> Great! Thanks a lot! Could you please enlighten me a bit on this: how do I modify the qwen3_30b_a3b recipe to pretrain a qwen3_12b-a1b from scratch? Thanks again!

There...
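Sketching the sizing math may help here: in a Qwen3-style MoE, total parameters are driven mainly by layer count, expert count, and per-expert FFN width, while active parameters per token are driven by the router top-k (plus the always-on attention and embedding weights). The snippet below is a back-of-the-envelope sizing aid only; the field names are illustrative, not the actual recipe/config API, the 30B-A3B values are approximate, and any "12B-A1B" shape derived this way would still need to be validated.

```python
from dataclasses import dataclass

# Illustrative only: these field names are NOT the real recipe/config API.
# They mirror the hyperparameters that set total vs. active parameter count
# in a Qwen3-style MoE transformer.
@dataclass
class MoESizeSketch:
    num_layers: int
    hidden_size: int
    moe_ffn_hidden_size: int   # per-expert FFN width
    num_moe_experts: int       # total experts (drives total params)
    moe_router_topk: int       # experts activated per token (drives active params)
    vocab_size: int = 151_936  # Qwen3 tokenizer size

def rough_params(cfg: MoESizeSketch) -> tuple[float, float]:
    """Back-of-the-envelope (total, active) parameter counts in billions.

    Ignores norms, biases, and GQA, so expect roughly 10-20% error."""
    embed = 2 * cfg.vocab_size * cfg.hidden_size               # input + output embeddings
    attn = cfg.num_layers * 4 * cfg.hidden_size ** 2           # q, k, v, o projections
    expert = 3 * cfg.hidden_size * cfg.moe_ffn_hidden_size     # SwiGLU expert (gate/up/down)
    total = embed + attn + cfg.num_layers * cfg.num_moe_experts * expert
    active = embed + attn + cfg.num_layers * cfg.moe_router_topk * expert
    return total / 1e9, active / 1e9

# Approximately the published Qwen3-30B-A3B shape.
base = MoESizeSketch(num_layers=48, hidden_size=2048,
                     moe_ffn_hidden_size=768, num_moe_experts=128, moe_router_topk=8)

# One possible (unvalidated) smaller shape: fewer layers and experts, narrower
# experts, lower top-k. Pushing the active count all the way to ~1B would also
# require shrinking hidden_size / embeddings.
small = MoESizeSketch(num_layers=40, hidden_size=2048,
                      moe_ffn_hidden_size=512, num_moe_experts=80, moe_router_topk=4)

print(rough_params(base))   # roughly (30.4, 3.2)
print(rough_params(small))  # roughly (11.4, 1.8)
```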
Yes, Sept of 2025. We're not familiar with PAI-Megatron-Patch. You can ask in that repo for pointers.
on hold, awaiting update on #12960
I think adding `(batch*seq, 1, hidden)` and `(batch*seq, hidden)` won't just OOM -- it will not add at all, right? I'm not sure how you got there. When I tested...
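One cheap way to settle the shape question is to try the add with small stand-ins for `batch*seq` and `hidden` and inspect the outcome. The sketch below is standalone PyTorch, not the code under discussion:

```python
import torch

# Small stand-ins for batch*seq and hidden so the experiment is cheap.
N, H = 8, 16

a = torch.randn(N, 1, H)   # (batch*seq, 1, hidden)
b = torch.randn(N, H)      # (batch*seq, hidden)

try:
    out = a + b
    # If broadcasting applies, this prints the broadcast result shape;
    # note how it scales with N, which is what matters at real batch*seq.
    print("added, result shape:", tuple(out.shape))
except RuntimeError as err:
    # If the shapes are not broadcastable, the add fails outright.
    print("add failed:", err)
```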
Hi @adithya-s-k, Llama Nemotron VL was meant to be supported only in the special container, as noted in the documentation. The PR will not be merged. We have also...