ZhiyuLi-Nvidia

Results: 16 comments of ZhiyuLi-Nvidia

@akoumpa As discussed offline:
- qwen3 (Qwen3ForCausalLM) --> fails with SP
- qwen2 (Qwen2ForCausalLM) --> works with SP

We can either keep qwen2 or bypass the following functional test: https://github.com/NVIDIA-NeMo/Automodel/blob/main/tests/functional_tests/hf_transformer_finetune/L2_HF_Transformer_PEFT_Benchmark_qwen2_custom.sh

https://github.com/NVIDIA-NeMo/RL/pull/1557 should fix the issue simply by using the commit in https://github.com/NVIDIA-NeMo/Automodel/pull/804/commits/282aca0927a5f017acf9c7e577075efce6145b26

The llama model seems to have a different error message:
```
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 237, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 135, in apply_rotary_pos_emb...
```
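For context, `apply_rotary_pos_emb` is the standard RoPE application; a rough, simplified paraphrase of the transformers llama code (shapes below are illustrative assumptions, not taken from the failing run) looks like this:

```
import torch

def rotate_half(x):
    # Split the head dimension in half and rotate: (x1, x2) -> (-x2, x1).
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    # cos/sin are per-position tables; unsqueeze so they broadcast over heads.
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)

# Illustrative shapes: [batch, heads, seq, head_dim] for q/k, [batch, seq, head_dim] for cos/sin.
# With sequence parallelism, the sequence dimension of q/k and of cos/sin
# has to be sliced/sharded consistently for these elementwise products to broadcast.
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)
cos = torch.randn(2, 16, 64)
sin = torch.randn(2, 16, 64)
q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)
```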

* sharding annotation mismatch issue: `RuntimeError: ('Attempted to flatten sharded dimension 1, ', 'but only the leftmost dim of a Flatten can be sharded.')`
  - https://github.com/NVIDIA-NeMo/RL/pull/1557 should address sharding annotation...

Muon has achieved the following empirical results:
- Improved the speed record for training to 94% accuracy on CIFAR-10 from [3.3 to 2.6 A100-seconds.](https://x.com/kellerjordan0/status/1855675023916499249)
- Improved the speed record for...
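For reference, the core of the Muon update is an approximate orthogonalization of each 2D momentum matrix via a quintic Newton-Schulz iteration. A minimal sketch under that assumption (coefficients follow the public Muon implementation; the optimizer wiring around it is hypothetical and simplified):

```
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration approximating the nearest semi-orthogonal
    # matrix; bfloat16 casting and other production details are omitted here.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)            # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * A @ A
        x = a * x + B @ x
    return x.T if transposed else x

# Hypothetical usage: orthogonalize the momentum of a 2D weight before applying the update.
w = torch.randn(256, 512, requires_grad=True)
momentum = torch.zeros_like(w)
(w ** 2).sum().backward()
momentum = 0.95 * momentum + w.grad
w.data -= 0.02 * newton_schulz_orthogonalize(momentum)
```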

> [@ZhiyuLi-Nvidia](https://github.com/ZhiyuLi-Nvidia) is this something you can review ? Sure.

@cmunley1 which version or branch were you using? There's a recent fix for a memory leak relevant to YARN: https://github.com/NVIDIA-NeMo/RL/pull/1163

Thank you @guyueh1. I have tried with the updated mcore version, haven't seen any CPU memory leak in reproduction, and shared that with @cmunley1 on Monday: I have bumped up...

I have tried, but the CPU memory increase is almost negligible?
* branch: https://github.com/NVIDIA-NeMo/RL/compare/main...zhiyul/oom_repro_w_cpu_profiler
* change on top of the guide: https://github.com/NVIDIA-NeMo/RL/compare/f67ccd9e9cf7e2c1b30c23b6cb2c305bf1dfff36...zhiyul/oom_repro_w_cpu_profiler

What's new:
* added profiler feature to track...
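The profiler change itself is not shown in this snippet; a minimal sketch of the kind of per-step CPU RSS tracking it refers to (using `psutil`; function name and threshold are hypothetical) could look like:

```
import psutil

def log_cpu_rss(step, note=""):
    # Resident set size (RSS) of the current process, in GiB.
    rss_gib = psutil.Process().memory_info().rss / 1024**3
    print(f"[step {step}] cpu_rss={rss_gib:.2f} GiB {note}")
    return rss_gib

# Hypothetical usage in a training loop: log RSS each step and flag sustained growth.
baseline = log_cpu_rss(step=0, note="baseline")
for step in range(1, 4):
    # ... training step / checkpoint save would go here ...
    current = log_cpu_rss(step)
    if current - baseline > 1.0:  # flag growth beyond ~1 GiB over baseline
        print(f"possible CPU memory leak: +{current - baseline:.2f} GiB since baseline")
```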

What we found:
* an NCCL collective timeout when scaling up to 64 nodes
* save time increases with the number of nodes, taking ~10 min to save a 30B model
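On the NCCL collective timeout specifically, one common mitigation (not necessarily what was done here) is to raise the process-group timeout so that slow collectives around long checkpoint saves don't abort the job. A minimal sketch, assuming the usual torchrun environment variables are set:

```
from datetime import timedelta

import torch.distributed as dist

# Assumes torchrun has set RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT and a GPU is bound.
# A larger timeout gives collectives more headroom during multi-minute checkpoint
# saves at 64-node scale before NCCL declares the collective hung.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),  # tune per workload; the default is much shorter
)
```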