guyueh1 comments

Results: 29 comments by guyueh1

[peft] align adapter output shape with wrapped module output shape

@cuichenx This change resolves an OOM issue in our testing. For now I only understand the symptom, which is: `linear_output` returns a tensor of shape `(batch*seq, 1,...

Remove fp8 model init context because it is handled by MCORE

This PR needs further work to handle both cases, `mcore_gpt` = True and False. Changing this to draft until that's fixed.

Remove fp8 model init context because it is handled by MCORE

@timmoon10 can you review?

Qwen3-30B-A3B: Checkpoint Save Failures on Large-Scale GPU Configurations (H100) and Small-Scale GB200 Systems

@ZhiyuLi-Nvidia any new updates on the NCCL failure at 64 nodes? Do you think https://github.com/NVIDIA-NeMo/RL/issues/1208#issuecomment-3349433766 can help accelerate the checkpointing time, and do you think the NCCL timeout is due...

Qwen3-30B-A3B: Checkpoint Save Failures on Large-Scale GPU Configurations (H100) and Small-Scale GB200 Systems

@ZhiyuLi-Nvidia is the mcore change gonna be a PR or already merged?

Fix for Squad Dataset Download

I have a few questions:
* Where are these added functions used?
* Is the purpose to avoid downloading SQuAD from the network and use cached files?
* If yes,...

Fix for Squad Dataset Download

@rhmukundan I see. Can you provide an example of how to use it in a run script?

Fix for Squad Dataset Download

@malay-nagda can you review the added utility functions in `util.py`?

Fix for Squad Dataset Download

@rhmukundan please fix the merge conflict. To fix the DCO check, you need to sign off every commit and force-push again, overwriting the current commits (refer to the "rebase the branch" instructions at https://github.com/NVIDIA/NeMo/pull/13893/checks?check_run_id=44185194908).

Fix for Squad Dataset Download

@rhmukundan is this ready for merge?