guyueh1
@cuichenx This change resolves an OOM issue in our testing. For now I only understand the symptom, which is: `linear_output` returns a tensor of shape `(batch*seq, 1,...
This PR needs further work to handle both cases: `mcore_gpt = True` and `mcore_gpt = False`. Changing this to draft until that's fixed.
@timmoon10 can you review?
@ZhiyuLi-Nvidia any new updates on the NCCL failure at 64 nodes? Do you think https://github.com/NVIDIA-NeMo/RL/issues/1208#issuecomment-3349433766 can help reduce the checkpointing time, and do you think the NCCL timeout is due...
@ZhiyuLi-Nvidia is the mcore change going to be a PR, or is it already merged?
I have a few questions:
* Where are these added functions used?
* Is the purpose to avoid downloading squad from the network and use cached files?
* if yes,...
@rhmukundan I see. Can you provide an example of how to use it in a run script?
@malay-nagda can you review the added utility functions in `util.py`?
@rhmukundan please fix the merge conflict. To fix the DCO check, you need to sign off every commit and force-push again, overwriting the current commits (refer to "rebase the branch" at https://github.com/NVIDIA/NeMo/pull/13893/checks?check_run_id=44185194908).
@rhmukundan is this ready for merge?