Update tools/launch to request only the number of GPUs a job uses, to avoid the Idle Job Reaper
Describe the bug
Some jobs in our nightly tests (e.g. vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1) get killed by the Idle Job Reaper because tools/launch always requests 8 GPUs, even when the run uses fewer. The unused GPUs sit idle, so the reaper treats the job as idle and kills it.
https://github.com/NVIDIA-NeMo/RL/blob/fa379fffbc9c5580301fa748dbba269c7d90f883/tools/launch#L165
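One possible direction, sketched below in Python purely for illustration: derive the GPU count from the test name instead of hard-coding 8. This assumes the `1n2g` fragment in the test name means "1 node, 2 GPUs per node"; that reading, the helper name `gpus_per_node`, and the regex are assumptions made here, not something taken from tools/launch, which may obtain the value differently (for example from the test config).

```python
import re

# Assumed naming convention (hypothetical): "<N>n<G>g" = N nodes, G GPUs per node.
# The lookarounds keep the tag from matching inside a longer alphanumeric token.
_RESOURCE_TAG = re.compile(r"(?<![A-Za-z0-9])(\d+)n(\d+)g(?![A-Za-z0-9])")

# The issue states that 8 GPUs are always requested today, so use that as the fallback.
DEFAULT_GPUS_PER_NODE = 8


def gpus_per_node(job_name: str) -> int:
    """Return the GPUs-per-node count encoded in job_name, or the default if absent."""
    match = _RESOURCE_TAG.search(job_name)
    if match is None:
        return DEFAULT_GPUS_PER_NODE
    return int(match.group(2))
```

With this sketch, the failing test name above would yield a request for 2 GPUs, while names without a recognizable tag would keep the current behavior of requesting 8.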
Steps/Code to reproduce bug
Please list minimal steps or code snippet for us to be able to reproduce the bug.
A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.