Update tools/launch to take into account number of gpus to avoid reaper

Open yfw opened this issue 1 month ago • 0 comments

Describe the bug

Some jobs in our nightly tests (e.g. vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1) get killed by the Idle Job Reaper because we always request 8 GPUs, even when the run uses fewer.

https://github.com/NVIDIA-NeMo/RL/blob/fa379fffbc9c5580301fa748dbba269c7d90f883/tools/launch#L165
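
One possible direction, sketched here in Python purely for illustration: derive the GPU request from the test name instead of hard-coding 8. This assumes that a token such as `1n2g` in the name means 1 node with 2 GPUs per node, which the issue does not state explicitly. The function `parse_topology` and its defaults are hypothetical and are not part of `tools/launch`; the real script would need the equivalent logic in its own language.

```python
import re

# Assumption: test names carry a "<nodes>n<gpus>g" token, e.g. "1n2g".
TOPOLOGY_TOKEN = re.compile(r"(\d+)n(\d+)g")


def parse_topology(job_name, default_nodes=1, default_gpus=8):
    """Return (nodes, gpus_per_node) parsed from a test name.

    Hypothetical helper, not part of tools/launch. Falls back to the
    defaults (the current 8-GPU request) when no token is found.
    """
    # Names are dash-separated, so check each token on its own.
    for token in job_name.split("-"):
        match = TOPOLOGY_TOKEN.fullmatch(token)
        if match:
            return int(match.group(1)), int(match.group(2))
    return default_nodes, default_gpus


nodes, gpus = parse_topology(
    "vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1"
)
print(nodes, gpus)
```

With a value like this, the launcher could request only the GPUs the run actually uses, so that no requested GPU sits idle.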

Steps/Code to reproduce bug

Please list minimal steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

Expected behavior

A clear and concise description of what you expected to happen.

Additional context

Add any other context about the problem here.

yfw, Nov 27 '25 00:11