
ray.sub requires submission from NeMo RL home directory, blocking external workflow organization

Open cwing-nvidia opened this issue 1 month ago • 3 comments

Describe the bug

The ray.sub script currently requires users to submit SLURM jobs from the NeMo RL repository directory, which prevents users from organizing their training scripts and experiments in separate project directories. This creates workflow friction and forces tight coupling between framework code and user code.

Impact

Users cannot maintain their experiment code separate from the framework installation without workarounds. This creates friction for:

  • Teams using separate version control for experiments
  • Users running multiple projects with one NeMo RL installation

cwing-nvidia avatar Nov 20 '25 02:11 cwing-nvidia

@cwing-nvidia I'd like to understand this use case more. Could you describe the ideal state here? Does an experiment involve a code change or is it just a hyperparameter change? What does "projects" mean here?

Let me also list out some patterns; I'd like to know what's not covered here so we can figure out next steps:

  1. Assuming a user wants to use nemo-rl OOTB w/ no modifications, you can build our container and, when you launch your nemo-rl run, first cd /opt/nemo-rl and then launch from that directory. This assumes you do not mount any external code in; you can set the checkpoint/log dir somewhere outside and keep the "launch command" in some experiment dir.
  2. Another pattern we use is code snapshots (see this util for reference https://github.com/NVIDIA-NeMo/RL/blob/main/tools/code_snapshot.sh). Since an experiment may require changing code, it is not safe to modify a local nemo-rl: if you have multiple queued jobs launched with sbatch ray.sub, you won't know what state of the code each job will launch with, since a job picks up whatever state nemo-rl is in at the time it starts. For this reason, some users create snapshots via a mechanism similar to code_snapshot.sh before launching, so that their code is frozen at the time of launch; a sketch of that idea follows this list.
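
For illustration, here is a minimal sketch of that snapshot idea, assuming a local checkout at $LOCAL_NEMORL_PATH and the usual COMMAND/CONTAINER/MOUNTS variables that ray.sub consumes in the examples further down this thread (variable names and paths are placeholders; this is not the actual code_snapshot.sh interface):

# Freeze the working tree into a timestamped directory so later edits can't affect queued jobs
SNAPSHOT_DIR=$HOME/snapshots/nemo-rl-$(date +%Y%m%d-%H%M%S)   # hypothetical location
mkdir -p $SNAPSHOT_DIR
rsync -a --exclude '.git' $LOCAL_NEMORL_PATH/ $SNAPSHOT_DIR/

# Submit from the frozen copy; the live checkout can keep changing without touching this job
cd $SNAPSHOT_DIR
COMMAND="uv run examples/run_sft.py" \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$SNAPSHOT_DIR:$SNAPSHOT_DIR" \
sbatch --nodes=$NUM_ACTOR_NODES --account=$ACCOUNT --partition=$PARTITION --time=$TIME --gres=gpu:8 ray.sub
cd -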

terrykong avatar Nov 20 '25 05:11 terrykong

A user forked and cloned the NeMo Gym and NeMo RL repos. They would prefer to keep their training job scripts external to the NeMo RL repo directory, but they mentioned that running the scripts doesn't work if the scripts are not in the NeMo RL root directory.

@bxyu-nvidia may have more context, he noted this in NeMo Gym documentation here too https://github.com/NVIDIA-NeMo/Gym/blob/c5a1d26e83916a1e7b14bb651aea7ee5a0070cd8/docs/tutorials/rl-training-with-nemo-rl.md?plain=1#L146
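
For concreteness, the desired layout looks roughly like this (names are illustrative, not from the user's report): the experiment directory holds only the launch script and config overrides, and the NeMo RL clone lives elsewhere, referenced by path rather than being the submission CWD:

my-experiments/exp001/launch.sh       # sets COMMAND/MOUNTS/etc. and calls sbatch <path-to-nemo-rl>/ray.sub
my-experiments/exp001/overrides.yaml  # hyperparameter/config overrides for this run
~/src/nemo-rl/                        # separate clone; today it must also be the CWD at submission time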

cwing-nvidia avatar Nov 20 '25 18:11 cwing-nvidia

@cwing-nvidia @bxyu-nvidia Here are some options that demonstrate not having to launch from inside the repo (this script itself lives outside the nemo-rl repo):

# outside, but use nemo-rl pre-baked
COMMAND="cd /opt/nemo-rl && uv run examples/run_sft.py" \
BASE_LOG_DIR=$OUTSIDE_LOGDIR/outside-prebaked-logs \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$PWD:$PWD" \
sbatch \
    --nodes=$NUM_ACTOR_NODES \
    --account=$ACCOUNT \
    --job-name=$ACCOUNT-rl:$(whoami)-ray-cluster-$(date +%N) \
    --partition=$PARTITION \
    --time=$TIME \
    --gres=gpu:8 \
    $LOCAL_NEMORL_PATH/ray.sub

# outside, but use my local nemo-rl
COMMAND="cd /opt/nemo-rl && uv run examples/run_sft.py" \
BASE_LOG_DIR=$OUTSIDE_LOGDIR/outside-local-logs \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$LOCAL_NEMORL_PATH:/opt/nemo-rl" \
sbatch \
    --nodes=$NUM_ACTOR_NODES \
    --account=$ACCOUNT \
    --job-name=$ACCOUNT-rl:$(whoami)-ray-cluster-$(date +%N) \
    --partition=$PARTITION \
    --time=$TIME \
    --gres=gpu:8 \
    $LOCAL_NEMORL_PATH/ray.sub

# outside, but cd into dir for invocation and don't override /opt/nemo-rl
cd $LOCAL_NEMORL_PATH
COMMAND="uv run examples/run_sft.py" \
BASE_LOG_DIR=$OUTSIDE_LOGDIR/outside-cd-logs \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$LOCAL_NEMORL_PATH:$LOCAL_NEMORL_PATH" \
sbatch \
    --nodes=$NUM_ACTOR_NODES \
    --account=$ACCOUNT \
    --job-name=$ACCOUNT-rl:$(whoami)-ray-cluster-$(date +%N) \
    --partition=$PARTITION \
    --time=$TIME \
    --gres=gpu:8 \
    ray.sub
cd -

Does your use case differ from this?

terrykong avatar Nov 20 '25 23:11 terrykong

@cwing-nvidia @bxyu-nvidia can you take a look at Terry's solution?

snowmanwwg avatar Dec 03 '25 04:12 snowmanwwg