ray.sub requires submission from NeMo RL home directory, blocking external workflow organization
Describe the bug
The ray.sub script currently requires users to submit SLURM jobs from the NeMo RL repository directory, which prevents users from organizing their training scripts and experiments in separate project directories. This creates workflow friction and forces tight coupling between framework code and user code.
Impact
Users cannot maintain their experiment code separate from the framework installation without workarounds. This creates friction for:
- Teams using separate version control for experiments
- Users running multiple projects with one NeMo RL installation
@cwing-nvidia I'd like to understand this use case more. Could you describe the ideal state here? Does an experiment involve a code change or is it just a hyperparameter change? What does "projects" mean here?
Let me also list out some patterns; let me know what's not covered here and we can figure out next steps:
- Assuming a user wants to use `nemo-rl` OOTB with no modifications: you can build our container and, when you launch your nemo-rl run, first `cd /opt/nemo-rl` and then launch from that directory. This assumes you do not mount any external code in; you can set the checkpoint/log dir somewhere outside, and the "launch command" can be stored in some experiment dir.
- Another pattern we use is code snapshots (see this util for reference: https://github.com/NVIDIA-NeMo/RL/blob/main/tools/code_snapshot.sh). Since an experiment may require changing code, it is not safe to modify a local nemo-rl: if you have multiple queued jobs launched with `sbatch ray.sub`, you won't know what state of the code each job will launch with, since a job runs with whatever the state of nemo-rl is at launch time. For this reason, some users create snapshots via a mechanism similar to `code_snapshot.sh` before launching so that their code is frozen at the time of launch (a simplified sketch of this idea follows below).
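For illustration, here is a minimal sketch of that snapshot-then-submit idea. It is not the actual `code_snapshot.sh` (use that tool as the reference); the `SNAPSHOT_ROOT` location, the rsync exclude, and the submission variables below are assumptions that mirror the examples later in this thread:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: freeze a copy of nemo-rl, then submit against the frozen copy.
set -euo pipefail

LOCAL_NEMORL_PATH=${LOCAL_NEMORL_PATH:-$HOME/nemo-rl}     # your local clone (assumed path)
SNAPSHOT_ROOT=${SNAPSHOT_ROOT:-$HOME/nemo-rl-snapshots}   # where frozen copies live (assumed path)
SNAPSHOT_DIR="$SNAPSHOT_ROOT/$(date +%Y%m%d-%H%M%S)"

# Copy the repo so later edits to $LOCAL_NEMORL_PATH don't affect already-queued jobs.
mkdir -p "$SNAPSHOT_DIR"
rsync -a --exclude '.git' "$LOCAL_NEMORL_PATH/" "$SNAPSHOT_DIR/"

# Submit from the frozen copy; every queued job sees exactly this code state.
cd "$SNAPSHOT_DIR"
COMMAND="uv run examples/run_sft.py" \
CONTAINER=$CONTAINER \
BASE_LOG_DIR=$SNAPSHOT_DIR/logs \
MOUNTS="/lustre:/lustre:ro,$SNAPSHOT_DIR:$SNAPSHOT_DIR" \
sbatch \
  --nodes=$NUM_ACTOR_NODES \
  --account=$ACCOUNT \
  --partition=$PARTITION \
  --time=$TIME \
  --gres=gpu:8 \
  ray.sub
```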
A user forked and cloned the NeMo Gym and NeMo RL repos. They would prefer to keep their training job scripts external to the NeMo RL repo directory, but they mentioned that the scripts fail to run if they are not in the NeMo RL root directory.
@bxyu-nvidia may have more context; he also noted this in the NeMo Gym documentation: https://github.com/NVIDIA-NeMo/Gym/blob/c5a1d26e83916a1e7b14bb651aea7ee5a0070cd8/docs/tutorials/rl-training-with-nemo-rl.md?plain=1#L146
@cwing-nvidia @bxyu-nvidia Here are some options that demonstrate not having to launch from inside the repo (this script lives outside the nemo-rl repo):
```bash
# outside, but use nemo-rl pre-baked
COMMAND="cd /opt/nemo-rl && uv run examples/run_sft.py" \
BASE_LOG_DIR=$OUTSIDE_LOGDIR/outside-prebaked-logs \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$PWD:$PWD" \
sbatch \
  --nodes=$NUM_ACTOR_NODES \
  --account=$ACCOUNT \
  --job-name=$ACCOUNT-rl:$(whoami)-ray-cluster-$(date +%N) \
  --partition=$PARTITION \
  --time=$TIME \
  --gres=gpu:8 \
  $LOCAL_NEMORL_PATH/ray.sub

# outside, but use my local nemo-rl
COMMAND="cd /opt/nemo-rl && uv run examples/run_sft.py" \
BASE_LOG_DIR=$OUTSIDE_LOGDIR/outside-local-logs \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$LOCAL_NEMORL_PATH:/opt/nemo-rl" \
sbatch \
  --nodes=$NUM_ACTOR_NODES \
  --account=$ACCOUNT \
  --job-name=$ACCOUNT-rl:$(whoami)-ray-cluster-$(date +%N) \
  --partition=$PARTITION \
  --time=$TIME \
  --gres=gpu:8 \
  $LOCAL_NEMORL_PATH/ray.sub

# outside, but cd into dir for invocation and don't override /opt/nemo-rl
cd $LOCAL_NEMORL_PATH
COMMAND="uv run examples/run_sft.py" \
BASE_LOG_DIR=$OUTSIDE_LOGDIR/outside-cd-logs \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$LOCAL_NEMORL_PATH:$LOCAL_NEMORL_PATH" \
sbatch \
  --nodes=$NUM_ACTOR_NODES \
  --account=$ACCOUNT \
  --partition=$PARTITION \
  --time=$TIME \
  --gres=gpu:8 \
  ray.sub
cd -
```
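In all three variants the script submitted to SLURM is `ray.sub` itself; the training command, container image, mounts, and log directory are passed via the `COMMAND`, `CONTAINER`, `MOUNTS`, and `BASE_LOG_DIR` environment variables, so the directory you submit from only affects how the path to `ray.sub` (and any relative paths you use) is resolved.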
Does your use case differ from this?
@cwing-nvidia @bxyu-nvidia can you take a look at Terry's solution?