Which configuration file do I need to modify so that other computers can access it via https://xxx.xxx.xx.xx:7200?
@cwing-nvidia I'd like to understand this use case more. Could you describe the ideal state here? Does an experiment involve a code change or is it just a hyperparameter change? What does "projects" mean here?
Let me also list out some patterns; I'd like to know what's not covered here so we can figure out next steps:
- Assuming a user wants to use nemo-rl OOTB with no modifications: you can build our container and, when you launch your nemo-rl run, first `cd /opt/nemo-rl` and then launch from that directory. This assumes you do not mount any external code in; you can set the checkpoint/log dir somewhere outside, and the "launch command" can be stored in some experiment dir.
- Another pattern we use is code snapshots (see this util for reference: https://github.com/NVIDIA-NeMo/RL/blob/main/tools/code_snapshot.sh). Since an experiment may require changing code, it is not safe to modify a local nemo-rl checkout: if you have multiple queued jobs launched with `sbatch ray.sub`, you won't know which state of the code each job will run, since a job launches with whatever state nemo-rl is in when it starts. For this reason, some users create snapshots via a mechanism similar to `code_snapshot.sh` before launching, so that their code is frozen at launch time (a minimal sketch of this pattern is shown right after this list).
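For illustration, here is a minimal, hypothetical sketch of the snapshot pattern (this is not the real `code_snapshot.sh`; the `SNAPSHOT_ROOT`/`SNAPSHOT_DIR` names and the rsync-based copy are assumptions, and the launch env vars mirror the examples later in this thread):

```bash
#!/usr/bin/env bash
# Hypothetical snapshot-and-launch sketch: freeze the current nemo-rl working
# tree into a timestamped directory and launch from that frozen copy, so edits
# made after submission cannot change what a queued job runs.
set -euo pipefail

LOCAL_NEMORL_PATH=${LOCAL_NEMORL_PATH:?point this at your local nemo-rl checkout}
SNAPSHOT_ROOT=${SNAPSHOT_ROOT:-$HOME/nemo-rl-snapshots}   # assumed location
SNAPSHOT_DIR="$SNAPSHOT_ROOT/$(date +%Y%m%d-%H%M%S)"

mkdir -p "$SNAPSHOT_DIR"
# Copy the working tree (minus VCS metadata) into the snapshot directory.
rsync -a --exclude='.git' "$LOCAL_NEMORL_PATH/" "$SNAPSHOT_DIR/"

# Launch against the snapshot instead of the live checkout.
COMMAND="cd /opt/nemo-rl && uv run examples/run_sft.py" \
BASE_LOG_DIR="$SNAPSHOT_DIR/logs" \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$SNAPSHOT_DIR:/opt/nemo-rl" \
sbatch \
  --nodes=$NUM_ACTOR_NODES \
  --account=$ACCOUNT \
  --partition=$PARTITION \
  --time=$TIME \
  --gres=gpu:8 \
  "$SNAPSHOT_DIR/ray.sub"
```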
A user forked and cloned the NeMo Gym and NeMo RL repos. They would prefer to keep their training job scripts external to the NeMo RL repo directory, but they mentioned that running the scripts does not work if the scripts are not in the NeMo RL root directory.
@bxyu-nvidia may have more context; he also noted this in the NeMo Gym documentation here: https://github.com/NVIDIA-NeMo/Gym/blob/c5a1d26e83916a1e7b14bb651aea7ee5a0070cd8/docs/tutorials/rl-training-with-nemo-rl.md?plain=1#L146
@cwing-nvidia @bxyu-nvidia Here are some options that demonstrate launching without having to be inside the nemo-rl repo (this launch script lives outside the repo):
# Option 1: launch from outside the repo, using the nemo-rl copy pre-baked into the container at /opt/nemo-rl
COMMAND="cd /opt/nemo-rl && uv run examples/run_sft.py" \
BASE_LOG_DIR=$OUTSIDE_LOGDIR/outside-prebaked-logs \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$PWD:$PWD" \
sbatch \
--nodes=$NUM_ACTOR_NODES \
--account=$ACCOUNT \
--job-name=$ACCOUNT-rl:$(whoami)-ray-cluster-$(date +%N) \
--partition=$PARTITION \
--time=$TIME \
--gres=gpu:8 \
$LOCAL_NEMORL_PATH/ray.sub
# Option 2: launch from outside the repo, but mount my local nemo-rl checkout over /opt/nemo-rl
COMMAND="cd /opt/nemo-rl && uv run examples/run_sft.py" \
BASE_LOG_DIR=$OUTSIDE_LOGDIR/outside-local-logs \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$LOCAL_NEMORL_PATH:/opt/nemo-rl" \
sbatch \
--nodes=$NUM_ACTOR_NODES \
--account=$ACCOUNT \
--job-name=$ACCOUNT-rl:$(whoami)-ray-cluster-$(date +%N) \
--partition=$PARTITION \
--time=$TIME \
--gres=gpu:8 \
$LOCAL_NEMORL_PATH/ray.sub
# Option 3: launch script lives outside, but cd into the local nemo-rl dir for the sbatch invocation and don't override /opt/nemo-rl
cd $LOCAL_NEMORL_PATH
COMMAND="uv run examples/run_sft.py" \
BASE_LOG_DIR=$OUTSIDE_LOGDIR/outside-cd-logs \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$LOCAL_NEMORL_PATH:$LOCAL_NEMORL_PATH" \
sbatch \
--nodes=$NUM_ACTOR_NODES \
--account=$ACCOUNT \
--job-name=$ACCOUNT-rl:$(whoami)-ray-cluster-$(date +%N) \
--partition=$PARTITION \
--time=$TIME \
--gres=gpu:8 \
ray.sub
cd -
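For the case where the training job scripts themselves live outside the NeMo RL repo (the use case raised above), one possible adaptation of Option 2 is sketched below. This is an assumption-heavy sketch, not a verified recipe: `EXTERNAL_SCRIPTS_DIR` and `my_run_sft.py` are hypothetical names, it assumes `uv run` from `/opt/nemo-rl` can execute a script given by an absolute path, and whether the script's own config paths resolve correctly from outside the repo root is exactly the open question from the earlier comment.

```bash
# outside, with the training script itself kept external to the repo (sketch)
COMMAND="cd /opt/nemo-rl && uv run $EXTERNAL_SCRIPTS_DIR/my_run_sft.py" \
BASE_LOG_DIR=$OUTSIDE_LOGDIR/outside-external-script-logs \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$EXTERNAL_SCRIPTS_DIR:$EXTERNAL_SCRIPTS_DIR,$LOCAL_NEMORL_PATH:/opt/nemo-rl" \
sbatch \
  --nodes=$NUM_ACTOR_NODES \
  --account=$ACCOUNT \
  --job-name=$ACCOUNT-rl:$(whoami)-ray-cluster-$(date +%N) \
  --partition=$PARTITION \
  --time=$TIME \
  --gres=gpu:8 \
  $LOCAL_NEMORL_PATH/ray.sub
```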
Does your use case differ from this?
@cwing-nvidia @bxyu-nvidia can you take a look at Terry's solution?