Which configuration file do I need to modify so that other computers can access it via https://xxx.xxx.xx.xx:7200?
@cwing-nvidia I'd like to understand this use case more. Could you describe the ideal state here? Does an experiment involve a code change or is it just a hyperparameter change? What does "projects" mean here?
Let me also list out some patterns; I'd like to know what's not covered here so we can figure out next steps:
- Assuming a user wants to use nemo-rl OOTB with no modifications: you can build our container and, when you launch your nemo-rl run, first `cd /opt/nemo-rl` and then launch from that directory. This assumes you do not mount any external code in; you can set the checkpoint/log dir somewhere outside, and the "launch command" can be stored in some experiment dir.
- Another pattern we use is code snapshots (see this util for reference: https://github.com/NVIDIA-NeMo/RL/blob/main/tools/code_snapshot.sh). Since an experiment may require changing code, it is not safe to modify a local nemo-rl checkout: if you have multiple queued jobs launched with `sbatch ray.sub`, you won't know which state of the code each job will run, since a job launches with whatever state nemo-rl is in when it starts. For this reason, some users create snapshots via a mechanism similar to `code_snapshot.sh` before launching, so that their code is frozen at launch time (a minimal sketch of this pattern is shown right after this list).
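For illustration, here is a minimal, hypothetical sketch of the snapshot pattern (this is not the real `code_snapshot.sh`; the `SNAPSHOT_ROOT`/`SNAPSHOT_DIR` names and the rsync-based copy are assumptions, and the launch env vars mirror the examples later in this thread):

```bash
#!/usr/bin/env bash
# Hypothetical snapshot-and-launch sketch: freeze the current nemo-rl working
# tree into a timestamped directory and launch from that frozen copy, so edits
# made after submission cannot change what a queued job runs.
set -euo pipefail

LOCAL_NEMORL_PATH=${LOCAL_NEMORL_PATH:?point this at your local nemo-rl checkout}
SNAPSHOT_ROOT=${SNAPSHOT_ROOT:-$HOME/nemo-rl-snapshots}   # assumed location
SNAPSHOT_DIR="$SNAPSHOT_ROOT/$(date +%Y%m%d-%H%M%S)"

mkdir -p "$SNAPSHOT_DIR"
# Copy the working tree (minus VCS metadata) into the snapshot directory.
rsync -a --exclude='.git' "$LOCAL_NEMORL_PATH/" "$SNAPSHOT_DIR/"

# Launch against the snapshot instead of the live checkout.
COMMAND="cd /opt/nemo-rl && uv run examples/run_sft.py" \
BASE_LOG_DIR="$SNAPSHOT_DIR/logs" \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$SNAPSHOT_DIR:/opt/nemo-rl" \
sbatch \
  --nodes=$NUM_ACTOR_NODES \
  --account=$ACCOUNT \
  --partition=$PARTITION \
  --time=$TIME \
  --gres=gpu:8 \
  "$SNAPSHOT_DIR/ray.sub"
```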
A user forked and cloned the NeMo Gym and NeMo RL repos. They would prefer to keep their training job scripts external to the NeMo RL repo directory, but they mentioned that running the scripts does not work if the scripts are not in the NeMo RL root directory.
@bxyu-nvidia may have more context; he also noted this in the NeMo Gym documentation here: https://github.com/NVIDIA-NeMo/Gym/blob/c5a1d26e83916a1e7b14bb651aea7ee5a0070cd8/docs/tutorials/rl-training-with-nemo-rl.md?plain=1#L146
@cwing-nvidia @bxyu-nvidia Here are some options that demonstrate launching without having to be inside the nemo-rl repo (this launch script lives outside the repo):
# Option 1: launch from outside the repo, using the nemo-rl copy pre-baked into the container at /opt/nemo-rl
COMMAND="cd /opt/nemo-rl && uv run examples/run_sft.py" \
BASE_LOG_DIR=$OUTSIDE_LOGDIR/outside-prebaked-logs \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$PWD:$PWD" \
sbatch \
--nodes=$NUM_ACTOR_NODES \
--account=$ACCOUNT \
--job-name=$ACCOUNT-rl:$(whoami)-ray-cluster-$(date +%N) \
--partition=$PARTITION \
--time=$TIME \
--gres=gpu:8 \
$LOCAL_NEMORL_PATH/ray.sub
# Option 2: launch from outside the repo, but mount my local nemo-rl checkout over /opt/nemo-rl
COMMAND="cd /opt/nemo-rl && uv run examples/run_sft.py" \
BASE_LOG_DIR=$OUTSIDE_LOGDIR/outside-local-logs \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$LOCAL_NEMORL_PATH:/opt/nemo-rl" \
sbatch \
--nodes=$NUM_ACTOR_NODES \
--account=$ACCOUNT \
--job-name=$ACCOUNT-rl:$(whoami)-ray-cluster-$(date +%N) \
--partition=$PARTITION \
--time=$TIME \
--gres=gpu:8 \
$LOCAL_NEMORL_PATH/ray.sub
# Option 3: launch script lives outside, but cd into the local nemo-rl dir for the sbatch invocation and don't override /opt/nemo-rl
cd $LOCAL_NEMORL_PATH
COMMAND="uv run examples/run_sft.py" \
BASE_LOG_DIR=$OUTSIDE_LOGDIR/outside-cd-logs \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$LOCAL_NEMORL_PATH:$LOCAL_NEMORL_PATH" \
sbatch \
--nodes=$NUM_ACTOR_NODES \
--account=$ACCOUNT \
--job-name=$ACCOUNT-rl:$(whoami)-ray-cluster-$(date +%N) \
--partition=$PARTITION \
--time=$TIME \
--gres=gpu:8 \
ray.sub
cd -
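For the case where the training job scripts themselves live outside the NeMo RL repo (the use case raised above), one possible adaptation of Option 2 is sketched below. This is an assumption-heavy sketch, not a verified recipe: `EXTERNAL_SCRIPTS_DIR` and `my_run_sft.py` are hypothetical names, it assumes `uv run` from `/opt/nemo-rl` can execute a script given by an absolute path, and whether the script's own config paths resolve correctly from outside the repo root is exactly the open question from the earlier comment.

```bash
# outside, with the training script itself kept external to the repo (sketch)
COMMAND="cd /opt/nemo-rl && uv run $EXTERNAL_SCRIPTS_DIR/my_run_sft.py" \
BASE_LOG_DIR=$OUTSIDE_LOGDIR/outside-external-script-logs \
CONTAINER=$CONTAINER \
MOUNTS="/lustre:/lustre:ro,$EXTERNAL_SCRIPTS_DIR:$EXTERNAL_SCRIPTS_DIR,$LOCAL_NEMORL_PATH:/opt/nemo-rl" \
sbatch \
  --nodes=$NUM_ACTOR_NODES \
  --account=$ACCOUNT \
  --job-name=$ACCOUNT-rl:$(whoami)-ray-cluster-$(date +%N) \
  --partition=$PARTITION \
  --time=$TIME \
  --gres=gpu:8 \
  $LOCAL_NEMORL_PATH/ray.sub
```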
Does your use case differ from this?
@cwing-nvidia @bxyu-nvidia can you take a look at Terry's solution?