Issue with Multi-node training
Hello! I am trying to do multi-node training, but I keep running into a Ray error. I was wondering if you have seen this error before.
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/leonardo_work/EUHPC_E03_068/marianna/lf_datasets/openai/gsm8k/prepared/train.parquet', 'data.val_files=/leonardo_work/EUHPC_E03_068/marianna/lf_datasets/openai/gsm8k/prepared/test.parquet', 'data.train_batch_size=8', 'data.val_batch_size=4', 'data.max_prompt_length=1024', 'data.max_response_length=8192', 'actor_rollout_ref.model.path=/leonardo_work/EUHPC_E03_068/marianna/models/Qwen/Qwen2.5-7B-Instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=8', 'actor_rollout_ref.actor.ppo_micro_batch_size=4', 'actor_rollout_ref.actor.use_dynamic_bsz=True', 'actor_rollout_ref.actor.ppo_max_token_len_per_gpu=16384', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.actor.ulysses_sequence_parallel_size=1', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.grad_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.temperature=0.6', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.6', 'actor_rollout_ref.rollout.n=5', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[console]', 'trainer.project_name=deepscaler', 'trainer.experiment_name=openthinker-7b-8k-vanilla-4nodes', '+trainer.val_before_train=True', 'trainer.n_gpus_per_node=4', 'trainer.nnodes=1', 'trainer.save_freq=20', 'trainer.test_freq=20', 'trainer.default_hdfs_dir=null', 'trainer.total_epochs=30']
Traceback (most recent call last):
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/leonardo_work/EUHPC_E03_068/eguha/deepscaler/verl/verl/trainer/main_ppo.py", line 204, in <module>
main()
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/leonardo_work/EUHPC_E03_068/eguha/deepscaler/verl/verl/trainer/main_ppo.py", line 114, in main
ray.get(main_task.remote(config))
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py", line 2755, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py", line 906, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::main_task() (pid=1289300, ip=10.3.0.113)
File "/leonardo_work/EUHPC_E03_068/eguha/deepscaler/verl/verl/trainer/main_ppo.py", line 199, in main_task
trainer.init_workers()
File "/leonardo_work/EUHPC_E03_068/eguha/deepscaler/verl/verl/trainer/ppo/ray_trainer.py", line 510, in init_workers
wg_dict = self.ray_worker_group_cls(resource_pool=resource_pool, ray_cls_with_init=worker_dict_cls)
File "/leonardo_work/EUHPC_E03_068/eguha/deepscaler/verl/verl/single_controller/ray/base.py", line 197, in __init__
self._init_with_resource_pool(resource_pool=resource_pool,
File "/leonardo_work/EUHPC_E03_068/eguha/deepscaler/verl/verl/single_controller/ray/base.py", line 274, in _init_with_resource_pool
assert register_center_actor is not None, f"failed to get register_center_actor: {self.name_prefix}_register_center in {list_named_actors(all_namespaces=True)}"
AssertionError: failed to get register_center_actor: jkyhiD_register_center in [{'name': 'jkyhiDWorkerDict_0:0', 'namespace': '8b1068b4-c50e-4a92-9adc-fcfac21ad64b'}]
srun: error: lrdn0833: task 0: Exited with exit code 1
Here is my SLURM script:
#!/bin/bash -x
#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --account=EUHPC_E03_068
#SBATCH --partition=boost_usr_prod
#SBATCH --qos boost_qos_dbg
#SBATCH --threads-per-core=1
#SBATCH --time=0:20:00
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --output=/leonardo_work/EUHPC_E03_068/eguha/deepscaler/slurm_logs/%x_%j.out
set -x
# Warning: Export VLLM_ATTENTION_BACKEND on every machine before starting Ray cluster.
# vLLM without XFORMERS will result in CUDA errors.
export VLLM_ATTENTION_BACKEND=XFORMERS
# Parse command line arguments
while [[ $# -gt 0 ]]; do
case $1 in
--model)
MODEL_PATH="$2"
shift 2
;;
*)
break
;;
esac
done
# Set default model path if not provided
if [ -z "$MODEL_PATH" ]; then
MODEL_PATH="/leonardo_work/EUHPC_E03_068/marianna/models/Qwen/Qwen2.5-7B-Instruct"
fi
# module load nccl
# module load cuda/12.1
# module load gcc/12.2.0
# conda activate verl
CONDA_ENV="$WORK/marianna/envs/vllm"
MINICONDA_PATH="$WORK/marianna/miniconda/miniforge"
source ${MINICONDA_PATH}/bin/activate ${CONDA_ENV}
export HF_HOME="$SCRATCH/HF_cache"
export CURATOR_CACHE_DIR=$SCRATCH/curator_cache
mkdir -p $CURATOR_CACHE_DIR
export OUTLINES_CACHE_DIR=$WORK/marianna/outlines_cache/$SLURM_JOB_ID
mkdir -p $OUTLINES_CACHE_DIR
export TOKENIZERS_PARALLELISM=false
export CUDA_VISIBLE_DEVICES=0,1,2,3
export WORK="/leonardo_work/EUHPC_E03_068"
export NCCL_SOCKET_IFNAME=ib0
export GLOO_SOCKET_IFNAME=ib0
export NCCL_DEBUG=INFO
export NCCL_IB_TIMEOUT=60
export NCCL_NET_GDR_LEVEL=PIX # Use GPU Direct RDMA when GPU and NIC are on the same PCI switch
export NCCL_SOCKET_IFNAME=ib0
export NCCL_IB_TIMEOUT=120
export NCCL_DEBUG=INFO
# CONDA_ENV="$WORK/gsmyrnis/miniforge3/envs/sglang124"
# MINICONDA_PATH="$WORK/gsmyrnis/miniforge3"
# source ${MINICONDA_PATH}/bin/activate ${CONDA_ENV}
NUM_GPUS=4
export SLURM_GPUS_PER_NODE=$NUM_GPUS
export SLURM_CPUS_PER_TASK=32
# export CUDA_VISIBLE_DEVICES=0,1,2,3
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
head_node_i="${head_node}"
head_node_ip="$(nslookup "$head_node_i" | grep -oP '(?<=Address: ).*')"
export head_node_ip=$head_node_ip
echo "Head node: $head_node_ip"
port=20156
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" --gres=gpu:4 --cpus-per-task=32 \
ray start --head --node-ip-address="$head_node_ip" --port=$port --num-gpus $NUM_GPUS \
--num-cpus ${SLURM_CPUS_PER_TASK} --block &
sleep 10
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
node=${nodes_array[$i]}
node_i="${node}"
echo "Starting WORKER $i at $node"
this_node_ip="$(nslookup "$node_i" | grep -oP '(?<=Address: ).*')"
srun --nodes=1 --ntasks=1 -w "$node" --gres=gpu:4 --cpus-per-task=32 \
ray start --address "$ip_head" \
--node-ip-address="$this_node_ip" \
--num-cpus ${SLURM_CPUS_PER_TASK} --num-gpus $NUM_GPUS --block &
sleep 10
done
sleep 30
export RAY_ADDRESS="$head_node_ip:$port"
ray status
DATA_TRAIN=/leonardo_work/EUHPC_E03_068/marianna/lf_datasets/openai/gsm8k/prepared/train.parquet
DATA_VAL=/leonardo_work/EUHPC_E03_068/marianna/lf_datasets/openai/gsm8k/prepared/test.parquet
hostname
export HYDRA_FULL_ERROR=1
srun --overlap --nodes=1 --ntasks=1 -w "$head_node" python3 -u -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$DATA_TRAIN \
data.val_files=$DATA_VAL \
data.train_batch_size=8 \
data.val_batch_size=4 \
data.max_prompt_length=1024 \
data.max_response_length=8192 \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=8 \
actor_rollout_ref.actor.ppo_micro_batch_size=4 \
actor_rollout_ref.actor.use_dynamic_bsz=True \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=16384 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.grad_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.temperature=0.6 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name='deepscaler' \
trainer.experiment_name='openthinker-7b-8k-vanilla-4nodes' \
+trainer.val_before_train=True \
trainer.n_gpus_per_node=4 \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=20 \
trainer.default_hdfs_dir=null \
trainer.total_epochs=30 "${@:1}"
I'd appreciate any advice!
Multi-node GRPO only works with ray job submit <your-args> -- python3 -u -m verl.trainer.main_ppo ...
srun ray job submit --address "http://$RAY_ADDRESS" -- python3 -u -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$DATA_TRAIN \
data.val_files=$DATA_VAL \
data.train_batch_size=8 \
data.val_batch_size=4 \
data.max_prompt_length=1024 \
data.max_response_length=8192 \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.actor.optim.lr=1e-6 \
I tried this, but I'm still getting the following error:
srun: Job 13559592 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 13559592 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 13559592 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 13559592 step creation still disabled, retrying (Requested nodes are busy)
Do you have any examples?
Has it been solved?
I don't know the solution here.
I ran into a similar issue with single-node training. Do you have any insight on how to fix it?
AssertionError: failed to get register_center_actor: t2Lrtj_register_center in [{'name': 't2LrtjWorkerDict_0:0', 'namespace': '47191b06-6a6f-407c-96b2-afc6319e5bc5'}]
Same issue with single-node training.
I have solved this issue by simply allowing more time (increasing the value from 120 to 360) for ray.get_actor. Hopefully we can get more robust code here in the future. See line 288 in verl/single_controller/ray/base.py.
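For reference, that workaround just gives the driver more time to find the detached register-center actor before the assertion fires. Below is a minimal sketch of such a wait-and-retry loop; the helper function, timeout values, and environment variable are illustrative assumptions, not verl's actual code (which lives in verl/single_controller/ray/base.py):

import os
import time

import ray


def wait_for_named_actor(name, timeout_s=360.0, poll_s=1.0):
    # Hypothetical helper: poll ray.get_actor until the named actor shows up
    # or the deadline passes. REGISTER_CENTER_WAIT_S is an illustrative knob,
    # not a real verl or Ray option.
    timeout_s = float(os.getenv("REGISTER_CENTER_WAIT_S", timeout_s))
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            return ray.get_actor(name)  # raises ValueError while the actor is not registered yet
        except ValueError:
            time.sleep(poll_s)
    raise TimeoutError(f"named actor {name!r} did not appear within {timeout_s}s")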
Same issue with single-node training, and adding more time has no effect.
For the named_actor-not-found problem, please try the following solutions:
- There might be resources from previous runs that were not cleaned up (see named actor lifetimes): try relaunching the Ray cluster, using the CLIs, i.e., ray down ... or ray stop on every node, and then restart the cluster.
- The launch task for the named_actor might be delayed in the cluster for various reasons: try increasing the waiting time; we can make the default waiting time longer and configurable.
Let us know if the problem persists after trying the above solutions.
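To check the first point (leftover state from an earlier run), you can list the named actors the cluster still knows about before launching a new job. A small diagnostic sketch, assuming it is run from a node already attached to the same Ray cluster; the _register_center / WorkerDict name patterns are taken from the assertion message above:

import ray
from ray.util import list_named_actors

# Attach to the running cluster started by `ray start` (does not create a new one).
ray.init(address="auto")

# Detached named actors outlive the job that created them, so entries left over
# from a previous, uncleaned run will show up here and can collide with a new launch.
actors = list_named_actors(all_namespaces=True)  # returns [{'name': ..., 'namespace': ...}, ...]
leftovers = [a for a in actors
             if a["name"].endswith("_register_center") or "WorkerDict" in a["name"]]
print(leftovers)  # if non-empty, run `ray stop` on every node and restart the cluster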
@hongpeng-guo Does that mean I cannot simultaneously run two verl programs on the same node / ray cluster?
@skpig You should launch the two verl jobs in two separate clusters. As named actors are globally unique within each cluster, launching two jobs in the same cluster is undefined behavior.
It's not very common to launch multiple Ray clusters on a single physical node. You may need to do some research on resource isolation or use container tools.
Thanks for the quick response. Is it possible to launch two verl jobs in one cluster with different namespaces?
As named actors are globally unique within each cluster, launching two jobs in the same cluster is undefined behavior.
I'm not very familiar with Ray. Is the reason that two jobs might instantiate actors with identical names, potentially causing conflicts?
Yes, basically trying to create two named actors with the same name may cause problems. You can try applying different namespaces, but I am not sure whether there would be other resource-sharing problems if you do. verl doesn't support it, but you can create a fork and give it a try.
BTW, why do you need to run two verl jobs in a single cluster? You can create two separate Ray clusters if you have enough resources.
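For context on the namespace question: Ray named actors are only unique within a namespace, so two drivers that call ray.init with different namespaces can, in principle, each create an actor under the same name without colliding. A minimal standalone sketch of that behavior (plain Ray, nothing verl-specific; whether verl itself works this way is untested, as noted above):

import ray


@ray.remote
class RegisterCenter:
    def ping(self):
        return "ok"


# Driver A runs with namespace="job_a"; a second driver would use namespace="job_b".
ray.init(namespace="job_a")

# A named, detached actor; the name only has to be unique inside "job_a".
RegisterCenter.options(name="shared_name", lifetime="detached").remote()

# Lookups are namespace-scoped: this resolves job_a's actor, while a driver in
# namespace "job_b" could create its own "shared_name" independently.
actor = ray.get_actor("shared_name", namespace="job_a")
print(ray.get(actor.ping.remote()))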
why do you need to run two verl jobs in a single cluster?
It seems launching two clusters on one node is not recommended :) Thanks anyway! I will give it a try and see how it goes.