
Issue with Multi-node training

EtashGuha opened this issue 9 months ago

Hello! I am trying to do multi-node training, but I keep running into a Ray error. I was wondering if you have seen this error before.

Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/leonardo_work/EUHPC_E03_068/marianna/lf_datasets/openai/gsm8k/prepared/train.parquet', 'data.val_files=/leonardo_work/EUHPC_E03_068/marianna/lf_datasets/openai/gsm8k/prepared/test.parquet', 'data.train_batch_size=8', 'data.val_batch_size=4', 'data.max_prompt_length=1024', 'data.max_response_length=8192', 'actor_rollout_ref.model.path=/leonardo_work/EUHPC_E03_068/marianna/models/Qwen/Qwen2.5-7B-Instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=8', 'actor_rollout_ref.actor.ppo_micro_batch_size=4', 'actor_rollout_ref.actor.use_dynamic_bsz=True', 'actor_rollout_ref.actor.ppo_max_token_len_per_gpu=16384', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.actor.ulysses_sequence_parallel_size=1', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.grad_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.temperature=0.6', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.6', 'actor_rollout_ref.rollout.n=5', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[console]', 'trainer.project_name=deepscaler', 'trainer.experiment_name=openthinker-7b-8k-vanilla-4nodes', '+trainer.val_before_train=True', 'trainer.n_gpus_per_node=4', 'trainer.nnodes=1', 'trainer.save_freq=20', 'trainer.test_freq=20', 'trainer.default_hdfs_dir=null', 'trainer.total_epochs=30']
Traceback (most recent call last):
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/leonardo_work/EUHPC_E03_068/eguha/deepscaler/verl/verl/trainer/main_ppo.py", line 204, in <module>
    main()
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/leonardo_work/EUHPC_E03_068/eguha/deepscaler/verl/verl/trainer/main_ppo.py", line 114, in main
    ray.get(main_task.remote(config))
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py", line 2755, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/leonardo_work/EUHPC_E03_068/marianna/envs/vllm/lib/python3.10/site-packages/ray/_private/worker.py", line 906, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::main_task() (pid=1289300, ip=10.3.0.113)
  File "/leonardo_work/EUHPC_E03_068/eguha/deepscaler/verl/verl/trainer/main_ppo.py", line 199, in main_task
    trainer.init_workers()
  File "/leonardo_work/EUHPC_E03_068/eguha/deepscaler/verl/verl/trainer/ppo/ray_trainer.py", line 510, in init_workers
    wg_dict = self.ray_worker_group_cls(resource_pool=resource_pool, ray_cls_with_init=worker_dict_cls)
  File "/leonardo_work/EUHPC_E03_068/eguha/deepscaler/verl/verl/single_controller/ray/base.py", line 197, in __init__
    self._init_with_resource_pool(resource_pool=resource_pool,
  File "/leonardo_work/EUHPC_E03_068/eguha/deepscaler/verl/verl/single_controller/ray/base.py", line 274, in _init_with_resource_pool
    assert register_center_actor is not None, f"failed to get register_center_actor: {self.name_prefix}_register_center in {list_named_actors(all_namespaces=True)}"
AssertionError: failed to get register_center_actor: jkyhiD_register_center in [{'name': 'jkyhiDWorkerDict_0:0', 'namespace': '8b1068b4-c50e-4a92-9adc-fcfac21ad64b'}]
srun: error: lrdn0833: task 0: Exited with exit code 1

Here is my Slurm script:

#!/bin/bash -x
#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --account=EUHPC_E03_068
#SBATCH --partition=boost_usr_prod
#SBATCH --qos boost_qos_dbg
#SBATCH --threads-per-core=1
#SBATCH --time=0:20:00
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --output=/leonardo_work/EUHPC_E03_068/eguha/deepscaler/slurm_logs/%x_%j.out
set -x
# Warning: export VLLM_ATTENTION_BACKEND on every machine before starting the Ray cluster.
# vLLM without XFORMERS will result in CUDA errors.
export VLLM_ATTENTION_BACKEND=XFORMERS
# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --model)
            MODEL_PATH="$2"
            shift 2
            ;;
        *)
            break
            ;;
    esac
done
# Set default model path if not provided
if [ -z "$MODEL_PATH" ]; then
    MODEL_PATH="/leonardo_work/EUHPC_E03_068/marianna/models/Qwen/Qwen2.5-7B-Instruct"
fi
# module load nccl
# module load cuda/12.1
# module load gcc/12.2.0
# conda activate verl
CONDA_ENV="$WORK/marianna/envs/vllm"
MINICONDA_PATH="$WORK/marianna/miniconda/miniforge"
source ${MINICONDA_PATH}/bin/activate ${CONDA_ENV}
export HF_HOME="$SCRATCH/HF_cache"
export CURATOR_CACHE_DIR=$SCRATCH/curator_cache
mkdir -p $CURATOR_CACHE_DIR
export OUTLINES_CACHE_DIR=$WORK/marianna/outlines_cache/$SLURM_JOB_ID
mkdir -p $OUTLINES_CACHE_DIR
export TOKENIZERS_PARALLELISM=false
export CUDA_VISIBLE_DEVICES=0,1,2,3
export WORK="/leonardo_work/EUHPC_E03_068"
export NCCL_SOCKET_IFNAME=ib0
export GLOO_SOCKET_IFNAME=ib0
export NCCL_DEBUG=INFO
export NCCL_IB_TIMEOUT=120
export NCCL_NET_GDR_LEVEL=PIX # Use GPU Direct RDMA when GPU and NIC are on the same PCI switch
# CONDA_ENV="$WORK/gsmyrnis/miniforge3/envs/sglang124"
# MINICONDA_PATH="$WORK/gsmyrnis/miniforge3"
# source ${MINICONDA_PATH}/bin/activate ${CONDA_ENV}
NUM_GPUS=4
export SLURM_GPUS_PER_NODE=$NUM_GPUS
export SLURM_CPUS_PER_TASK=32
# export CUDA_VISIBLE_DEVICES=0,1,2,3
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
head_node_i="${head_node}"
head_node_ip="$(nslookup "$head_node_i" | grep -oP '(?<=Address: ).*')"
export head_node_ip=$head_node_ip
echo "Head node: $head_node_ip"
port=20156
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" --gres=gpu:4 --cpus-per-task=32 \
    ray start --head --node-ip-address="$head_node_ip" --port=$port --num-gpus $NUM_GPUS \
    --num-cpus ${SLURM_CPUS_PER_TASK}  --block &
sleep 10
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
    node=${nodes_array[$i]}
    node_i="${node}"
    echo "Starting WORKER $i at $node"
    this_node_ip="$(nslookup "$node_i" | grep -oP '(?<=Address: ).*')"
    srun --nodes=1 --ntasks=1 -w "$node" --gres=gpu:4 --cpus-per-task=32 \
        ray start --address "$ip_head" \
        --node-ip-address="$this_node_ip"  \
        --num-cpus ${SLURM_CPUS_PER_TASK} --num-gpus $NUM_GPUS --block &
    sleep 10
done
sleep 30
export RAY_ADDRESS="$head_node_ip:$port"
ray status
DATA_TRAIN=/leonardo_work/EUHPC_E03_068/marianna/lf_datasets/openai/gsm8k/prepared/train.parquet
DATA_VAL=/leonardo_work/EUHPC_E03_068/marianna/lf_datasets/openai/gsm8k/prepared/test.parquet
hostname
export HYDRA_FULL_ERROR=1
srun --overlap --nodes=1 --ntasks=1 -w "$head_node" python3 -u -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$DATA_TRAIN \
    data.val_files=$DATA_VAL \
    data.train_batch_size=8 \
    data.val_batch_size=4 \
    data.max_prompt_length=1024 \
    data.max_response_length=8192 \
    actor_rollout_ref.model.path=$MODEL_PATH  \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
    actor_rollout_ref.actor.ppo_micro_batch_size=4 \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=16384 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.grad_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.temperature=0.6 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['console'] \
    trainer.project_name='deepscaler' \
    trainer.experiment_name='openthinker-7b-8k-vanilla-4nodes' \
    +trainer.val_before_train=True \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.save_freq=20 \
    trainer.test_freq=20 \
    trainer.default_hdfs_dir=null \
    trainer.total_epochs=30 "${@:1}"

I'd appreciate any advice!

EtashGuha avatar Mar 05 '25 21:03 EtashGuha

Multi-node GRPO only works with ray job submit <your-args> -- python3 -u -m verl.trainer.main_ppo ...
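
For reference, a minimal sketch of what that could look like when run from the head node of the Slurm allocation above (8265 is Ray's default dashboard/job-submission port; the variables come from the script above and the overrides shown are only illustrative, not a verified configuration):

# Sketch only: submit the verl entrypoint to the already-running Ray cluster.
# `ray job submit` talks to the head node's dashboard port (8265 by default),
# not to the GCS port passed to `ray start --address`.
ray job submit --address "http://${head_node_ip}:8265" -- \
    python3 -u -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$DATA_TRAIN \
    data.val_files=$DATA_VAL \
    actor_rollout_ref.model.path=$MODEL_PATH \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=2
    # ...remaining overrides as in the original srun command; note that
    # trainer.nnodes should match the number of nodes in the Ray cluster
    # (the original command used nnodes=1 on a 2-node allocation).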

casper-hansen avatar Mar 06 '25 15:03 casper-hansen


srun ray job submit --address "http://$RAY_ADDRESS" -- python3 -u -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$DATA_TRAIN \
    data.val_files=$DATA_VAL \
    data.train_batch_size=8 \
    data.val_batch_size=4 \
    data.max_prompt_length=1024 \
    data.max_response_length=8192 \
    actor_rollout_ref.model.path=$MODEL_PATH  \
    actor_rollout_ref.actor.optim.lr=1e-6 \

I tried this, but I'm still getting the following error:

srun: Job 13559592 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 13559592 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 13559592 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 13559592 step creation still disabled, retrying (Requested nodes are busy)

Do you have any examples?
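
One possible explanation, judging from the --overlap flag already used for the original srun python3 step: the blocking ray start steps hold each node's CPUs and GPUs, so a new srun step launched without --overlap waits for resources. A rough sketch of a workaround (the flags and the 8265 dashboard port are assumptions, not a verified fix):

# Either run ray job submit directly in the batch script (no srun needed, since the
# batch step already runs on the first node), or let the step share resources:
srun --overlap --nodes=1 --ntasks=1 -w "$head_node" \
    ray job submit --address "http://${head_node_ip}:8265" -- \
    python3 -u -m verl.trainer.main_ppo
    # ...with the same overrides as in the sketch above appended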

EtashGuha avatar Mar 07 '25 20:03 EtashGuha

Has it been solved?

fufu1013 avatar Mar 08 '25 03:03 fufu1013

I don't know the solution here.

EtashGuha avatar Mar 08 '25 18:03 EtashGuha

I ran into a similar issue with single-node training. Do you have any insight into how to fix it?

AssertionError: failed to get register_center_actor: t2Lrtj_register_center in [{'name': 't2LrtjWorkerDict_0:0', 'namespace': '47191b06-6a6f-407c-96b2-afc6319e5bc5'}]

Mingyuan1997 avatar Mar 25 '25 02:03 Mingyuan1997

I ran into a similar issue with single-node training. Do you have any insight into how to fix it?

AssertionError: failed to get register_center_actor: t2Lrtj_register_center in [{'name': 't2LrtjWorkerDict_0:0', 'namespace': '47191b06-6a6f-407c-96b2-afc6319e5bc5'}]

Same issue with single-node training.

ftgreat avatar Mar 25 '25 10:03 ftgreat

I have solved this issue by simply allowing more time (increasing the timeout from 120 to 360) for ray.get_actor. Hopefully we can get more robust code here in the future.

See line 288 in verl/single_controller/ray/base.py.

Mingyuan1997 avatar Mar 25 '25 19:03 Mingyuan1997

Same issue with single-node training, and adding more time has no effect.

skpig avatar Apr 04 '25 15:04 skpig

For the named-actor-not-found problem, please try the following solutions:

  1. There might be resources from previous runs that were not cleaned up (see named actor lifetime): relaunch the Ray cluster by running ray stop (or ray down ... for a launcher-managed cluster) on every node, then start the cluster again; a sketch follows below.
  2. The launch task for the named actor might be delayed in the cluster for various reasons: try increasing the waiting time; we can make the default waiting time longer and configurable.

Let us know if the problem persists after trying the above solutions.
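
A minimal sketch of suggestion 1 under the Slurm setup from this thread (the variables are the ones from the script above, and the sleep is arbitrary):

# Stop any leftover Ray processes on every allocated node before restarting the cluster.
srun --overlap --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 ray stop --force
sleep 10
# ...then re-run the `ray start --head` / `ray start --address` blocks from the original script.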

hongpeng-guo avatar Apr 04 '25 21:04 hongpeng-guo

@hongpeng-guo Does that mean I cannot simultaneously run two verl programs on the same node / ray cluster?

skpig avatar Apr 05 '25 01:04 skpig

@skpig You should launch the two verl jobs in two separate clusters. As named actors are globally unique within each cluster, launching two jobs in the same cluster is undefined behavior.

It's not very common to launch multiple Ray clusters on a single physical node. You may need to do some research on resource isolation or use some container tools; a rough sketch of the idea follows below.
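
For illustration only, an untested sketch of that isolation idea on a 4-GPU node: two independent head nodes, each pinned to a disjoint set of GPUs and given its own ports and temp dir (all port numbers and paths are placeholders, and other default ports may also need adjusting):

# Cluster A: GPUs 0-1
CUDA_VISIBLE_DEVICES=0,1 ray start --head --port=6380 --dashboard-port=8266 \
    --ray-client-server-port=10001 --num-gpus=2 --temp-dir=/tmp/ray_cluster_a
# Cluster B: GPUs 2-3
CUDA_VISIBLE_DEVICES=2,3 ray start --head --port=6381 --dashboard-port=8267 \
    --ray-client-server-port=10002 --num-gpus=2 --temp-dir=/tmp/ray_cluster_b
# Each verl job is then submitted against its own cluster, e.g.
#   ray job submit --address http://127.0.0.1:8266 -- python3 -u -m verl.trainer.main_ppo ...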

hongpeng-guo avatar Apr 05 '25 01:04 hongpeng-guo

Thanks for the quick response. Is it possible to launch two verl jobs in one cluster with different namespaces?

skpig avatar Apr 05 '25 01:04 skpig

As named actors are globally unique within each cluster, launching two jobs in the same cluster is undefined behavior.

I'm not very familiar with Ray. Is the reason that two jobs might instantiate actors with identical names, potentially causing conflicts?

skpig avatar Apr 05 '25 01:04 skpig

As named actors are globally unique within each cluster, launching two jobs in the same cluster is undefined behavior.

I'm not very familiar with Ray. Is the reason that two jobs might instantiate actors with identical names, potentially causing conflicts?

Yes, basically trying to create two named actors with the same name may cause problems. You can try using different namespaces, but I am not sure whether there would be other resource-sharing issues if you do. verl doesn't support it, but you can create a fork and give it a try.

BTW, why do you need to run two verl jobs in a single cluster? You can create two separate Ray clusters if you have enough resources.

hongpeng-guo avatar Apr 05 '25 02:04 hongpeng-guo

why do you need to run two verl jobs in a single cluster?

It seems like launching two clusters on one node is not recommended :) Thanks anyway! I will give it a try and see how it goes.

skpig avatar Apr 05 '25 15:04 skpig