verl
I am not familiar with Ray.
How can I set CUDA_VISIBLE_DEVICES properly? If I set CUDA_VISIBLE_DEVICES in my environment and then run the training script, it has no effect.
Hi @guox18, just wondering why you need to set CUDA_VISIBLE_DEVICES when running the scripts? And which script are you running?
Hi @guox18, would you mind sharing more details about properly setting CUDA_VISIBLE_DEVICES? For example, how many actors do you have, and how do they map to GPU devices?
@PeterSH6, would you mind adding a "ray" label to this issue? I am triaging Ray-related issues in veRL. Thanks!
> How can I set CUDA_VISIBLE_DEVICES properly? If I set CUDA_VISIBLE_DEVICES in my environment and then run the training script, it has no effect.
I think setting CUDA_VISIBLE_DEVICES directly in the script works.
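For example, a minimal sketch of a launcher (a hypothetical standalone script, not the actual training command): the `export` matters, because a bare `CUDA_VISIBLE_DEVICES=...` assignment on its own line is not inherited by the `python3` process it launches.

```bash
#!/bin/bash
# Minimal sketch of a launcher (hypothetical; not the full training command).
# Export the variable so the python3 process (and the Ray workers it starts)
# inherit it; a bare `CUDA_VISIBLE_DEVICES=2,3,4,5` line would not propagate.
export CUDA_VISIBLE_DEVICES=2,3,4,5

# Quick check that only the intended GPUs are visible.
python3 -c "import torch; print(torch.cuda.device_count())"   # expect 4
```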
@kevin85421 Done. Thanks!
> Hi @guox18, would you mind sharing more details about properly setting CUDA_VISIBLE_DEVICES? For example, how many actors do you have, and how do they map to GPU devices?
@kevin85421 @PeterSH6 thanks for your replies.
I used a custom dataset and ran the script below. I set CUDA_VISIBLE_DEVICES to avoid GPU 0, because GPU 0 was in use by my classmate. However, the job still tried to occupy GPU 0, which caused an out-of-memory error.
I don't quite understand the device mapping; setting CUDA_VISIBLE_DEVICES just isn't working for me.
[the script]

```bash
set -x

data_path=/cpfs01/shared/llm_ddd/guoxu/code
CUDA_VISIBLE_DEVICES=2,3,4,5
cif_train_path=$data_path/data/cif/train.parquet
cif_test_path=$data_path/data/cif/test.parquet
train_files="['$cif_train_path']"
test_files="['$cif_test_path']"

python3 -m verl.trainer.main_ppo \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=16 \
    data.val_batch_size=32 \
    data.max_prompt_length=2048 \
    data.max_response_length=2048 \
    actor_rollout_ref.model.path=/cpfs01/shared/llm_ddd/guoxu/hf_hub/models/custom--Qwen2.5-7B-Instruct-tokenizer-modified/snapshots/bb46c15ee4bb56c5b63245ef50fd7637234d6f75_no_yarn \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.grad_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=/cpfs01/shared/llm_ddd/guoxu/hf_hub/models/custom--Qwen2.5-7B-Instruct-tokenizer-modified/snapshots/bb46c15ee4bb56c5b63245ef50fd7637234d6f75_no_yarn \
    critic.model.enable_gradient_checkpointing=True \
    critic.ppo_micro_batch_size_per_gpu=2 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.grad_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=wandb \
    trainer.project_name='verl_cif' \
    trainer.experiment_name='Qwen2.5-7B-Instruct_cif_ppo' \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.save_freq=100 \
    trainer.test_freq=50 \
    trainer.total_epochs=20 $@
```
[git log -1]
```
commit 27484a7bbbfd585f7a2c45c24f097d54751d91ee (HEAD -> main, origin/main, origin/HEAD)
```
Hi @guox18, I don't know how veRL launches a Ray cluster, but there are two methods:
Method 1: set `os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"` in the script before calling `ray.init()`.

```python
import os
import ray
import torch
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"

ray.init()
pg = placement_group([{"GPU": 1}, {"GPU": 1}, {"GPU": 1}])
ray.get(pg.ready(), timeout=10)

@ray.remote(num_gpus=1, num_cpus=0)
def f():
    assert torch.cuda.device_count() == 1
    return os.environ["CUDA_VISIBLE_DEVICES"]

# Schedule one task per bundle in the placement group.
tasks = [
    f.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
        )
    ).remote()
    for _ in range(3)
]
print(ray.get(tasks))

# [Example output]:
# 2025-03-03 02:09:59,645 INFO worker.py:1832 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
# ['2', '3', '1'] --> GPU 0 is not used.
```
Method 2: launch the Ray cluster with the desired GPUs first, then submit the job to it.

- `CUDA_VISIBLE_DEVICES=1,2,3 ray start --head --num-gpus=3`: launch a Ray node with 3 GPUs (GPU 1, GPU 2, GPU 3).
- `python3 test.py`: submit Ray tasks to the existing Ray cluster.

```python
import os
import ray
import torch
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()
pg = placement_group([{"GPU": 1}, {"GPU": 1}, {"GPU": 1}])
ray.get(pg.ready(), timeout=10)

@ray.remote(num_gpus=1, num_cpus=0)
def f():
    assert torch.cuda.device_count() == 1
    return os.environ["CUDA_VISIBLE_DEVICES"]

# Schedule one task per bundle in the placement group.
tasks = [
    f.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
        )
    ).remote()
    for _ in range(3)
]
print(ray.get(tasks))

# [Example output]
# 2025-03-03 02:13:26,938 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 172.31.9.244:6379...
# 2025-03-03 02:13:26,947 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
# ['3', '1', '2']
```
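If it helps, a minimal sketch for sanity-checking Method 2: `ray status` prints the resources the node registered, so the GPU count should read 3 before you submit anything.

```bash
# Sketch of the Method 2 launch plus a sanity check (assumes a single node).
CUDA_VISIBLE_DEVICES=1,2,3 ray start --head --num-gpus=3
ray status   # the "Resources" section should list 3.0 GPU
ray stop     # shut the local node down again when finished
```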