verl
I am not familiar with Ray.
How can I set CUDA_VISIBLE_DEVICES properly? If I set CUDA_VISIBLE_DEVICES in my environment and then run the training script, it has no effect.
Hi @guox18, just wondering why you need to set CUDA_VISIBLE_DEVICES when running the scripts? And which script are you running?
Hi @guox18, would you mind sharing more details about properly setting CUDA_VISIBLE_DEVICES? For example, how many actors do you have, and how do they map to GPU devices?
@PeterSH6, would you mind adding a "ray" label to this issue? I am triaging Ray-related issues in veRL. Thanks!
> How can I set CUDA_VISIBLE_DEVICES properly? If I set CUDA_VISIBLE_DEVICES in my environment and then run the training script, it has no effect.
I think setting CUDA_VISIBLE_DEVICES directly in the script works.
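For example, a minimal sketch of a launcher (a hypothetical standalone script, not the actual training command): the `export` matters, because a bare `CUDA_VISIBLE_DEVICES=...` assignment on its own line is not inherited by the `python3` process it launches.

```bash
#!/bin/bash
# Minimal sketch of a launcher (hypothetical; not the full training command).
# Export the variable so the python3 process (and the Ray workers it starts)
# inherit it; a bare `CUDA_VISIBLE_DEVICES=2,3,4,5` line would not propagate.
export CUDA_VISIBLE_DEVICES=2,3,4,5

# Quick check that only the intended GPUs are visible.
python3 -c "import torch; print(torch.cuda.device_count())"   # expect 4
```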
@kevin85421 Done. Thanks!
> Hi @guox18, would you mind sharing more details about properly setting CUDA_VISIBLE_DEVICES? For example, how many actors do you have, and how do they map to GPU devices?
@kevin85421 @PeterSH6 thanks for your replies.
I used a custom dataset and ran the script below. I set CUDA_VISIBLE_DEVICES to avoid GPU 0, because GPU 0 was in use by my classmate. However, the job still tried to occupy GPU 0, which caused an out-of-memory error.
I don't quite understand the device mapping; setting CUDA_VISIBLE_DEVICES just isn't working for me.
[the script]

```bash
set -x

data_path=/cpfs01/shared/llm_ddd/guoxu/code
CUDA_VISIBLE_DEVICES=2,3,4,5
cif_train_path=$data_path/data/cif/train.parquet
cif_test_path=$data_path/data/cif/test.parquet
train_files="['$cif_train_path']"
test_files="['$cif_test_path']"

python3 -m verl.trainer.main_ppo \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=16 \
    data.val_batch_size=32 \
    data.max_prompt_length=2048 \
    data.max_response_length=2048 \
    actor_rollout_ref.model.path=/cpfs01/shared/llm_ddd/guoxu/hf_hub/models/custom--Qwen2.5-7B-Instruct-tokenizer-modified/snapshots/bb46c15ee4bb56c5b63245ef50fd7637234d6f75_no_yarn \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.grad_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=/cpfs01/shared/llm_ddd/guoxu/hf_hub/models/custom--Qwen2.5-7B-Instruct-tokenizer-modified/snapshots/bb46c15ee4bb56c5b63245ef50fd7637234d6f75_no_yarn \
    critic.model.enable_gradient_checkpointing=True \
    critic.ppo_micro_batch_size_per_gpu=2 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.grad_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=wandb \
    trainer.project_name='verl_cif' \
    trainer.experiment_name='Qwen2.5-7B-Instruct_cif_ppo' \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.save_freq=100 \
    trainer.test_freq=50 \
    trainer.total_epochs=20 $@
```
[git log -1]
```
commit 27484a7bbbfd585f7a2c45c24f097d54751d91ee (HEAD -> main, origin/main, origin/HEAD)
```
Hi @guox18, I don't know how veRL launches a Ray cluster, but there are two methods:
Method 1: set `os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"` in the script before calling `ray.init()`.

```python
import os
import ray
import torch
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"

ray.init()
pg = placement_group([{"GPU": 1}, {"GPU": 1}, {"GPU": 1}])
ray.get(pg.ready(), timeout=10)

@ray.remote(num_gpus=1, num_cpus=0)
def f():
    assert torch.cuda.device_count() == 1
    return os.environ["CUDA_VISIBLE_DEVICES"]

# Schedule one task per bundle in the placement group.
tasks = [
    f.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
        )
    ).remote()
    for _ in range(3)
]
print(ray.get(tasks))

# [Example output]:
# 2025-03-03 02:09:59,645 INFO worker.py:1832 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
# ['2', '3', '1'] --> GPU 0 is not used.
```
Method 2: launch the Ray cluster with the desired GPUs first, then submit the job to it.

- `CUDA_VISIBLE_DEVICES=1,2,3 ray start --head --num-gpus=3`: launch a Ray node with 3 GPUs (GPU 1, GPU 2, GPU 3).
- `python3 test.py`: submit Ray tasks to the existing Ray cluster.

```python
import os
import ray
import torch
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()
pg = placement_group([{"GPU": 1}, {"GPU": 1}, {"GPU": 1}])
ray.get(pg.ready(), timeout=10)

@ray.remote(num_gpus=1, num_cpus=0)
def f():
    assert torch.cuda.device_count() == 1
    return os.environ["CUDA_VISIBLE_DEVICES"]

# Schedule one task per bundle in the placement group.
tasks = [
    f.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
        )
    ).remote()
    for _ in range(3)
]
print(ray.get(tasks))

# [Example output]
# 2025-03-03 02:13:26,938 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 172.31.9.244:6379...
# 2025-03-03 02:13:26,947 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
# ['3', '1', '2']
```
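If it helps, a minimal sketch for sanity-checking Method 2: `ray status` prints the resources the node registered, so the GPU count should read 3 before you submit anything.

```bash
# Sketch of the Method 2 launch plus a sanity check (assumes a single node).
CUDA_VISIBLE_DEVICES=1,2,3 ray start --head --num-gpus=3
ray status   # the "Resources" section should list 3.0 GPU
ray stop     # shut the local node down again when finished
```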