
I am not familiar with Ray

Open guox18 opened this issue 9 months ago • 7 comments

I wonder how I can set CUDA_VISIBLE_DEVICES properly? (If I set CUDA_VISIBLE_DEVICES in my environment and then run the training script, it has no effect.)

guox18 · Feb 22, 2025

Hi @guox18 , just wondering why you need to set CUDA_VISIBLE_DEVICES when running the scripts? And which script are you running with?

PeterSH6 · Feb 23, 2025

Hi @guox18, would you mind sharing more details about properly setting CUDA_VISIBLE_DEVICES? For example, how many actors do you have, and how do they map to GPU devices?

kevin85421 · Feb 24, 2025

@PeterSH6, would you mind adding a "ray" label to this issue? I am triaging Ray-related issues in veRL. Thanks!

kevin85421 · Feb 24, 2025

I wonder how I can set CUDA_VISIBLE_DEVICES properly? (If I set CUDA_VISIBLE_DEVICES in my environment and then run the training script, it has no effect.)

I think setting CUDA_VISIBLE_DEVICES directly in the script works.

[screenshots omitted]
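
A minimal sketch of what that could look like at the top of a launch script (the GPU ids are just an example; note the export, so that the variable reaches the python3 process and the Ray workers it starts):

    # Hypothetical excerpt of a launch script, not the exact content of the screenshots above
    export CUDA_VISIBLE_DEVICES=2,3,4,5   # export so child processes (python3 / Ray) inherit it

    python3 -m verl.trainer.main_ppo \
        trainer.n_gpus_per_node=4 \
        ...   # remaining options as in your own script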

BearBiscuit05 · Feb 25, 2025

@kevin85421 Done. Thanks!

PeterSH6 · Feb 25, 2025

Hi @guox18, would you mind sharing more details about properly setting CUDA_VISIBLE_DEVICES? For example, how many actors do you have, and how do they map to GPU devices?

@kevin85421 @PeterSH6 thanks for your replies.

I used a custom dataset and ran the script below. I set CUDA_VISIBLE_DEVICES to avoid using GPU 0, because GPU 0 was being used by my classmate. However, the job still tried to occupy GPU 0, which leads to an out-of-memory error.

I don't quite understand the device mapping; setting CUDA_VISIBLE_DEVICES just isn't working for me.

[the script]

    set -x
    data_path=/cpfs01/shared/llm_ddd/guoxu/code
    CUDA_VISIBLE_DEVICES=2,3,4,5
    cif_train_path=$data_path/data/cif/train.parquet
    cif_test_path=$data_path/data/cif/test.parquet
    train_files="['$cif_train_path']"
    test_files="['$cif_test_path']"

    python3 -m verl.trainer.main_ppo \
        data.train_files="$train_files" \
        data.val_files="$test_files" \
        data.train_batch_size=16 \
        data.val_batch_size=32 \
        data.max_prompt_length=2048 \
        data.max_response_length=2048 \
        actor_rollout_ref.model.path=/cpfs01/shared/llm_ddd/guoxu/hf_hub/models/custom--Qwen2.5-7B-Instruct-tokenizer-modified/snapshots/bb46c15ee4bb56c5b63245ef50fd7637234d6f75_no_yarn \
        actor_rollout_ref.actor.optim.lr=1e-6 \
        actor_rollout_ref.model.use_remove_padding=True \
        actor_rollout_ref.actor.ppo_mini_batch_size=8 \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
        actor_rollout_ref.model.enable_gradient_checkpointing=True \
        actor_rollout_ref.actor.fsdp_config.param_offload=False \
        actor_rollout_ref.actor.fsdp_config.grad_offload=False \
        actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
        actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
        actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
        actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
        actor_rollout_ref.ref.fsdp_config.param_offload=True \
        critic.optim.lr=1e-5 \
        critic.model.use_remove_padding=True \
        critic.model.path=/cpfs01/shared/llm_ddd/guoxu/hf_hub/models/custom--Qwen2.5-7B-Instruct-tokenizer-modified/snapshots/bb46c15ee4bb56c5b63245ef50fd7637234d6f75_no_yarn \
        critic.model.enable_gradient_checkpointing=True \
        critic.ppo_micro_batch_size_per_gpu=2 \
        critic.model.fsdp_config.param_offload=False \
        critic.model.fsdp_config.grad_offload=False \
        critic.model.fsdp_config.optimizer_offload=False \
        algorithm.kl_ctrl.kl_coef=0.001 \
        trainer.critic_warmup=0 \
        trainer.logger=wandb \
        trainer.project_name='verl_cif' \
        trainer.experiment_name='Qwen2.5-7B-Instruct_cif_ppo' \
        trainer.n_gpus_per_node=4 \
        trainer.nnodes=1 \
        trainer.save_freq=100 \
        trainer.test_freq=50 \
        trainer.total_epochs=20 $@

[git log -1]

  • commit 27484a7bbbfd585f7a2c45c24f097d54751d91ee (HEAD -> main, origin/main, origin/HEAD)
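
For reference, a quick way to sanity-check which GPUs Ray will register on this node (not part of the original report; it assumes a single-node setup and the export form of the variable):

    export CUDA_VISIBLE_DEVICES=2,3,4,5
    ray start --head    # Ray auto-detects only the GPUs visible through CUDA_VISIBLE_DEVICES
    ray status          # the resource summary should report 4.0 GPU
    ray stop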

guox18 · Mar 1, 2025

Hi @guox18, I'm not sure exactly how veRL launches its Ray cluster, but there are two methods:

  • Method 1: os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"

    import os
    import ray
    import torch
    
    from ray.util.placement_group import placement_group
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
    
    os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"
    ray.init()
    
    pg = placement_group([{"GPU": 1}, {"GPU": 1}, {"GPU": 1}])
    ray.get(pg.ready(), timeout=10)
    
    @ray.remote(num_gpus=1, num_cpus=0)
    def f():
        assert torch.cuda.device_count() == 1
        return os.environ["CUDA_VISIBLE_DEVICES"]
    
    
    # Submit one task per bundle in the placement group.
    tasks = [
        f.options(
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg,
            )
        ).remote()
        for _ in range(3)
    ]
    
    print(ray.get(tasks))
    # [Example output]:
    # 2025-03-03 02:09:59,645 INFO worker.py:1832 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
    # ['2', '3', '1'] --> GPU 0 is not used.
    
  • Method 2:

    • CUDA_VISIBLE_DEVICES=1,2,3 ray start --head --num-gpus=3: Launch a Ray node with 3 GPUs (GPU1, GPU2, GPU3)
    • python3 test.py: Submit Ray tasks to the existing Ray cluster.
      import os
      import ray
      import torch
      
      from ray.util.placement_group import placement_group
      from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
      
      ray.init()
      
      pg = placement_group([{"GPU": 1}, {"GPU": 1}, {"GPU": 1}])
      ray.get(pg.ready(), timeout=10)
      
      @ray.remote(num_gpus=1, num_cpus=0)
      def f():
          assert torch.cuda.device_count() == 1
          return os.environ["CUDA_VISIBLE_DEVICES"]
      
      
      # Submit one task per bundle in the placement group.
      tasks = [
          f.options(
              scheduling_strategy=PlacementGroupSchedulingStrategy(
                  placement_group=pg,
              )
          ).remote()
          for _ in range(3)
      ]
      
      print(ray.get(tasks))
      
      # [Example output]
      # 2025-03-03 02:13:26,938 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 172.31.9.244:6379...
      # 2025-03-03 02:13:26,947 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
      # ['3', '1', '2']
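
Tying this back to the script in this thread: with Method 2, the GPU restriction is applied when the Ray node is started, before the trainer is launched. A rough sketch, assuming the trainer's ray.init() attaches to the already-running cluster as in the output above (GPU ids taken from the script posted earlier; the script file name is hypothetical):

    # Start a Ray head node that can only see GPUs 2-5 (4 GPUs in total)
    CUDA_VISIBLE_DEVICES=2,3,4,5 ray start --head --num-gpus=4

    # Then launch the training run; trainer.n_gpus_per_node=4 matches the visible GPUs
    bash run_ppo.sh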
      

kevin85421 · Mar 3, 2025