[BUG] output tensor must have the same type as input tensor in PPO training script of TRL
Hi there, I am training a model with TRL PPO following https://github.com/huggingface/trl/blob/main/examples/scripts/ppo.py, with the following accelerate config:
deepspeed_zero3.yaml:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 4
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'yes'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
main_process_port: 29525
gpu_ids: 0,1,2,3,6,7
launch script:
accelerate launch \
--config_file=./trl/examples/accelerate_configs/deepspeed_zero3.yaml \
./trl/examples/scripts/ppo.py \
--model_name "mistralai/Mistral-7B-Instruct-v0.2" \
--optimize_cuda_cache True \
--batch_size 4 \
--gradient_accumulation_steps 4 \
--mini_batch_size 1 \
--log_with=wandb
It can distribute the model among multiple GPUs, which is fine.
However, there is an issue inside ppo_trainer.generate:
Traceback (most recent call last):
  File "/home/chenyanan/trl/examples/scripts/ppo_tp.py", line 208, in <module>
    response_tensors, ref_response_tensors = ppo_trainer.generate(
  File "/home/chenyanan/trl/trl/trainer/ppo_trainer.py", line 469, in generate
    response = self._generate_batched(
  File "/home/chenyanan/trl/trl/trainer/ppo_trainer.py", line 555, in _generate_batched
    with unwrap_model_for_generation(model, self.accelerator) as unwrapped_model:
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/chenyanan/trl/trl/models/utils.py", line 151, in unwrap_model_for_generation
    with deepspeed.zero.GatheredParameters(model.parameters()):
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2169, in __enter__
    self.params[0].all_gather(param_list=self.params)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1118, in all_gather
    return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1462, in _all_gather
    self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1766, in _allgather_params_coalesced
    h = dist.all_gather_into_tensor(allgather_params[param_idx],
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 305, in all_gather_into_tensor
    return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 219, in all_gather_into_tensor
    return self.all_gather_function(output_tensor=output_tensor,
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2709, in all_gather_into_tensor
    work = group._allgather_base(output_tensor, input_tensor, opts)
TypeError: output tensor must have the same type as input tensor
I printed batch["input_ids"] and found that the tensors are on cuda:0:
tensor([ 1, 995, 460, ..., 3177, 9116, 28747], device='cuda:0') torch.Size([7980])
tensor([ 1, 995, 460, ..., 3177, 9116, 28747], device='cuda:0') torch.Size([6101])
tensor([ 1, 995, 460, ..., 3177, 9116, 28747], device='cuda:0') torch.Size([8400])
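To check whether the error comes from parameters ending up in different dtypes (this is only a guess on my part, not a confirmed root cause), I suppose one could count the parameter dtypes right before the call to ppo_trainer.generate, something like:

from collections import Counter

# Guess: if the policy mixes dtypes (e.g. bf16 weights plus an fp32 value head),
# the coalesced ZeRO-3 all-gather would pair tensors of different types, which
# would match the TypeError above. `ppo_trainer` is the PPOTrainer instance
# from the example script.
dtype_counts = Counter(p.dtype for p in ppo_trainer.model.parameters())
print(dtype_counts)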
Any suggestions on how to apply DeepSpeed for multi-GPU PPO training? Thanks.
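One thing I am wondering about (just a guess, not a verified fix): loading every parameter in bf16 so that the ZeRO-3 all-gather never mixes dtypes. A rough sketch of what I mean (torch_dtype is simply forwarded to transformers' from_pretrained; v_head is the value head that TRL adds on top of the base model):

import torch
from trl import AutoModelForCausalLMWithValueHead

# Load the base weights directly in bf16 so they match the bf16 mixed-precision
# setup, then cast the value head (which is created separately and may default
# to fp32) to the same dtype before passing the model to PPOTrainer.
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.bfloat16,
)
model.v_head = model.v_head.to(torch.bfloat16)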
Use OpenRLHF https://github.com/OpenLLMAI/OpenRLHF (based on DeepSpeed, Ray, and vLLM).
@hijkzzz Thanks. Does it support PPO with lengthy contexts/prompts, such as more than 6K or even 10K tokens? Such examples cause crashes when using TRL.
yes
+1, same error here.
I am hitting the same error.