[BUG] output tensor must have the same type as input tensor in PPO training script of TRL
Hi there, I am training a model with TRL PPO following https://github.com/huggingface/trl/blob/main/examples/scripts/ppo.py, with the following accelerate config:
deepspeed_zero3.yaml:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 4
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'yes'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
main_process_port: 29525
gpu_ids: 0,1,2,3,6,7
launch script:
accelerate launch \
--config_file=./trl/examples/accelerate_configs/deepspeed_zero3.yaml \
./trl/examples/scripts/ppo.py \
--model_name "mistralai/Mistral-7B-Instruct-v0.2" \
--optimize_cuda_cache True \
--batch_size 4 \
--gradient_accumulation_steps 4 \
--mini_batch_size 1 \
--log_with=wandb
It can distribute the model among multiple GPUs, which is fine.
However, there is an issue inside ppo_trainer.generate:
Traceback (most recent call last):
  File "/home/chenyanan/trl/examples/scripts/ppo_tp.py", line 208, in <module>
    response_tensors, ref_response_tensors = ppo_trainer.generate(
  File "/home/chenyanan/trl/trl/trainer/ppo_trainer.py", line 469, in generate
    response = self._generate_batched(
  File "/home/chenyanan/trl/trl/trainer/ppo_trainer.py", line 555, in _generate_batched
    with unwrap_model_for_generation(model, self.accelerator) as unwrapped_model:
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/chenyanan/trl/trl/models/utils.py", line 151, in unwrap_model_for_generation
    with deepspeed.zero.GatheredParameters(model.parameters()):
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2169, in __enter__
    self.params[0].all_gather(param_list=self.params)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1118, in all_gather
    return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1462, in _all_gather
    self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1766, in _allgather_params_coalesced
    h = dist.all_gather_into_tensor(allgather_params[param_idx],
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 305, in all_gather_into_tensor
    return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 219, in all_gather_into_tensor
    return self.all_gather_function(output_tensor=output_tensor,
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2709, in all_gather_into_tensor
    work = group._allgather_base(output_tensor, input_tensor, opts)
TypeError: output tensor must have the same type as input tensor
I printed batch["input_ids"] and found that the tensors are on cuda:0:
tensor([ 1, 995, 460, ..., 3177, 9116, 28747], device='cuda:0') torch.Size([7980])
tensor([ 1, 995, 460, ..., 3177, 9116, 28747], device='cuda:0') torch.Size([6101])
tensor([ 1, 995, 460, ..., 3177, 9116, 28747], device='cuda:0') torch.Size([8400])
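To check whether the error comes from parameters ending up in different dtypes (this is only a guess on my part, not a confirmed root cause), I suppose one could count the parameter dtypes right before the call to ppo_trainer.generate, something like:

from collections import Counter

# Guess: if the policy mixes dtypes (e.g. bf16 weights plus an fp32 value head),
# the coalesced ZeRO-3 all-gather would pair tensors of different types, which
# would match the TypeError above. `ppo_trainer` is the PPOTrainer instance
# from the example script.
dtype_counts = Counter(p.dtype for p in ppo_trainer.model.parameters())
print(dtype_counts)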
Any suggestions on how to apply DeepSpeed for multi-GPU PPO training? Thanks.
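One thing I am wondering about (just a guess, not a verified fix): loading every parameter in bf16 so that the ZeRO-3 all-gather never mixes dtypes. A rough sketch of what I mean (torch_dtype is simply forwarded to transformers' from_pretrained; v_head is the value head that TRL adds on top of the base model):

import torch
from trl import AutoModelForCausalLMWithValueHead

# Load the base weights directly in bf16 so they match the bf16 mixed-precision
# setup, then cast the value head (which is created separately and may default
# to fp32) to the same dtype before passing the model to PPOTrainer.
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.bfloat16,
)
model.v_head = model.v_head.to(torch.bfloat16)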
Use OpenRLHF https://github.com/OpenLLMAI/OpenRLHF (based on DeepSpeed, Ray, and vLLM).
@hijkzzz Thanks. Does it support PPO with lengthy contexts/prompts, such as more than 6K or even 10K tokens? Such examples cause crashes when using TRL.
yes
+1, same error here.
I am hitting the same error.