PPOv2 Trainer with DeepSpeed ZeRO-3 CPU offload: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!
@younesbelkada @lvwerra @lewtun @kashif @vwxyzjn @edbeeching @qgallouedec @Michellehbn Hi, I run PPO with the PPOv2 trainer using the command given in examples/scripts/ppo/ppo.py, but with offload_optimizer_device and offload_param_device set to cpu in deepspeed_zero3.yaml (i.e. DeepSpeed ZeRO-3 with CPU offload) and no other changes. The following error occurs: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!
To reproduce the error:
Following is my deepspeed_zero3.yaml config (note offload_optimizer_device: cpu and offload_param_device: cpu):
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
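As an aside, a quick optional sanity check (a sketch assuming PyYAML is installed and the config lives at the path used in the launch command below) can confirm that CPU offload is really what accelerate will pick up:

```python
# Optional sanity check: load the accelerate config and print the DeepSpeed
# offload settings before launching. Assumes PyYAML is installed and the
# config sits at the path passed to --config_file below.
import yaml

with open("examples/accelerate_configs/deepspeed_zero3.yaml") as f:
    cfg = yaml.safe_load(f)

ds = cfg["deepspeed_config"]
print("zero_stage:", ds["zero_stage"])                              # expected: 3
print("offload_optimizer_device:", ds["offload_optimizer_device"])  # expected: cpu
print("offload_param_device:", ds["offload_param_device"])          # expected: cpu
```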
Following is my launch command:
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
examples/scripts/ppo/ppo.py \
--output_dir models/minimal/ppo \
--num_ppo_epochs 1 \
--num_mini_batches 1 \
--learning_rate 3e-6 \
--per_device_train_batch_size 5 \
--gradient_accumulation_steps 1 \
--total_episodes 10000 \
--model_name_or_path EleutherAI/pythia-1b-deduped \
--sft_model_path EleutherAI/pythia-1b-deduped \
--reward_model_path EleutherAI/pythia-1b-deduped \
--local_rollout_forward_batch_size 5 \
--non_eos_penalty
Following is the error:
Traceback (most recent call last):
File "/examples/scripts/ppo/ppo.py", line 115, in <module>
trainer.train()
File "trl/trainer/ppov2_trainer.py", line 494, in train
**accelerator.backward(loss)**
File "/lib/python3.10/site-packages/accelerate/accelerator.py", line 2151, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
self.engine.step()
File "/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2169, in step
self._take_model_step(lr_kwargs)
File "/python3.10/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
self.optimizer.step()
File "/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2047, in step
self.unscale_and_clip_grads(sub_group_id, scaled_global_grad_norm)
File "/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Note that the error is raised at accelerator.backward(loss) in trl/trainer/ppov2_trainer.py (https://github.com/huggingface/trl/blob/ddf4c8dc3ecf6d9ee2b24f94c62182ffd682c808/trl/trainer/ppov2_trainer.py#L472): RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
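For illustration only (this is not TRL or DeepSpeed code): the failing line in stage3.py multiplies the offloaded fp32 gradient partition, which lives on the CPU, by a scale that lives on the GPU. A minimal sketch of the same mismatch, assuming a CUDA device is available:

```python
# Minimal, self-contained sketch of the device mismatch (assumes CUDA is available).
# An in-place op between a CPU tensor and a CUDA tensor raises the same kind of
# "Expected all tensors to be on the same device" RuntimeError as in the traceback.
import torch

grad_on_cpu = torch.ones(4, dtype=torch.float32)     # stands in for the offloaded fp32 grad partition
combined_scale = torch.tensor(2.0, device="cuda:0")  # stands in for the GPU-resident grad scale

grad_on_cpu.mul_(1.0 / combined_scale)  # raises the device-mismatch RuntimeError
```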
Thanks a lot!
I have hit the same problem with DPO + FSDP. With ZeRO-3 it does not seem to work either; the GPU memory usage is not what I would expect.
Same error with KTO when setting offload_optimizer_device and offload_param_device to cpu. Have you solved it?
Hello @supermancmk, thanks for raising the issue! I am not able to reproduce it for some reason using your config and your command, with `--non_eos_penalty` replaced by `--missing_eos_penalty 1.0` (a recent refactor) 🤔
For reference, I'm running on commit 92eea1f2390fcf3c1a7c4338dfa2e574ce3374c2 and have the following env:
- `transformers` version: 4.44.2
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.10.14
- Huggingface_hub version: 0.25.0
- Safetensors version: 0.4.4
- Accelerate version: 0.34.0
- Accelerate config: not found
- DeepSpeed version: 0.15.1
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA H100 80GB HBM3
Could you try updating your deepspeed and trl versions and see if the problem persists?
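If it helps when re-testing, something like the snippet below (a hedged helper, not part of TRL) prints the relevant package versions so they can be compared against the environment listed above:

```python
# Print the package versions that matter for this report.
# Assumes trl, transformers, accelerate, deepspeed, and torch are installed.
import accelerate
import deepspeed
import torch
import transformers
import trl

for mod in (trl, transformers, accelerate, deepspeed, torch):
    print(f"{mod.__name__}: {mod.__version__}")
```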
Closing, as this could not be reproduced and there has been no recent activity.