Multi-GPU Full-Parameter DPO Training Gets Stuck
Environment:
transformers: 4.39.0.dev0
trl: 0.7.10
torch: 2.2.2
Hardware: 8 x H100 (80 GB)
I am encountering an issue where full-parameter DPO training on a multi-GPU setup gets stuck. The problem arises when I launch training with the accelerate CLI using DeepSpeed's ZeRO-3 configuration.
Steps to Reproduce:
Clone the Alignment Handbook repository:
git clone https://github.com/huggingface/alignment-handbook.git
cd alignment-handbook
Install dependencies:
pip install wheel
python -m pip install .
Launch the training script with the specified configuration:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml
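Before the full 8-GPU launch, a few sanity checks can help narrow down where the hang begins. The commands below are only a sketch: accelerate's env subcommand and the --num_processes override are generic accelerate CLI options, not part of the handbook recipe.

# Print the resolved accelerate environment to confirm the DeepSpeed ZeRO-3 config is being picked up.
accelerate env
# Confirm all eight GPUs are visible to the driver.
nvidia-smi --list-gpus
# Re-run the same recipe with fewer processes to check whether the hang is specific to the 8-GPU setup.
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes 2 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml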
Expected vs. Actual Behavior:
Expected: Training runs across all GPUs without interruption.
Actual: The process halts immediately after displaying the following user warning:
UserWarning: You passed a model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
After this warning, the process makes no further progress.
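To see where each rank is blocked, the following is a generic diagnostic sketch rather than anything specific to this recipe; py-spy and the NCCL_DEBUG variable are standard external tools, and <PID> is a placeholder for one of the hung run_dpo.py processes.

# Install py-spy and dump the Python stack of a hung rank to see which call it is blocked in
# (attaching may require sudo depending on ptrace settings).
pip install py-spy
py-spy dump --pid <PID>
# Relaunch with NCCL debug logging to check whether the ranks ever complete rendezvous and collective setup.
NCCL_DEBUG=INFO ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml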