DeepSpeed ZeRO-2 not working when using DPOTrainer
System Info
- `transformers` version: 4.44.2
- Platform: Linux-5.15.0-1058-aws-x86_64-with-glibc2.31
- Python version: 3.10.14
- Huggingface_hub version: 0.24.6
- Safetensors version: 0.4.5
- Accelerate version: 0.34.2
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: deepspeed
- Using GPU in script?: 8 GPUs
- GPU type: NVIDIA L40S
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder
- [ ] My own task or dataset (give details below)
Reproduction
The accelerate config file I'm using, `deepspeed_config.yaml`:
```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
use_cpu: false
```
The training script I'm using, `train.py`:
```python
import os

# Silence the tokenizers fork warning when using multiple dataloader workers
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it", attn_implementation="eager")
ref_model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it", attn_implementation="eager")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

dataset = load_dataset("json", data_files="dpo_train.json")

training_args = DPOConfig(
    report_to="none",
    output_dir="/data/models/gemma_dpo_checkpoints",
    per_device_train_batch_size=1,
    num_train_epochs=3,
    logging_dir="/data/logs",
    logging_steps=5,
    save_steps=100,
    max_length=1225,
    max_prompt_length=1225,
    save_total_limit=2,
    dataloader_num_workers=4,
    bf16=True,
)

# Load the trainer
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
)

# Train
trainer.train()
trainer.save_model("/data/models/gemma_dpo")
```
Run the script with accelerate:
```bash
accelerate launch --config_file deepspeed_config.yaml train.py
```
Expected behavior
ZeRO-2 is not applied: the DeepSpeed logs show the stage set to ZeRO-0.
mark
Hi, did you solve this problem? Same problem here.
Hello @EQ3A2A @TongLiu-github can you please share an example that reproduces the error with a public dataset I can test?
Thanks for the reply. I solved this problem by following: https://github.com/huggingface/accelerate/issues/314#issue-1201142707
Thanks @TongLiu-github - do I understand correctly that you were experiencing a NCCL timeout error instead?
The reason you are seeing Stage 0 in the logs is that we initialise the reference model at this stage unless Stage 3 is set by the user: https://github.com/huggingface/trl/blob/2cad48d511fab99ac0c4b327195523a575afcad3/trl/trainer/dpo_trainer.py#L923
In the screenshot below, I compare DDP vs ZeRO-3, and you can indeed see that the latter uses less memory.
If that resolves the issue, feel free to close it.
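For reference, here is a paraphrased sketch of the logic behind the linked line (not the verbatim TRL source, and the helper name below is made up for illustration): the reference model is frozen and only runs forward passes, so its DeepSpeed engine is initialised at Stage 0 unless the training model itself is sharded with Stage 3.

```python
from copy import deepcopy

import deepspeed


def _prepare_deepspeed_ref_model(accelerator, ref_model):
    # Sketch of the idea in trl's DPOTrainer._prepare_deepspeed (not the exact source).
    config_kwargs = deepcopy(accelerator.state.deepspeed_plugin.deepspeed_config)

    # The reference model has no optimizer state to shard. Unless the user requested
    # ZeRO-3 (parameter sharding), fall back to Stage 0 for this second engine --
    # this is the "Stage 0" that shows up in the DeepSpeed logs.
    if config_kwargs["zero_optimization"]["stage"] != 3:
        config_kwargs["zero_optimization"]["stage"] = 0

    ref_model, *_ = deepspeed.initialize(model=ref_model, config=config_kwargs)
    ref_model.eval()
    return ref_model
```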
Hello @lewtun. So, if I understand correctly, this is just an issue with how the logs are displayed, and ZeRO-2 is actually enabled, right?
Hi @Joe-Hall-Lee, yes that's correct: the logs from DeepSpeed are showing the initialisation of the reference model.
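If you want to double-check this in your own run, here is a minimal sketch (assuming accelerate's `DeepSpeedPlugin` exposes `zero_stage` and `deepspeed_config`, as recent versions do) that you could drop in right before `trainer.train()`:

```python
# Sanity check: the stage applied to the *training* model comes from the accelerate
# DeepSpeed plugin, regardless of the Stage 0 messages printed while the reference
# model is initialised.
plugin = trainer.accelerator.state.deepspeed_plugin
print("ZeRO stage for the training model:", plugin.zero_stage)  # expect 2 with the config above
print("zero_optimization section:", plugin.deepspeed_config.get("zero_optimization"))
```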
