
Deepspeed Zero2 not working when using DPOTrainer

Open · EQ3A2A opened this issue on Sep 12, 2024 · 5 comments

System Info

  • transformers version: 4.44.2
  • Platform: Linux-5.15.0-1058-aws-x86_64-with-glibc2.31
  • Python version: 3.10.14
  • Huggingface_hub version: 0.24.6
  • Safetensors version: 0.4.5
  • Accelerate version: 0.34.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: deepspeed
  • Using GPU in script?: 8 GPUs
  • GPU type: NVIDIA L40S

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder
  • [ ] My own task or dataset (give details below)

Reproduction

The accelerate config file I'm using

deepspeed_config.yaml

compute_environment: LOCAL_MACHINE
deepspeed_config:
 gradient_accumulation_steps: 1
 gradient_clipping: 1.0
 offload_optimizer_device: cpu
 offload_param_device: cpu
 zero3_init_flag: true
 zero_stage: 2
distributed_type: DEEPSPEED
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
use_cpu: false
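
For context, accelerate builds the actual DeepSpeed config from the deepspeed_config block above. A hedged approximation of the result, written out as a Python dict (the "auto" values are filled in from the TrainingArguments at runtime, and offload_param only takes effect under ZeRO stage 3):

# Rough sketch (not the exact generated config) of what accelerate's
# DeepSpeedPlugin derives from the YAML above.
expected_ds_config = {
    "bf16": {"enabled": True},                      # mixed_precision: bf16
    "zero_optimization": {
        "stage": 2,                                 # zero_stage: 2
        "offload_optimizer": {"device": "cpu"},     # offload_optimizer_device: cpu
        # offload_param_device is only relevant for ZeRO stage 3
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": "auto",       # from per_device_train_batch_size
    "train_batch_size": "auto",
}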

The training script I'm using

train.py


import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
import pdb
import torch
from accelerate import Accelerator


import warnings

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it", attn_implementation='eager')
ref_model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it", attn_implementation='eager')

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")


dataset = load_dataset("json", data_files="dpo_train.json")

training_args = DPOConfig(
    report_to="none",
    output_dir="/data/models/gemma_dpo_checkpoints",
    per_device_train_batch_size=1,  
    num_train_epochs=3,
    logging_dir='/data/logs',
    logging_steps=5,
    save_steps=100,
    max_length=1225,
    max_prompt_length=1225,
    save_total_limit=2,
    dataloader_num_workers=4,
    bf16=True,  
)

# load trainer
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
)

# train
trainer.train()

trainer.save_model("/data/models/gemma_dpo")

Run the script with accelerate

accelerate launch --config_file deepspeed_config.yaml train.py
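
To confirm which ZeRO stage the policy model is actually configured with, one can inspect the DeepSpeed plugin on the accelerator state. A minimal diagnostic sketch, assuming it is added to train.py after the DPOTrainer has been created (with the config above it should print 2):

# Diagnostic sketch: print the ZeRO stage that will be used for the policy model.
from accelerate.state import AcceleratorState

ds_plugin = AcceleratorState().deepspeed_plugin
if ds_plugin is not None:
    print("Configured ZeRO stage:", ds_plugin.deepspeed_config["zero_optimization"]["stage"])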

Expected behavior

ZeRO-2 should be active, but it does not appear to be working: the DeepSpeed logs show the stage set to 0.

[Screenshot: DeepSpeed initialisation logs reporting ZeRO stage 0]

EQ3A2A avatar Sep 12 '24 16:09 EQ3A2A

mark

kechunFIVE avatar Sep 18 '24 05:09 kechunFIVE

Hi, did you solve this problem? Same problem here.

TongLiu-github avatar Sep 22 '24 23:09 TongLiu-github

Hello @EQ3A2A @TongLiu-github, can you please share an example that reproduces the error with a public dataset I can test?

lewtun avatar Sep 23 '24 07:09 lewtun

> Hello @EQ3A2A @TongLiu-github, can you please share an example that reproduces the error with a public dataset I can test?

Thanks for the reply. I solved this problem by following https://github.com/huggingface/accelerate/issues/314#issue-1201142707.

TongLiu-github avatar Sep 23 '24 13:09 TongLiu-github
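
If the underlying problem is an NCCL timeout rather than the ZeRO stage itself (as the next reply suggests), a common workaround in the spirit of the linked accelerate issue is to enlarge the process-group timeout. A minimal sketch, assuming accelerate's InitProcessGroupKwargs is the mechanism and that this Accelerator is created at the top of the script so the Trainer reuses the same process group; the two-hour value is arbitrary:

# Sketch: initialise the distributed process group with a longer timeout so that
# long preprocessing or checkpointing steps on one rank do not trip NCCL's watchdog.
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=2))]
)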

Thanks @TongLiu-github - do I understand correctly that you were experiencing an NCCL timeout error instead?

The reason you are seeing Stage 0 in the logs is that we initialise the reference model at this stage unless Stage 3 is set by the user: https://github.com/huggingface/trl/blob/2cad48d511fab99ac0c4b327195523a575afcad3/trl/trainer/dpo_trainer.py#L923

In the screenshot below, I compare DDP vs ZeRO-3, and one can indeed see that the memory used by the latter is smaller.

[Screenshot: memory usage comparison, DDP vs ZeRO-3]

If that resolves the issue, feel free to close it.

lewtun avatar Sep 24 '24 08:09 lewtun
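
For reference, the linked line sits inside DPOTrainer's DeepSpeed preparation of the reference model. Condensed, the logic is roughly the following (a simplified paraphrase of the TRL source, not a drop-in snippet):

# Simplified paraphrase of trl's _prepare_deepspeed for the reference model:
# the plugin's config is copied and, unless the user asked for ZeRO-3,
# downgraded to stage 0 before ref_model is wrapped in its own engine.
from copy import deepcopy

import deepspeed
from accelerate.state import AcceleratorState

def prepare_ref_model(ref_model):
    config_kwargs = deepcopy(AcceleratorState().deepspeed_plugin.deepspeed_config)
    if config_kwargs["zero_optimization"]["stage"] != 3:
        # The "stage 0" lines in the DeepSpeed logs come from this fallback, not
        # from the policy model, which keeps the stage set in the accelerate config.
        config_kwargs["zero_optimization"]["stage"] = 0
    ref_model, *_ = deepspeed.initialize(model=ref_model, config=config_kwargs)
    ref_model.eval()
    return ref_model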

> Thanks @TongLiu-github - do I understand correctly that you were experiencing an NCCL timeout error instead?
>
> The reason you are seeing Stage 0 in the logs is that we initialise the reference model at this stage unless Stage 3 is set by the user: https://github.com/huggingface/trl/blob/2cad48d511fab99ac0c4b327195523a575afcad3/trl/trainer/dpo_trainer.py#L923
>
> In the screenshot below, I compare DDP vs ZeRO-3, and one can indeed see that the memory used by the latter is smaller.
>
> [Screenshot: memory usage comparison, DDP vs ZeRO-3]
>
> If that resolves the issue, feel free to close it.

Hello @lewtun. So, if I understand correctly, this is just an issue with how the logs are displayed, and ZeRO-2 is actually enabled, right?

Joe-Hall-Lee avatar Oct 11 '24 08:10 Joe-Hall-Lee

Hi @Joe-Hall-Lee, yes that's correct: the DeepSpeed logs are showing the initialisation of the reference model.

lewtun avatar Oct 17 '24 11:10 lewtun