
Qwen: DeepSpeed (ZeRO-3) + DPO error

zzong2006 opened this issue 11 months ago • 10 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

DPO training keeps failing before training even starts (target model: Qwen-14B-Chat).

What does the assertion assert len(set(t.dtype for t in tensors)) == 1 mean?
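
From the traceback it looks like every parameter partition being defragmented is expected to share a single dtype. A minimal standalone illustration (a sketch, not the actual DeepSpeed code):

# Illustration of the failing check in deepspeed/runtime/zero/stage3.py (defragment):
# all partition tensors must have the same dtype, otherwise the assert raises.
import torch

tensors = [torch.zeros(4, dtype=torch.bfloat16), torch.zeros(4, dtype=torch.float16)]
assert len(set(t.dtype for t in tensors)) == 1  # AssertionError: two distinct dtypes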

DeepSpeed config (v0.14.0)

{
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "steps_per_print": 1,
    "wall_clock_breakdown": false
}
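
For reference, a quick way to inspect which parameter dtypes the checkpoint actually loads with before DeepSpeed wraps it (plain transformers; the model id below is a placeholder):

# Hypothetical check: list the distinct parameter dtypes of the model that will be
# passed to deepspeed.initialize(). The assert in the traceback fires if this set
# contains more than one dtype.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-14B-Chat",        # placeholder; use the local checkpoint path
    torch_dtype="auto",
    trust_remote_code=True,      # required for the original Qwen models
)
print({p.dtype for p in model.parameters()})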

Error Message

[2024-03-10 18:29:22,849] [INFO] [utils.py:800:see_memory_usage] Stage 3 initialize beginning
[2024-03-10 18:29:22,850] [INFO] [utils.py:801:see_memory_usage] MA 8.48 GB         Max_MA 11.39 GB         CA 33.85 GB         Max_CA 34 GB 
[2024-03-10 18:29:22,850] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 42.89 GB, percent = 4.3%
[2024-03-10 18:29:22,852] [INFO] [stage3.py:130:__init__] Reduce bucket size 500,000,000
[2024-03-10 18:29:22,852] [INFO] [stage3.py:131:__init__] Prefetch bucket size 50,000,000
[2024-03-10 18:29:23,060] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-03-10 18:29:23,061] [INFO] [utils.py:801:see_memory_usage] MA 8.48 GB         Max_MA 8.48 GB         CA 33.85 GB         Max_CA 34 GB 
[2024-03-10 18:29:23,061] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 42.89 GB, percent = 4.3%
Parameter Offload: Total persistent parameters: 1029120 in 201 params
[2024-03-10 18:29:23,295] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-03-10 18:29:23,296] [INFO] [utils.py:801:see_memory_usage] MA 8.48 GB         Max_MA 8.48 GB         CA 33.85 GB         Max_CA 34 GB 
[2024-03-10 18:29:23,297] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 42.89 GB, percent = 4.3%
[2024-03-10 18:29:23,505] [INFO] [utils.py:800:see_memory_usage] Before creating fp16 partitions
[2024-03-10 18:29:23,506] [INFO] [utils.py:801:see_memory_usage] MA 8.48 GB         Max_MA 8.48 GB         CA 33.85 GB         Max_CA 34 GB 
[2024-03-10 18:29:23,507] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 42.89 GB, percent = 4.3%

Traceback (most recent call last):

  File "/myenv/lib/python3.10/site-packages/llmtuner/train/tuner.py", line 37, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/myenv/lib/python3.10/site-packages/llmtuner/train/dpo/workflow.py", line 50, in run_dpo
    trainer = CustomDPOTrainer(
  File "/myenv/lib/python3.10/site-packages/llmtuner/train/dpo/trainer.py", line 60, in __init__
    self.ref_model = self._prepare_deepspeed(self.ref_model)
  File "/myenv/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 447, in _prepare_deepspeed
    model, *_ = deepspeed.initialize(model=model, config=config_kwargs)
  File "/myenv/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 307, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1256, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1579, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 317, in __init__
    self._create_fp16_partitions_with_defragmentation(self.trainable_param_groups)
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 697, in _create_fp16_partitions_with_defragmentation
    device_buffer = __class__.defragment(parameter_partitions)
  File "/myenv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 529, in defragment
    assert len(set(t.dtype for t in tensors)) == 1
AssertionError

Expected behavior

No response

System Info

  • python==3.10
  • llmtuner==0.5.3
  • deepspeed==0.14.0
  • transformers==4.38.2
  • cuda==12.3
  • hardware: 8× A100 80GB

Others

No response

zzong2006 · Mar 10 '24

Training config

{
    "stage": "dpo",
    "do_train": true,
    "model_name_or_path": "(removed)",
    "dataset": "(removed)",
    "dataset_dir": "../data",
    "template": "qwen",
    "finetuning_type": "full",
    "output_dir": "(removed)",
    "overwrite_cache": true,
    "overwrite_output_dir": true,
    "cutoff_len": 4096,
    "val_size": 0,
    "evaluation_strategy": "no",
    "fp16_full_eval": true,
    "per_device_eval_batch_size": 4,
    "eval_accumulation_steps": 4,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "gradient_checkpointing": true,
    "save_only_model": true,
    "save_safetensors": true,
    "lr_scheduler_type": "cosine",
    "logging_steps": 1,
    "save_strategy": "epoch",
    "learning_rate": 1e-6,
    "save_total_limit": 1,
    "num_train_epochs": 1,
    "warmup_ratio": 0.05,
    "weight_decay": 0.01,
    "plot_loss": true,
    "accelerator_config": {
        "dispatch_batches": false
    },
    "use_fast_tokenizer": true,
    "resume_from_checkpoint": false,
    "report_to": "wandb",
    "deepspeed": "ds_zero3_dpo.json",
    "bf16": true
}

zzong2006 · Mar 10 '24

I have the same problem with Qwen1.5-14B-Chat DPO training. Have you solved it yet?

Fenglly · Mar 11 '24

Have you tried --fp16 True?

In the stage-3 config: "fp16": { "enabled": true }
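
For reference, the full block that suggestion corresponds to in the ZeRO-3 JSON would look roughly like this (a sketch; enable only one of fp16/bf16 and keep it consistent with the --fp16/--bf16 training flag):

"fp16": {
    "enabled": true
},
"bf16": {
    "enabled": false
}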

sparkfax · Mar 26 '24

Hi, have you solved the issue? I have met the exact same problem.

XpastaX · May 23 '24

Hi, have you solved the issue? I have met the exact same problem.

Environment: Python 3.10, llamafactory 0.7.2.dev0, torch 2.3.0, transformers 4.37.2 / 4.41, deepspeed 0.13.0 / 0.14.0, CUDA 12.1

tried:

  1. --bf16
  2. --fp16

XpastaX · May 23 '24

Hi, have you solved the issue? I have met the exact same problem.

I've just been trying this as well, and I found that the error occurs whenever the ref_model is loaded through _prepare_deepspeed. I've tried both ZeRO-2 and ZeRO-3 : (

HAOChuzhan · May 23 '24

Hi, have you solved the issue? I have met the exact same problem.

I've just been trying this as well, and I found that the error occurs whenever the ref_model is loaded through _prepare_deepspeed. I've tried both ZeRO-2 and ZeRO-3 : (

I went through the source code, and it looks like a trl problem: the error happens when it initializes DeepSpeed with the ref_model. My understanding is that the ref_model should be frozen and not trained, but trl just sets ref_model = model without making a copy and then initializes DeepSpeed with that ref_model directly. If nothing else works, try quantizing the ref model to 4-bit or 8-bit so it skips DeepSpeed entirely.
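
For anyone who wants to try that route, a rough sketch of loading the reference model in 4-bit with transformers + bitsandbytes (illustrative only; the model id is a placeholder, and whether LLaMA-Factory wires a quantized ref model into its DPO trainer depends on the version):

# Hypothetical 4-bit ref model load, so the frozen reference model does not need
# to go through deepspeed.initialize() at all.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
ref_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-14B-Chat",        # placeholder model id / local path
    quantization_config=bnb_config,
    trust_remote_code=True,
)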

XpastaX · May 23 '24

It is probably an offload problem; ZeRO stage 3 with no offload works for me.

deepspeed --master_port 25002 --include "localhost:4,5,6,7" src/train.py \
--model_name_or_path ${model_path} \
--stage 'dpo' \
--do_train \
--finetuning_type 'full' \
--dpo_ftx 1.0 \
--ddp_timeout 180000000 \
--deepspeed examples/deepspeed/ds_z3_config.json \
--dataset $dataset \
--template llama2 \
--cutoff_len 4096 \
--max_samples 10000000000 \
--overwrite_cache \
--output_dir xxxxx \
--logging_steps 10 \
--save_strategy 'no' \
--plot_loss \
--overwrite_output_dir true \
--per_device_train_batch_size  1 \
--gradient_accumulation_steps  8 \
--learning_rate  0.000005 \
--num_train_epochs  2.0 \
--lr_scheduler_type  cosine \
--warmup_steps 500  \
--fp16 |tee ${name}.log
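
For comparison with the config at the top of this issue, the no-offload variant essentially just drops the offload_optimizer block from zero_optimization (a sketch, not necessarily identical to examples/deepspeed/ds_z3_config.json):

"zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
}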

XpastaX · May 24 '24

It is probably an offload problem; ZeRO stage 3 with no offload works for me.

deepspeed --master_port 25002 --include "localhost:4,5,6,7" src/train.py \
--model_name_or_path ${model_path} \
--stage 'dpo' \
--do_train \
--finetuning_type 'full' \
--dpo_ftx 1.0 \
--ddp_timeout 180000000 \
--deepspeed examples/deepspeed/ds_z3_config.json \
--dataset $dataset \
--template llama2 \
--cutoff_len 4096 \
--max_samples 10000000000 \
--overwrite_cache \
--output_dir xxxxx \
--logging_steps 10 \
--save_strategy 'no' \
--plot_loss \
--overwrite_output_dir true \
--per_device_train_batch_size  1 \
--gradient_accumulation_steps  8 \
--learning_rate  0.000005 \
--num_train_epochs  2.0 \
--lr_scheduler_type  cosine \
--warmup_steps 500  \
--fp16 |tee ${name}.log

I was already using the ds_z3_config configuration, but I passed a ref_model path in the arguments, and the error appears once the ref_model goes through _prepare_deepspeed initialization. In your setup, does the ref_model no longer raise the error after being initialized by _prepare_deepspeed? Note: my experiments above were run with LoRA.

HAOChuzhan · May 24 '24