
[BUG] is_zero_init_model is always False when I'm using zero_init!

CHNRyan opened this issue 1 year ago · 5 comments

Describe the bug
When I'm fine-tuning llama2 with DeepSpeed ZeRO-3, I set "zero3_init_flag: true" in my accelerate config. "is_deepspeed_zero3_enabled()" in transformers/integrations/deepspeed.py also evaluates to True, but "is_zero_init_model" is False in _configure_distributed_model of deepspeed/runtime/engine.py. Is this abnormal?
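
(As a diagnostic, a minimal sketch, assuming base_model has been loaded with from_pretrained as in the repro below: parameters partitioned by zero.Init carry DeepSpeed's ds_id attribute, which is essentially what _configure_distributed_model checks.)

from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

# True when transformers detected a ZeRO-3 config at load time
print("zero3 enabled:", is_deepspeed_zero3_enabled())
# True only if zero.Init actually partitioned the weights
print("zero.Init applied:", any(hasattr(p, "ds_id") for p in base_model.parameters()))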

To Reproduce
Here is my code:

import os  # needed for os.path.join when saving the final checkpoint

from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments
import bitsandbytes as bnb
from peft import LoraConfig
from trl import SFTTrainer

base_model_name ="/home/yangtong/data/llama2-hf/llama2-13b-chat_hf"

dataset = load_dataset("json",data_files="Belle_open_source_0.5M_changed.json",split="train")

result_dir = "tmp"
training_args = TrainingArguments(
    report_to="none",
    output_dir=result_dir, 
    # per_device_train_batch_size * gradient_accumulation_steps = batch_size
    per_device_train_batch_size=1, 
    gradient_accumulation_steps=16, 
    learning_rate=2e-4, 
    logging_steps=10, 
    # max_steps=520, 
    num_train_epochs=0.016, 
    save_steps=500, 
    bf16 = True,  # set bf16 to True with an A100
    # optim='paged_adamw_32bit',
    gradient_checkpointing=True
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_use_double_quant=True, 
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_compute_dtype=torch.bfloat16, 
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name, 
    quantization_config=bnb_config, 
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names: # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
models = find_all_linear_names(base_model)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=models
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True
tokenizer.pad_token = tokenizer.eos_token

max_seq_length = 512  
trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args
)

trainer.train()

output_dir = os.path.join(result_dir, "final_checkpoint")
trainer.model.save_pretrained(output_dir)

Here is my accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /home/yangtong/ft_dis/ds_config/3.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: 'c10d'
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Here is my deepspeed config:

{
  "optimizer": {
    "type": "AdamW",
    "params": {
        "lr": 2e-4,
        "betas": [
          0.9,
          0.999
        ],
        "eps": "auto",
        "weight_decay": "auto",
        "adam_w_mode": true,
        "torch_adam": true
    }
  },
  
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
        "warmup_min_lr": "auto",
        "warmup_max_lr": "auto",
        "warmup_num_steps": "auto",
        "total_num_steps": "auto"
    }
  },
  
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "offload_optimizer": {
      "device": "none",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "sub_group_size": 1e9,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": true
  },
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16,
  "wall_clock_breakdown": false
}
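
(A quick sanity check on "auto": DeepSpeed resolves train_batch_size as train_micro_batch_size_per_gpu × gradient_accumulation_steps × world size, so the effective batch size depends on how many processes actually launch.)

micro_batch, grad_accum = 1, 16          # values from this config
for world_size in (1, 4):                # launcher passes --num_processes 1; accelerate config says 4
    print(world_size, micro_batch * grad_accum * world_size)  # -> 16 or 64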

Expected behavior
Parameters should be partitioned first and then loaded to the GPUs.

System info (please complete the following information):

  • OS: Ubuntu 22.04.4 LTS (Linux 5.15.0-106-generic)
  • CPU: 2 x Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
  • Python version: 3.10.13
  • PyTorch version: 2.2.2
  • CUDA version: 11.8.0
  • bitsandbytes==0.43.0
  • huggingface_hub==0.23.2
  • accelerate==0.30.1
  • transformers==4.41.1
  • peft==0.9.0
  • deepspeed==0.14.0

Launcher context

accelerate launch \
--config_file "config/z3_3.yaml" \
--num_processes 1 \
ft_acc.py

Here is engine.py: [screenshot of the is_zero_init_model check in _configure_distributed_model, deepspeed/runtime/engine.py]

I would truly appreciate it if anyone could help me solve this! @loadams @tjruwase @deepcharm

CHNRyan · Jun 08 '24

Maybe this link will help: https://huggingface.co/docs/transformers/main/en/deepspeed?models=pretrained+model#non-trainer-deepspeed-integration

Taiinguyenn139 · Jun 09 '24

> Maybe this link will help: https://huggingface.co/docs/transformers/main/en/deepspeed?models=pretrained+model#non-trainer-deepspeed-integration

@Taiinguyenn139 Thanks for your reply! I have tried it, but it still fails. Here is my code; maybe it is not correct:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'

from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments
import bitsandbytes as bnb
from peft import LoraConfig
from trl import SFTTrainer

from accelerate import Accelerator
accelerator = Accelerator()

from transformers.integrations import HfDeepSpeedConfig
import deepspeed
ds_config = "ds_config/3.json"
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive so ZeRO-3 is detected during from_pretrained

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_compute_dtype=torch.bfloat16, 
)

base_model_name ="/home/yangtong/data/llama2-hf/llama2-13b-chat_hf"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config, 
    torch_dtype=torch.bfloat16
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1
# engine = deepspeed.initialize(model=base_model, config_params=ds_config)

dataset = load_dataset("json",data_files="Belle_open_source_0.5M_changed.json",split="train")

result_dir = "tmp"
training_args = TrainingArguments(
    report_to="wandb",
    output_dir=result_dir, 
    # per_device_train_batch_size * gradient_accumulation_steps = batch_size
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    logging_steps=10, 
    # max_steps=520, 
    num_train_epochs=0.037,
    save_steps=500,  # 65
    bf16 = True,
    # optim='paged_adamw_32bit',
    gradient_checkpointing=True,
    # group_by_length=True,
    # remove_unused_columns=False,
    # warmup_ratio=0.03,
    # lr_scheduler_type='constant',
    # max_grad_norm=0.3
)

models = ['v_proj', 'gate_proj', 'down_proj', 'k_proj', 'q_proj', 'o_proj', 'up_proj']

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=models
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True
tokenizer.pad_token = tokenizer.eos_token

max_seq_length = 512  
trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args
)

trainer.train()

output_dir = os.path.join(result_dir, "final_checkpoint")
trainer.model.save_pretrained(output_dir)
# trainer.save_model(output_dir)  # Stage-3

I have some questions about the approach you provided: (1) The situation in that link is "Non-Trainer DeepSpeed integration". Since I use SFTTrainer in my code, doesn't that count as the Trainer integration? (2) I'm using accelerate, and in my original code I set TrainingArguments before from_pretrained, following https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/deepspeed#constructing-massive-models. Is it still necessary to set HfDeepSpeedConfig?
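
(For reference, a minimal sketch of that documented ordering under the plain Trainer/deepspeed= route rather than the accelerate-config route, reusing paths from this thread: constructing TrainingArguments with the DeepSpeed config before from_pretrained is what lets transformers enter zero.Init automatically.)

from transformers import AutoModelForCausalLM, TrainingArguments

# Register the DeepSpeed config first...
training_args = TrainingArguments(
    output_dir="tmp",
    deepspeed="ds_config/3.json",  # ZeRO-3 config from this thread
    bf16=True,
)
# ...so this load happens under zero.Init and the weights are created
# directly as ZeRO-3 partitions instead of materializing whole per rank.
base_model = AutoModelForCausalLM.from_pretrained(
    "/home/yangtong/data/llama2-hf/llama2-13b-chat_hf"
)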

CHNRyan · Jun 10 '24

(1) In my experience, you can run ZeRO-3 with SFTTrainer or Trainer. (2) I don't use accelerate; I use the deepspeed command like this:

deepspeed train.py

You don't need to set HfDeepSpeedConfig.

(3) To be more clear: ZeRO stage 3 won't shard your params because you are using QLoRA, as discussed in this post: https://www.reddit.com/r/LocalLLaMA/comments/1ai5mv3/thoughts_on_qlora_with_fsdp/. It just offloads your params to the CPU. So is_zero_init_model always being False may be the expected behavior.
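
(An illustrative sketch of that gating, paraphrasing the logic transformers applied around this release: zero.Init is skipped when the model is loaded quantized, which matches what this thread observes. choose_init_context is a hypothetical name; the config path is the one from this thread.)

import contextlib
import deepspeed

def choose_init_context(zero3_enabled: bool, is_quantized: bool):
    # zero.Init allocates weights directly as ZeRO-3 partitions across ranks;
    # transformers skips it for quantized (e.g. 4-bit bitsandbytes) loads.
    if zero3_enabled and not is_quantized:
        return deepspeed.zero.Init(config_dict_or_path="ds_config/3.json")
    return contextlib.nullcontext()  # plain init; no sharding at load time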

Taiinguyenn139 · Jun 10 '24

> (1) In my experience, you can run ZeRO-3 with SFTTrainer or Trainer. (2) I don't use accelerate; I use the deepspeed command like this:
>
> deepspeed train.py
>
> You don't need to set HfDeepSpeedConfig.
>
> (3) To be more clear: ZeRO stage 3 won't shard your params because you are using QLoRA, as discussed in this post: https://www.reddit.com/r/LocalLLaMA/comments/1ai5mv3/thoughts_on_qlora_with_fsdp/. It just offloads your params to the CPU. So is_zero_init_model always being False may be the expected behavior.

@Taiinguyenn139 Thank you for your help! I'm using SFTTrainer, so I think I don't need HfDeepSpeedConfig. Also, running my original code with "accelerate launch --config_file "config/z3_3.yaml" --num_processes 1 ft_acc.py" is entirely equivalent to "deepspeed ft_acc.py" with deepspeed="config_path" added to TrainingArguments. Based on the link you provided, I tried zero3+LoRA instead of zero3+QLoRA (just removed bnb_config = BitsAndBytesConfig(...)), and it magically worked! Parameters are first sharded and then loaded to each GPU! It looks like zero3_init doesn't support QLoRA, but apart from that link I haven't found any other information about this. Maybe I'll open another issue to ask about it. I'd truly appreciate any other info that could help!
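
(Concretely, the fix amounts to dropping the quantized load; a minimal sketch, reusing names from the repro above:)

# No quantization_config here, so from_pretrained can enter zero.Init and
# the 13B weights are sharded across ranks at load time (LoRA, not QLoRA).
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
)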

CHNRyan · Jun 11 '24

@Taiinguyenn139, thanks for helping to resolve this issue.

Closing this issue.

tjruwase · Aug 03 '24