Applying ZeRO-3 with LoRA shows empty LoRA weights (shape [0])

Open jiangxinke opened this issue 8 months ago • 1 comments

System Info

accelerate 1.6.0 peft 0.15.0 transformers 4.51.3 deepspeed 0.16.5

Information

The official example scripts

My own modified scripts

Tasks

An officially supported task in the examples folder

My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from accelerate import Accelerator
import torch
from torch.utils.data import Dataset, DataLoader


class DummyDataset(Dataset):
    def __init__(self, tokenizer, dummy_text="Hello, world!", num_samples=100):
        self.tokenizer = tokenizer
        self.dummy_text = dummy_text
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        encoded = self.tokenizer(self.dummy_text, return_tensors="pt")
        item = {key: val.squeeze(0) for key, val in encoded.items()}
        return item


accelerator = Accelerator()

model_name = "/home/clouduser/jxk/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
dummy_dataset = DummyDataset(tokenizer, dummy_text="Hello, world!", num_samples=100)
dataloader = DataLoader(dummy_dataset, batch_size=4, shuffle=True)


print("++++" * 100)
policy_state_dict = model.state_dict()
for key, value in policy_state_dict.items():
    if "lora_A" in key or "lora_B" in key:
        print(f"{key}: {value.shape}")
print("++++" * 100)
print("====" * 100)
print("====" * 100)
print("====" * 100)
print("====" * 100)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)


print("++++" * 100)
policy_state_dict = model.state_dict()
for key, value in policy_state_dict.items():
    if "lora_A" in key or "lora_B" in key:
        print(f"{key}: {value.shape}")
print("++++" * 100)

The printed results (lora weight) are:

Before using zero 3:

base_model.model.model.layers.20.self_attn.q_proj.lora_A.default.weight: torch.Size([8, 1536])
base_model.model.model.layers.20.self_attn.q_proj.lora_B.default.weight: torch.Size([1536, 8])
base_model.model.model.layers.20.self_attn.v_proj.lora_A.default.weight: torch.Size([8, 1536])
base_model.model.model.layers.20.self_attn.v_proj.lora_B.default.weight: torch.Size([256, 8])
base_model.model.model.layers.21.self_attn.q_proj.lora_A.default.weight: torch.Size([8, 1536])
base_model.model.model.layers.21.self_attn.q_proj.lora_B.default.weight: torch.Size([1536, 8])
base_model.model.model.layers.21.self_attn.v_proj.lora_A.default.weight: torch.Size([8, 1536])

After using zero 3:

module.base_model.model.model.layers.21.self_attn.q_proj.lora_A.default.weight: torch.Size([0])
module.base_model.model.model.layers.21.self_attn.q_proj.lora_B.default.weight: torch.Size([0])
module.base_model.model.model.layers.21.self_attn.v_proj.lora_A.default.weight: torch.Size([0])
module.base_model.model.model.layers.21.self_attn.v_proj.lora_B.default.weight: torch.Size([0])
module.base_model.model.model.layers.22.self_attn.q_proj.lora_A.default.weight: torch.Size([0])
module.base_model.model.model.layers.22.self_attn.q_proj.lora_B.default.weight: torch.Size([0])
module.base_model.model.model.layers.22.self_attn.v_proj.lora_A.default.weight: torch.Size([0])
module.base_model.model.model.layers.22.self_attn.v_proj.lora_B.default.weight: torch.Size([0])
module.base_model.model.model.layers.23.self_attn.q_proj.lora_A.default.weight: torch.Size([0])

This is my zero-stage config file:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

My model is:

  name: "Qwen/Qwen/Qwen1.5-0.5B-Chat"
  # name: "Qwen/Qwen2.5-7B-Instruct"
  # name: "Qwen/Qwen2.5-32B-Instruct"
  # name: "Qwen/Qwen2.5-14B-Instruct"
  # name: "internlm/internlm2_5-1_8b"
  # name: "meta-llama/Llama-3.1-8B-Instruct"

This is my lora config:

lora_config:
  r: 8
  lora_alpha: 32
  target_modules:
    - "q_proj"    # qwen
    - "v_proj"    # qwen
  lora_dropout: 0.1
  bias: "none"
  task_type: "CAUSAL_LM"

Expected behavior

After combining LoRA with DeepSpeed ZeRO-3, I found that the LoRA weights change to shape [0]. With ZeRO-2 I do not encounter this problem. Can you help me?

jiangxinke, Apr 16 '25 11:04

Hi,

After applying ZeRO Stage 3, you have to gather all the sharded parameters back before you can analyze the weights. With ZeRO Stage 3, model.state_dict() only queries what the local process/partition holds for each parameter, NOT the entire model, which is why every LoRA weight shows up as torch.Size([0]).
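If only the shapes are of interest, a full gather may not be needed. The sketch below assumes that DeepSpeed ZeRO-3 records each partitioned parameter's original shape in a `ds_shape` attribute (to my knowledge it does, but treat this as an assumption and verify it against your DeepSpeed version). The helper name `logical_lora_shapes` is hypothetical, and the demo uses stand-in objects instead of a real model so that it runs anywhere.

```python
from types import SimpleNamespace


def logical_lora_shapes(named_parameters):
    """Map LoRA parameter names to their full (unpartitioned) shapes.

    Falls back to the ordinary `.shape` when a parameter has no `ds_shape`
    attribute, i.e. when it was not partitioned by ZeRO-3.
    """
    shapes = {}
    for name, param in named_parameters:
        if "lora_A" in name or "lora_B" in name:
            # Assumption: ZeRO-3 keeps the original shape in `ds_shape`.
            shapes[name] = tuple(getattr(param, "ds_shape", param.shape))
    return shapes


# Stand-ins that mimic what the issue reports: local shape [0] after ZeRO-3.
prefix = "module.base_model.model.model.layers.21.self_attn.q_proj."
demo_params = [
    (prefix + "lora_A.default.weight", SimpleNamespace(shape=(0,), ds_shape=(8, 1536))),
    (prefix + "lora_B.default.weight", SimpleNamespace(shape=(0,), ds_shape=(1536, 8))),
    (prefix + "base_layer.weight", SimpleNamespace(shape=(0,), ds_shape=(1536, 1536))),
]

demo_shapes = logical_lora_shapes(demo_params)
for key, shape in demo_shapes.items():
    print(f"{key}: {shape}")
```

With a real prepared model the call would be `logical_lora_shapes(model.named_parameters())`. Note that it has to go through `named_parameters()` and not `state_dict()`, since the extra attribute lives on the parameter objects.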

To inspect the full weights, use the following to reconstruct the state_dict (this only retrieves the state_dict; the model itself stays sharded) and query the parameters:

Change this:

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
policy_state_dict = model.state_dict()
for key, value in policy_state_dict.items():

to this:

full_state_dict = accelerator.get_state_dict(model)
for key, value in full_state_dict.items():
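A minimal sketch of the filtering step on the gathered state dict follows. The helper `lora_shapes_from_state_dict` is a hypothetical name, and the demo dictionary is a stand-in so the snippet runs without a GPU. Two assumptions to check against your setup: the gather may hand the consolidated dictionary only to the main process (hence the `None` guard), and `accelerator.get_state_dict(model)` under ZeRO-3 appears to rely on `zero3_save_16bit_model: true`, which the config in this issue already sets.

```python
from types import SimpleNamespace


def lora_shapes_from_state_dict(state_dict):
    """Return {key: shape} for the LoRA entries of a gathered state dict.

    Accepts None so it can be called on every rank, in case only the
    main process receives the consolidated dictionary (an assumption).
    """
    if state_dict is None:
        return {}
    return {
        key: tuple(value.shape)
        for key, value in state_dict.items()
        if "lora_A" in key or "lora_B" in key
    }


# Stand-in for the result of accelerator.get_state_dict(model).
base = "base_model.model.model.layers.20.self_attn.v_proj."
demo_state_dict = {
    base + "lora_A.default.weight": SimpleNamespace(shape=(8, 1536)),
    base + "lora_B.default.weight": SimpleNamespace(shape=(256, 8)),
    base + "base_layer.weight": SimpleNamespace(shape=(256, 1536)),
}

gathered_shapes = lora_shapes_from_state_dict(demo_state_dict)
for key, shape in gathered_shapes.items():
    print(f"{key}: {shape}")
```

In the reproduction script this would be used as `lora_shapes_from_state_dict(accelerator.get_state_dict(model))`, called on all processes and not only on rank 0, since gathering sharded parameters is generally a collective operation.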

cc. @tjruwase do you think otherwise?

therealnaveenkamal, May 11 '25 17:05