When using DeepSpeed's ZeRO-3 configuration, target_parameters cannot obtain the shape of the parameters
System Info
- peft: installed from source (pip install git+https://github.com/huggingface/peft)
- accelerate: 1.7.0
- transformers: 4.57.0
- platform: Ubuntu 24.04.2 LTS
- python: 3.11.11
- deepspeed: 0.16.4
Who can help?
No response
Reproduction
When fine-tuning certain fused MoE models with LoRA, I need to use the 'target_parameters' option. This method retrieves the targeted parameter: https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py#L2070 However, because DeepSpeed's ZeRO-3 configuration is used, the value returned by this method is an empty tensor. Some similar issues that occurred due to ZeRO-3: #2500, #2603
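For context, here is a minimal sketch of the underlying ZeRO-3 behaviour (assuming a model already wrapped by DeepSpeed with zero_stage=3; module and name are placeholders): the local parameter is only an empty placeholder, and it has its full shape only while it is explicitly gathered, for example with deepspeed.zero.GatheredParameters:

import deepspeed

def inspect_param(module, name):
    # Under ZeRO-3 the parameter is partitioned across ranks; the local tensor
    # is a placeholder with shape torch.Size([0]), while the full shape is
    # tracked in the DeepSpeed attribute ds_shape.
    param = getattr(module, name)
    print("local shape:", param.shape, "| ds_shape:", getattr(param, "ds_shape", None))

    # Inside this context the full parameter is temporarily materialized on
    # every rank, so its real shape (and data) can be read.
    with deepspeed.zero.GatheredParameters([param]):
        print("gathered shape:", param.shape)

    # After the context exits, the parameter is re-partitioned and empty again.
    print("after context:", param.shape)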
Expected behavior
I hope this problem can be resolved.
Thanks for the report. If possible, could you please try out one thing: Go into the PEFT source code and change the method like so:
def get_param(self):
    param = getattr(self.get_base_layer(), self.parameter_name)
    from peft.utils.integrations import gather_params_ctx
    with gather_params_ctx(param):
        return param
Thank you, @BenjaminBossan.
After running the code you provided, I still encountered an error.
This is because inside the with gather_params_ctx(param): block the param can be accessed, but the param that is returned is still an empty tensor.
In this line of code, the param is still an empty tensor:
https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py#L1927
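To illustrate the timing issue: the return statement hands back a reference to the original parameter, which DeepSpeed re-partitions as soon as the context manager exits, so the caller again sees an empty tensor. A minimal sketch of a hypothetical variant (for illustration only, not the actual PEFT implementation) that copies the data while the parameter is still gathered:

def get_param_copy(self):
    # Hypothetical helper, for illustration; not part of PEFT.
    from peft.utils.integrations import gather_params_ctx

    param = getattr(self.get_base_layer(), self.parameter_name)
    with gather_params_ctx(param):
        # The copy is taken while the full tensor is materialized, so it keeps
        # its real shape after the original parameter is re-partitioned when
        # the context exits. A detached copy is only useful for inspecting the
        # shape or values, not for training through the parameter.
        return param.detach().clone()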
@qiu-xiao-0330 I tried to reproduce the issue but was unsuccessful so far. Could you please tell me what model you're trying to train?
@BenjaminBossan, I am currently training the Qwen3-VL-235B-A22B-Thinking model.
Thanks for the info @qiu-xiao-0330. Unfortunately, that model is too big for me to train (same with 30B), so I tried a smaller variant of the model, "yujiepan/qwen3-vl-moe-tiny-random". For the targets, I chose target_parameters=["experts.gate_up_proj", "experts.down_proj"]. This worked for me. Any idea what's different in your case?
Full code:
import argparse
import os
import tempfile
from typing import Literal

import torch
from accelerate import PartialState
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Qwen3VLMoeForConditionalGeneration,
    Trainer,
    TrainingArguments,
)

from peft import LoraConfig, get_peft_model


def print_if_process_zero(*args, **kwargs):
    PartialState().print(*args, **kwargs)


def main(model_id: str, quant: Literal["4bit", "8bit"] | None = None, target_modules: list[str] | None = None):
    if target_modules == ["all-linear"]:
        target_modules = "all-linear"

    data = load_dataset("ybelkada/english_quotes_copy")

    if quant == "4bit":
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype="bfloat16",
            bnb_4bit_quant_storage="bfloat16",
            bnb_4bit_use_double_quant=True,
        )
    elif quant == "8bit":
        quant_config = BitsAndBytesConfig(load_in_8bit=True)
    elif quant is None:
        quant_config = None
    else:
        raise ValueError(f"Unsupported quantization: {quant}, expected one of '4bit', '8bit', or None")

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-30B-A3B-Instruct")
    if not tokenizer.pad_token:
        tokenizer.pad_token = tokenizer.eos_token

    if "qwen3" in model_id.lower():
        model_cls = Qwen3VLMoeForConditionalGeneration
    else:
        model_cls = AutoModelForCausalLM

    model = model_cls.from_pretrained(
        model_id, quantization_config=quant_config, dtype=torch.bfloat16, device_map={"": PartialState().process_index}
    )
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=[],
        bias="none",
        task_type="CAUSAL_LM",
        target_parameters=["experts.gate_up_proj", "experts.down_proj"],
    )
    model = get_peft_model(model, peft_config)
    print_if_process_zero(model)
    if PartialState().is_local_main_process:
        model.print_trainable_parameters()

    data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

    with tempfile.TemporaryDirectory() as tmp_dir:
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            optimizer_cls_and_kwargs=(torch.optim.SGD, {"lr": 2e-4}),
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=2,
                warmup_steps=2,
                max_steps=25,
                learning_rate=2e-4,
                bf16=True,
                logging_steps=5,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        trainer.train()

        if trainer.is_fsdp_enabled:
            trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
        trainer.save_model(tmp_dir)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    model_id_ = "yujiepan/qwen3-vl-moe-tiny-random"
    parser.add_argument("--model_id", type=str, required=False, default=model_id_)
    parser.add_argument("--quant", type=str, choices=["4bit", "8bit"], required=False, default=None)
    parser.add_argument(
        "--target_modules",
        type=str,
        nargs="+",
        required=False,
        default=None,
        help="List of target modules for LoRA adaptation",
    )
    args = parser.parse_args()
    main(model_id=args.model_id, quant=args.quant, target_modules=args.target_modules)
The config:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 2
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Launched with:
accelerate launch --config_file deepspeed_config.yaml train.py
Using the latest PEFT and transformers versions installed from source, PyTorch 2.9.
Thank you for your reply and your code, @BenjaminBossan. I used your code to run it on 4 H100 PCIe GPUs, and it worked successfully. But there are still several issues:
1. When I used the 30B model, I found that tensor parallelism was not enabled. If it were enabled, the memory usage per GPU should have been around 20 GB. However, the actual memory usage was 80 GB (with a batch size of 1).
2. When modifying the num_processes parameter in the DeepSpeed config file to 4, an error occurred: disagreement between rank0 and rank2.
3. When modifying the num_processes parameter in the DeepSpeed config file to 2, training gets stuck at the first step.
4. The model yujiepan/qwen3-vl-moe-tiny-random can be trained normally, but the 30B model cannot be trained properly.
Package versions:
- transformers: 5.0.0.dev0
- peft: 0.17.2.dev0
- accelerate: 1.7.0
- deepspeed: 0.16.4
I did not apply the following change, because the current error does not seem to be caused by it:
def get_param(self):
    param = getattr(self.get_base_layer(), self.parameter_name)
    from peft.utils.integrations import gather_params_ctx
    with gather_params_ctx(param):
        return param
1. When I used the 30B model, I found that tensor parallelism was not enabled.
We haven't checked if/how PEFT targeting layers with tensor parallelism works. Could you ideally share some minimal code to reproduce so that we can look into this issue?
2. When modifying the num_processes parameter in the DeepSpeed config file to 4, an error occurred: disagreement between rank0 and rank2.
Could you please share the whole stack trace?
3. When modifying the num_processes parameter in the DeepSpeed config file to 2, training gets stuck at the first step.
Again, we need a reproducer to check this further.
4. The model yujiepan/qwen3-vl-moe-tiny-random can be trained normally, but the 30B model cannot be trained properly.
Okay, so you tried with Qwen/Qwen3-VL-30B-A3B-Thinking? Earlier you mentioned that you were using the 235B model.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.