When using DeepSpeed's ZeRO-3 configuration, target_parameters cannot obtain the shape of the parameters
System Info
- peft: installed from source (pip install git+https://github.com/huggingface/peft)
- accelerate: 1.7.0
- transformers: 4.57.0
- platform: Ubuntu 24.04.2 LTS
- python: 3.11.11
- deepspeed: 0.16.4
Who can help?
No response
Reproduction
When fine-tuning certain fused MoE models with LoRA, I need to use the 'target_parameters' option. This method retrieves the targeted parameter: https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py#L2070 However, because DeepSpeed's ZeRO-3 configuration is used, the value returned by this method is an empty tensor. Some similar issues that occurred due to ZeRO-3: #2500, #2603
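For context, here is a minimal sketch of the underlying ZeRO-3 behaviour (assuming a model already wrapped by DeepSpeed with zero_stage=3; module and name are placeholders): the local parameter is only an empty placeholder, and it has its full shape only while it is explicitly gathered, for example with deepspeed.zero.GatheredParameters:

import deepspeed

def inspect_param(module, name):
    # Under ZeRO-3 the parameter is partitioned across ranks; the local tensor
    # is a placeholder with shape torch.Size([0]), while the full shape is
    # tracked in the DeepSpeed attribute ds_shape.
    param = getattr(module, name)
    print("local shape:", param.shape, "| ds_shape:", getattr(param, "ds_shape", None))

    # Inside this context the full parameter is temporarily materialized on
    # every rank, so its real shape (and data) can be read.
    with deepspeed.zero.GatheredParameters([param]):
        print("gathered shape:", param.shape)

    # After the context exits, the parameter is re-partitioned and empty again.
    print("after context:", param.shape)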
Expected behavior
I hope this problem can be resolved.
Thanks for the report. If possible, could you please try out one thing: Go into the PEFT source code and change the method like so:
def get_param(self):
    param = getattr(self.get_base_layer(), self.parameter_name)
    from peft.utils.integrations import gather_params_ctx
    with gather_params_ctx(param):
        return param
Thank you, @BenjaminBossan.
After running the code you provided, I still encountered an error.
This is because inside the with gather_params_ctx(param): block the param can be accessed, but the param that is returned is still an empty tensor.
In this line of code, the param is still an empty tensor:
https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py#L1927
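To illustrate the timing issue: the return statement hands back a reference to the original parameter, which DeepSpeed re-partitions as soon as the context manager exits, so the caller again sees an empty tensor. A minimal sketch of a hypothetical variant (for illustration only, not the actual PEFT implementation) that copies the data while the parameter is still gathered:

def get_param_copy(self):
    # Hypothetical helper, for illustration; not part of PEFT.
    from peft.utils.integrations import gather_params_ctx

    param = getattr(self.get_base_layer(), self.parameter_name)
    with gather_params_ctx(param):
        # The copy is taken while the full tensor is materialized, so it keeps
        # its real shape after the original parameter is re-partitioned when
        # the context exits. A detached copy is only useful for inspecting the
        # shape or values, not for training through the parameter.
        return param.detach().clone()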
@qiu-xiao-0330 I tried to reproduce the issue but was unsuccessful so far. Could you please tell me what model you're trying to train?
@BenjaminBossan, I am currently training the Qwen3-VL-235B-A22B-Thinking model.
Thanks for the info @qiu-xiao-0330. Unfortunately, that model is too big for me to train (same with 30B), so I tried a smaller variant of the model, "yujiepan/qwen3-vl-moe-tiny-random". For the targets, I chose target_parameters=["experts.gate_up_proj", "experts.down_proj"]. This worked for me. Any idea what's different in your case?
Full code:
import argparse
import os
import tempfile
from typing import Literal

import torch
from accelerate import PartialState
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Qwen3VLMoeForConditionalGeneration,
    Trainer,
    TrainingArguments,
)

from peft import LoraConfig, get_peft_model


def print_if_process_zero(*args, **kwargs):
    PartialState().print(*args, **kwargs)


def main(model_id: str, quant: Literal["4bit", "8bit"] | None = None, target_modules: list[str] | None = None):
    if target_modules == ["all-linear"]:
        target_modules = "all-linear"

    data = load_dataset("ybelkada/english_quotes_copy")

    if quant == "4bit":
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype="bfloat16",
            bnb_4bit_quant_storage="bfloat16",
            bnb_4bit_use_double_quant=True,
        )
    elif quant == "8bit":
        quant_config = BitsAndBytesConfig(load_in_8bit=True)
    elif quant is None:
        quant_config = None
    else:
        raise ValueError(f"Unsupported quantization: {quant}, expected one of '4bit', '8bit', or None")

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-30B-A3B-Instruct")
    if not tokenizer.pad_token:
        tokenizer.pad_token = tokenizer.eos_token

    if "qwen3" in model_id.lower():
        model_cls = Qwen3VLMoeForConditionalGeneration
    else:
        model_cls = AutoModelForCausalLM

    model = model_cls.from_pretrained(
        model_id, quantization_config=quant_config, dtype=torch.bfloat16, device_map={"": PartialState().process_index}
    )
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=[],
        bias="none",
        task_type="CAUSAL_LM",
        target_parameters=["experts.gate_up_proj", "experts.down_proj"],
    )
    model = get_peft_model(model, peft_config)
    print_if_process_zero(model)
    if PartialState().is_local_main_process:
        model.print_trainable_parameters()

    data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

    with tempfile.TemporaryDirectory() as tmp_dir:
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            optimizer_cls_and_kwargs=(torch.optim.SGD, {"lr": 2e-4}),
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=2,
                warmup_steps=2,
                max_steps=25,
                learning_rate=2e-4,
                bf16=True,
                logging_steps=5,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        trainer.train()

        if trainer.is_fsdp_enabled:
            trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
        trainer.save_model(tmp_dir)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    model_id_ = "yujiepan/qwen3-vl-moe-tiny-random"
    parser.add_argument("--model_id", type=str, required=False, default=model_id_)
    parser.add_argument("--quant", type=str, choices=["4bit", "8bit"], required=False, default=None)
    parser.add_argument(
        "--target_modules",
        type=str,
        nargs="+",
        required=False,
        default=None,
        help="List of target modules for LoRA adaptation",
    )
    args = parser.parse_args()
    main(model_id=args.model_id, quant=args.quant, target_modules=args.target_modules)
The config:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 2
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Launched with:
accelerate launch --config_file deepspeed_config.yaml train.py
Using the latest PEFT and transformers versions installed from source, PyTorch 2.9.
Thank you for your reply and your code, @BenjaminBossan. I used your code to run it on 4 H100 PCIe GPUs, and it worked successfully. But there are still several issues:
1. When I used the 30B model, I found that tensor parallelism was not enabled. If it were enabled, the memory usage per GPU should have been around 20 GB. However, the actual memory usage was 80 GB (with a batch size of 1).
2. When modifying the num_processes parameter in the DeepSpeed config file to 4, an error occurred: disagreement between rank0 and rank2.
3. When modifying the num_processes parameter in the DeepSpeed config file to 2, training gets stuck at the first step.
4. The model yujiepan/qwen3-vl-moe-tiny-random can be trained normally, but the 30B model cannot be trained properly.
Package versions:
- transformers: 5.0.0.dev0
- peft: 0.17.2.dev0
- accelerate: 1.7.0
- deepspeed: 0.16.4
I did not apply the following change, because the current error does not seem to be caused by it:
def get_param(self):
    param = getattr(self.get_base_layer(), self.parameter_name)
    from peft.utils.integrations import gather_params_ctx
    with gather_params_ctx(param):
        return param
1. When I used the 30B model, I found that tensor parallelism was not enabled.
We haven't checked if/how PEFT targeting layers with tensor parallelism works. Could you ideally share some minimal code to reproduce so that we can look into this issue?
2. When modifying the num_processes parameter in the DeepSpeed config file to 4, an error occurred: disagreement between rank0 and rank2.
Could you please share the whole stack trace?
3. When modifying the num_processes parameter in the DeepSpeed config file to 2, training gets stuck at the first step.
Again, we need a reproducer to check this further.
4. The model yujiepan/qwen3-vl-moe-tiny-random can be trained normally, but the 30B model cannot be trained properly.
Okay, so you tried with Qwen/Qwen3-VL-30B-A3B-Thinking? Earlier you mentioned that you were using the 235B model.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.