
LoRA layer param.grad=None after loss.backward() using lora for mamba 2.8B

jinlHe opened this issue 1 year ago · 3 comments

I got exactly the same error as the author @mahdip72.

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). Parameter indices which did not receive grad for rank 1: 36 37 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error


But on a single GPU there is no problem and it works fine!
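
For reference, this is where the `find_unused_parameters` flag from the error message would go; a minimal sketch with a placeholder model, not my actual training code. It only makes DDP tolerate unused parameters, it does not make the LoRA layers train:

```python
# Minimal sketch of where the flag suggested by the error message goes.
# The tiny model and process-group setup are placeholders, not my training code.
# Launch with: torchrun --nproc_per_node=2 ddp_sketch.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(8, 8).to(local_rank)  # stand-in for the PEFT-wrapped model
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    # Lets DDP tolerate parameters that receive no gradient in a step;
    # it does not make the unused LoRA layers train.
    find_unused_parameters=True,
)
```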

### Expected behavior
I expect LoRA to work on both single and multiple GPUs, but it raises an error when using multiple GPUs.

I tried LoRA on the Mamba-based Cobra (an open-source VLM trained on mamba-2.8b-zephyr), loaded with mamba-2.8b-zephyr weights. Here's the code I used to initialize the LLM:

import logging

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# MambaLLMBackbone comes from the Cobra codebase.

def init_llm(cls, llm_backbone_id, llm_max_length, low_resource=False, low_res_device=0, lora_r=0,
             lora_target_modules=['x_proj', 'embeddings', 'in_proj', 'out_proj'], **lora_kargs):
    logging.info('Loading Mamba')
    mamba_model = MambaLLMBackbone(
        llm_backbone_id=llm_backbone_id,
        llm_max_length=llm_max_length
    )
    mamba_tokenizer = mamba_model.get_tokenizer()
    if lora_r > 0:
        mamba_model = prepare_model_for_kbit_training(mamba_model)
        lora_config = LoraConfig(
            r=lora_r,
            bias="none",
            task_type="CAUSAL_LM",
            target_modules=lora_target_modules,
            **lora_kargs
        )
        mamba_model = get_peft_model(mamba_model, lora_config)
        mamba_model.print_trainable_parameters()
        # for name, param in mamba_model.named_parameters():
        #     param.requires_grad = False
        #     if any(module in name for module in lora_target_modules):
        #         param.requires_grad = True
    else:
        for name, param in mamba_model.named_parameters():
            param.requires_grad = False

    # for name, param in mamba_model.named_parameters():
    #     if param.requires_grad:
    #         print(name)
    logging.info('Loading Mamba Done')
    return mamba_model, mamba_tokenizer

However, after loss.backward(), the grad of all LoRA layers is None, which also leads to this problem, and I have not found the reason:

with torch.cuda.amp.autocast(enabled=use_amp):
    loss, loss_dict = self.train_step(model=model, samples=samples)
    loss /= accum_grad_iters  # TODO: not affect loss_dict values for logging

# after_train_step()
if use_amp:
    scaler.scale(loss).backward()
else:
    loss.backward()

for name, param in model.named_parameters():
    if param.requires_grad and param.grad is None:
        print(f"{name} requires_grad and not used forward pass.")

Printed output (part of it):

mamba_model.base_model.model.llm.backbone.layers.0.mixer.in_proj.lora_A.default.weight requires_grad and not used forward pass.
mamba_model.base_model.model.llm.backbone.layers.0.mixer.in_proj.lora_B.default.weight requires_grad and not used forward pass.
mamba_model.base_model.model.llm.backbone.layers.0.mixer.x_proj.lora_A.default.weight requires_grad and not used forward pass.
mamba_model.base_model.model.llm.backbone.layers.0.mixer.x_proj.lora_B.default.weight requires_grad and not used forward pass.
mamba_model.base_model.model.llm.backbone.layers.0.mixer.out_proj.lora_A.default.weight requires_grad and not used forward pass.
mamba_model.base_model.model.llm.backbone.layers.0.mixer.out_proj.lora_B.default.weight requires_grad and not used forward pass.

I checked the whole model, including the visual part, by printing param.requires_grad (the model is similar to Cobra: a visual encoder plus a Mamba LLM). param.requires_grad is set correctly across the whole model, yet the LoRA parameters end up with grad=None.

Manually setting requires_grad=False as the authors did won't throw an error, but it freezes the whole mamba LLM and makes LoRA useless. Mamba's official documentation indicates that LoRA can be used, but I didn't use LoRA the way they show. This problem with my code still confuses me and I don't know what the cause is; I would appreciate it if someone could help me solve it.

Originally posted by @jinlHe in https://github.com/huggingface/peft/issues/951#issuecomment-2029891217

jinlHe · Apr 01 '24 15:04

So I'm assuming that you are training with DDP. DDP requires that all model parameters that require gradients participate in the loss calculation. For some reason, it appears that the LoRA layers are not part of the computational graph, resulting in the error that you reported.

Now why do these layers not participate? It's hard to tell from the given information. My suspicion is that PEFT LoRA just does not work with Cobra; the exact reason is not clear to me. As the code you provided is not sufficient to reproduce the issue, and as this involves installing some smaller packages that I would have to vet, I cannot debug this any further at the moment.

If you want to debug this further, here are some pointers:

  1. Could you try training this without DDP? Probably the error will go away, but are the LoRA weights updated? My guess is no, but it's important to check this.
  2. IIUC, this should use the peft.tuners.lora.Linear layer. Could you add a breakpoint to its forward method and see if it is reached at all? If yes, what code path is taken? For debugging this, you also have to train without DDP. A rough sketch of both checks is shown below.
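
Something along these lines could serve as a quick single-GPU probe (a sketch only: `model` stands for your PEFT-wrapped model, and the actual training step is elided):

```python
# Rough single-GPU probe for the two points above; `model` is assumed to be the
# PEFT-wrapped model and the actual training step is elided.
import torch
from peft.tuners.lora import Linear as LoraLinear


def register_lora_forward_probes(model):
    """Count how often each peft LoRA Linear layer's forward is actually called."""
    hits = {}
    for name, module in model.named_modules():
        if isinstance(module, LoraLinear):
            hits[name] = 0
            module.register_forward_hook(
                lambda mod, inp, out, key=name: hits.__setitem__(key, hits[key] + 1)
            )
    return hits


hits = register_lora_forward_probes(model)
before = {n: p.detach().clone() for n, p in model.named_parameters() if "lora_B" in n}

# ... run one forward / backward / optimizer.step() without DDP here ...

print("LoRA layers whose forward was never reached:",
      [k for k, v in hits.items() if v == 0])
print("LoRA B matrices that changed after the step:",
      [n for n, p in model.named_parameters()
       if "lora_B" in n and not torch.equal(before[n], p.detach())])
```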

> Manually setting requires_grad=False as the authors did won't throw an error, but it freezes the whole mamba LLM and makes LoRA useless.

Indeed, as you wrote, this would not be a solution to your issue.

> Mamba's official documentation indicates that LoRA can be used, but I didn't use LoRA the way they show.

This is only partially helpful, as that is a different model. Still, what exactly are the changes you made compared to the suggestion?
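
For comparison, a plain PEFT-plus-Transformers setup for a Mamba causal LM would look roughly like the sketch below (not the exact snippet from the Mamba docs; the checkpoint id and hyperparameters are placeholders):

```python
# Rough sketch of a plain PEFT setup for a Mamba causal LM, for comparison with
# the custom wrapper above. Checkpoint id and hyperparameters are placeholders,
# not the exact recipe from the Mamba docs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "state-spaces/mamba-2.8b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=8,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["x_proj", "embeddings", "in_proj", "out_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # the LoRA layers should show up as trainable
```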

BenjaminBossan · Apr 02 '24 13:04

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] · May 02 '24 15:05

I have exactly the same problem: when I print param.requires_grad it is True, but param.grad is None. Are you using F.multi_head_attention_forward()? In my case, my foundation model uses F.multi_head_attention_forward(), which does not call self.q_proj itself but only reads self.q_proj.weight and self.v_proj.bias. I initialized self.q_proj = lora.Linear(self.kdim, embed_dim, bias=bias, r=self.rank) properly, but F.multi_head_attention_forward() never touches lora_A and lora_B, so my LoRA layers cannot learn. I was able to solve the problem by not using F.multi_head_attention_forward(). Please check your forward method.
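
To illustrate what I mean with a toy example (not my actual attention code; the module name and shapes are made up): calling the wrapped module goes through the LoRA branch, while reading `.weight` directly bypasses it, so the LoRA weights never receive a gradient:

```python
# Toy illustration of the bypass described above; not the real attention code.
# The model, module name, and shapes are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model


class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(16, 16, bias=False)

    def forward(self, x, bypass=False):
        if bypass:
            # Like F.multi_head_attention_forward: only the raw weight tensor is
            # used, so the lora_A / lora_B branch never runs.
            return F.linear(x, self.q_proj.weight)
        # Calling the module itself goes through peft's lora.Linear.forward,
        # which adds the lora_A / lora_B contribution.
        return self.q_proj(x)


model = get_peft_model(TinyModel(), LoraConfig(r=8, target_modules=["q_proj"]))
lora_B = model.base_model.model.q_proj.lora_B["default"].weight
x = torch.randn(4, 16, requires_grad=True)

model(x, bypass=True).sum().backward()
print(lora_B.grad)  # None: the LoRA weights were not part of the graph

model.zero_grad()
model(x, bypass=False).sum().backward()
print(lora_B.grad is not None)  # True: gradients reach the LoRA weights
```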

planetes00 · May 07 '24 09:05
