
No support for LoRA trained with AutoGPTQ/PEFT?

laoda513 opened this issue 2 years ago • 5 comments

Sorry, I'm a little confused. It seems that this project is unable to load LoRAs trained with the AutoGPTQ project, but it can load LoRAs trained with alpaca-lora-4bit.

Here's my attempt to summarize:

gptq-for-llama can load LoRAs and, combined with alpaca-lora-4bit, can train them, but it requires applying a monkey patch, and the LoRA has its own format.

auto-gptq can load LoRAs and trains them with its own PEFT-compatible wrapper.

exllama can load LoRAs, but only ones trained with alpaca-lora-4bit.

Is that correct?

I really hope AutoGPTQ-trained LoRAs can be supported as well. It would greatly reduce the dependencies between different projects.

laoda513 avatar Jun 27 '23 02:06 laoda513

I'm not familiar with the format that AutoGPTQ produces LoRAs in. Whether it's supported or not depends on what the resulting tensors look like. If they're FP16 and they target linear layers, they should work. If they're FP32 ExLlama will just downcast to FP16 and they should still work. But if AutoGPTQ outputs a quantized LoRA or reshapes the output in some way that deviates from PEFT, idk. I'd have to add support for that, but I need some examples and I don't have time to train my own adapters at the moment.

turboderp avatar Jun 27 '23 11:06 turboderp
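
Editor's note: for anyone who wants to check whether an adapter meets those criteria (FP16/FP32 tensors targeting plain linear layers), here is a minimal sketch, not part of either project, that inspects a PEFT adapter file. The file name is a placeholder.

```python
# Minimal sketch (not from either project): inspect a PEFT adapter to see
# which modules it targets and whether the tensors are FP16 or FP32.
# "adapter_model.bin" is a placeholder path.
import torch

adapter = torch.load("adapter_model.bin", map_location="cpu")

for name, tensor in adapter.items():
    # ExLlama expects plain lora_A / lora_B tensors in FP16 (FP32 is downcast
    # on load) that target regular linear layers such as q_proj, k_proj, v_proj.
    print(f"{name}: shape={tuple(tensor.shape)}, dtype={tensor.dtype}")
```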


How can I send you a PEFT bin? What if I upload it to Google Drive? Also, I looked at the error message, and it seems that in PEFT the qkv projection is stored as a single tensor, whereas ExLlama expects separate q, k, and v tensors.

laoda513 avatar Jun 27 '23 14:06 laoda513

https://drive.google.com/drive/folders/1jQSPOb9i6QKH4kwmBcG4k71z_VC15hAF?usp=sharing

The 65B model is too large, so I made a 7B LoRA instead.

And the log shows: base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.weight

Error: unsupported layer in loras/7b__qLORA_adapter2/adapter_model.bin: base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.weight

laoda513 avatar Jun 27 '23 14:06 laoda513

It seems fused attention is the root cause. If the model was loaded with fused attention enabled, q_proj, k_proj, and v_proj are combined into a single qkv_proj.

laoda513 avatar Jun 28 '23 02:06 laoda513

Yes. There isn't an easy fix for this except attempting to convert those LoRAs back to the regular non-fused format. I don't know if I'll have time for that.

turboderp avatar Jun 28 '23 08:06 turboderp
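
Editor's note: as a rough illustration of what "converting back" could look like, the sketch below splits a fused qkv_proj LoRA into separate q_proj/k_proj/v_proj entries. Since the fused layer shares one lora_A and stacks the q/k/v outputs in lora_B, lora_B can in principle be split along its output dimension. This assumes equal q/k/v widths (a LLaMA-style model without grouped-query attention) and placeholder file names; it is untested and not part of either project.

```python
# Rough, untested sketch: split fused qkv_proj LoRA tensors back into
# separate q_proj / k_proj / v_proj entries. Assumes equal q/k/v widths;
# adjust the split sizes otherwise. File names are placeholders.
import torch

fused = torch.load("adapter_model.bin", map_location="cpu")
unfused = {}

for name, tensor in fused.items():
    if ".qkv_proj." not in name:
        unfused[name] = tensor
        continue
    if ".lora_A." in name:
        # lora_A maps hidden_size -> r and is shared by q, k and v.
        for proj in ("q_proj", "k_proj", "v_proj"):
            unfused[name.replace("qkv_proj", proj)] = tensor.clone()
    elif ".lora_B." in name:
        # lora_B maps r -> (q + k + v) outputs; split along the output dim.
        q, k, v = torch.chunk(tensor, 3, dim=0)
        for proj, chunk in (("q_proj", q), ("k_proj", k), ("v_proj", v)):
            unfused[name.replace("qkv_proj", proj)] = chunk.contiguous()

torch.save(unfused, "adapter_model_unfused.bin")
```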

Well, I guess converting back is not a great option. I'm not sure whether it would be feasible or worthwhile for exllama to support the fused format itself, so that people could simply use fused LoRAs.

Anyway, for anyone who runs into this issue, a temporary workaround is to train the LoRA with
inject_fused_attention=False inject_fused_mlp=False

laoda513 avatar Jun 30 '23 04:06 laoda513
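
Editor's note: a concrete sketch of how that workaround might look when loading the quantized model with AutoGPTQ before training. The model path is a placeholder, the training setup is omitted, and whether use_triton is required depends on the AutoGPTQ version.

```python
# Sketch of the workaround above: load the quantized model with AutoGPTQ's
# fused attention/MLP injection disabled, so the trained adapter keeps
# separate q_proj / k_proj / v_proj modules that ExLlama can load.
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "path/to/llama-7b-4bit",       # placeholder model path
    device="cuda:0",
    use_triton=True,               # may be required by AutoGPTQ's PEFT utilities (version-dependent)
    inject_fused_attention=False,  # keep q_proj / k_proj / v_proj separate
    inject_fused_mlp=False,
)
# ...then wrap the model with AutoGPTQ's PEFT-compatible LoRA utilities and train as usual.
```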