exllama
No support for LoRA trained with AutoGPTQ/PEFT?
Sorry, I'm a little confused. It seems that the project is unable to load a LoRA trained with AutoGPTQ, but it can load a LoRA trained with alpaca-lora-4bit.
Here's my attempt to summarize:
gptq-for-llama can load a LoRA and, combined with alpaca-lora-4bit, train one, but it requires applying a monkey patch, and the LoRA has its own format.
auto-gptq can load a LoRA and train one through its self-implemented, PEFT-compatible wrapper.
exllama can load a LoRA, but only one trained with alpaca-lora-4bit.
Is that correct?
I really hope LoRAs trained with auto-gptq can be supported as well. It would greatly reduce the dependencies between different projects.
I'm not familiar with the format that AutoGPTQ produces LoRAs in. Whether it's supported or not depends on what the resulting tensors look like. If they're FP16 and they target linear layers, they should work. If they're FP32 ExLlama will just downcast to FP16 and they should still work. But if AutoGPTQ outputs a quantized LoRA or reshapes the output in some way that deviates from PEFT, idk. I'd have to add support for that, but I need some examples and I don't have time to train my own adapters at the moment.
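For reference, here is a minimal sketch (the adapter path is a placeholder) of how one could inspect a PEFT adapter_model.bin to see whether the tensors match what ExLlama expects, i.e. FP16 or FP32 lora_A/lora_B weights targeting un-fused linear layers:

```python
# Minimal sketch: dump the contents of a PEFT LoRA checkpoint.
# The path is a placeholder; point it at any adapter_model.bin.
import torch

state_dict = torch.load("loras/my_adapter/adapter_model.bin", map_location="cpu")
for name, tensor in state_dict.items():
    # ExLlama should accept FP16 (or FP32, downcast on load) lora_A/lora_B
    # tensors targeting plain projections like q_proj/k_proj/v_proj/o_proj.
    print(name, tuple(tensor.shape), tensor.dtype)
```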
How can I send you a PEFT bin? What if I upload it to Google Drive? Also, I looked at the error message, and it seems that in PEFT the qkv weights are stored as a single tensor, whereas in ExLlama q, k, and v are separate tensors.
https://drive.google.com/drive/folders/1jQSPOb9i6QKH4kwmBcG4k71z_VC15hAF?usp=sharing
The 65B model is too large, so I made a 7B LoRA.
And the log is:
Error: unsupported layer in loras/7b__qLORA_adapter2/adapter_model.bin: base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.weight
It seems fused attention is the root cause. If the model was loaded with inject_fused_attention enabled, q_proj, k_proj, and v_proj are combined into qkv_proj.
Yes. There isn't an easy fix for this except attempting to convert those LoRAs back to the regular non-fused format. I don't know if I'll have time for that.
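If someone wants to try that conversion, here is a rough sketch. It assumes the fused lora_B simply stacks the q, k, and v outputs along dim 0 in that order and that lora_A is shared across all three (plausible for LLaMA-style attention without grouped-query heads). That is an untested assumption about AutoGPTQ's fused layout, not something verified here:

```python
# Rough sketch: split a fused qkv_proj LoRA back into separate q/k/v entries.
# Assumes lora_A is shared and lora_B stacks q, k, v outputs along dim 0.
import torch

src = torch.load("loras/7b__qLORA_adapter2/adapter_model.bin", map_location="cpu")
dst = {}

for name, tensor in src.items():
    if ".qkv_proj.lora_A." in name:
        # The A matrix multiplies the layer input, so it can be reused as-is.
        for proj in ("q_proj", "k_proj", "v_proj"):
            dst[name.replace("qkv_proj", proj)] = tensor.clone()
    elif ".qkv_proj.lora_B." in name:
        # The B matrix produces the fused output; split it into three equal chunks.
        q, k, v = tensor.chunk(3, dim=0)
        dst[name.replace("qkv_proj", "q_proj")] = q.contiguous()
        dst[name.replace("qkv_proj", "k_proj")] = k.contiguous()
        dst[name.replace("qkv_proj", "v_proj")] = v.contiguous()
    else:
        dst[name] = tensor

torch.save(dst, "loras/7b__qLORA_adapter2/adapter_model_unfused.bin")
```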
Well, I guess converting back is a bad idea. I'm not sure whether it's possible or worthwhile for exllama to support the fused format itself; then people could just use fused LoRAs.
Anyway, for people who ran into this issue, a temporary workaround is to train the LoRA with:
inject_fused_attention=False
inject_fused_mlp=False
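A rough sketch of that workaround, assuming AutoGPTQ's PEFT-compatible wrapper in auto_gptq.utils.peft_utils (the exact wrapper names may differ between AutoGPTQ versions, and the model path and hyperparameters are placeholders):

```python
# Load the quantized base with fused kernels disabled so the adapter targets
# separate q_proj/k_proj/v_proj modules that ExLlama can read back.
from auto_gptq import AutoGPTQForCausalLM
from auto_gptq.utils.peft_utils import GPTQLoraConfig, get_gptq_peft_model

model = AutoGPTQForCausalLM.from_quantized(
    "path/to/quantized-llama-7b",   # placeholder model path
    device="cuda:0",
    inject_fused_attention=False,
    inject_fused_mlp=False,
)

peft_config = GPTQLoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_gptq_peft_model(model, peft_config=peft_config, train_mode=True)
# ...train with your usual PEFT/Trainer loop, then save the adapter.
```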