
[Doc]: Clarify QLoRA (Quantized Model + LoRA) Support in Documentation

Open AlexanderZhk opened this issue 9 months ago • 7 comments

📚 The doc issue

Two parts of the documentation appear to contradict each other, especially at first glance.

Here, it is explicitly stated that LoRA inference with a quantized model is not supported: https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L59-L61

However, here, an example is provided for running offline inference with a quantized model and a LoRA adapter: https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/examples/offline_inference/lora_with_quantization_inference.py#L3-L4
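For reference, a minimal sketch of that offline flow with vLLM's Python API (the model name and adapter path below are placeholders I chose for illustration, not taken from the linked example):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Quantized base model + LoRA adapter, offline (no OpenAI-compatible server).
# Model name and adapter path are hypothetical placeholders.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct-AWQ",
    quantization="awq",
    enable_lora=True,
    max_lora_rank=64,
)

outputs = llm.generate(
    ["Give me a short introduction to large language models."],
    SamplingParams(temperature=0.0, max_tokens=64),
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```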

To resolve this confusion, it would be very helpful to clarify the following points directly (please correct me if I am mistaken):

  1. QLoRA is supported, but only for offline inference. This means you cannot dynamically load LoRA adapters after loading the quantized base model.
  2. QLoRA is not supported with the OpenAI-compatible server, even for a single LoRA-base model pair.

Edit:

It's easy to miss on the docs site that ##### LORA and quantization is a subsection of ### Transformers fallback; that's why I was confused.

https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L43 https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L57-L59

AlexanderZhk avatar Feb 12 '25 23:02 AlexanderZhk

I think this means that the transformers fallback doesn't support these two features. For models integrated with vLLM, we support QLoRA.

BTW, after https://github.com/vllm-project/vllm/pull/13166 landed, I think the transformers fallback can support LoRA directly, cc @Isotr0py @hmellor

jeejeelee avatar Feb 13 '25 01:02 jeejeelee

> For models integrated with vLLM, we support QLoRA.

It would be great if you could point me to a more specific example; my understanding of vLLM/transformers isn't too deep.

Take qwen2 for example: it is integrated (if I understand correctly) in vllm/model_executor/models/qwen2.py. However, running a quantized qwen2 model with vllm serve is not supported.
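For context, what I am trying to do is along these lines, i.e. a quantized qwen2 base model plus a LoRA adapter on the OpenAI-compatible server (the model name and adapter path are placeholders for illustration):

```bash
# Serve a quantized qwen2 model with a LoRA adapter via the
# OpenAI-compatible server; model name and adapter path are placeholders.
vllm serve Qwen/Qwen2-7B-Instruct-AWQ \
    --quantization awq \
    --enable-lora \
    --lora-modules my_adapter=/path/to/lora_adapter
```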

AlexanderZhk avatar Feb 13 '25 14:02 AlexanderZhk

Could you please provide more detailed information, such as logs and error messages?

jeejeelee avatar Feb 13 '25 16:02 jeejeelee

> Could you please provide more detailed information, such as logs and error messages?

That's partly why I created the issue: it does load, so why does the documentation state otherwise? Did it just not get updated? Are there issues we need to be aware of when running QLoRA currently?

[Screenshot]

AlexanderZhk avatar Feb 14 '25 19:02 AlexanderZhk

The documentation does not state otherwise.

The documentation explicitly states that quantisation and LoRA are not compatible with each other when using the Transformers fallback.

hmellor avatar Feb 14 '25 21:02 hmellor

I see now, thanks for clarifying. It's easy to miss on the docs site that ##### LORA and quantization is a subsection of ### Transformers fallback.

https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L43 https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L57-L59

AlexanderZhk avatar Feb 15 '25 02:02 AlexanderZhk

Ok, we should make that clearer. Thank you for the feedback!

hmellor avatar Feb 15 '25 11:02 hmellor

The documentation change in https://github.com/vllm-project/vllm/pull/12960 should help with this

hmellor avatar Feb 17 '25 16:02 hmellor

> Could you please provide more detailed information, such as logs and error messages?
>
> That's partly why I created the issue: it does load, so why does the documentation state otherwise? Did it just not get updated? Are there issues we need to be aware of when running QLoRA currently?
>
> [Screenshot]

I wanted to ask: how did you manage to fine-tune a LoRA adapter on an AWQ model? I thought this was not possible and that PEFT was only compatible with GPTQ.

JJEccles avatar Apr 01 '25 00:04 JJEccles

> I wanted to ask: how did you manage to fine-tune a LoRA adapter on an AWQ model? I thought this was not possible and that PEFT was only compatible with GPTQ.

The LoRA I tested with was tuned on Qwen loaded in bf16. To my knowledge, you can still use LoRA/PEFT adapters with the same model, even with a different quantization, but the quality of the answers may suffer.

I'm currently serving a LoRA on vLLM as a merge with the base model. Those screenshots are from tests I did to see which combinations of quantization, model architecture, and LoRA could be run at all with vLLM. I didn't really analyze the quality of the answers too much, but at a glance they seemed to be in the ballpark I'm expecting.
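For reference, roughly what that merge looks like with PEFT (the model name and paths are placeholders; this assumes a PEFT-format adapter trained against the bf16 base model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Merge a LoRA adapter into the bf16 base model, then save the merged
# weights so they can be quantized and/or served as a single model.
# Model name and adapter path are hypothetical placeholders.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "/path/to/lora_adapter").merge_and_unload()
merged.save_pretrained("/path/to/merged_model")
AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct").save_pretrained("/path/to/merged_model")
```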

AlexanderZhk avatar Apr 01 '25 22:04 AlexanderZhk

> I wanted to ask: how did you manage to fine-tune a LoRA adapter on an AWQ model? I thought this was not possible and that PEFT was only compatible with GPTQ.
>
> The LoRA I tested with was tuned on Qwen loaded in bf16. To my knowledge, you can still use LoRA/PEFT adapters with the same model, even with a different quantization, but the quality of the answers may suffer.
>
> I'm currently serving a LoRA on vLLM as a merge with the base model. Those screenshots are from tests I did to see which combinations of quantization, model architecture, and LoRA could be run at all with vLLM. I didn't really analyze the quality of the answers too much, but at a glance they seemed to be in the ballpark I'm expecting.

Yeah, I tried it, and while it ran, the quality of my responses was way off.

JJEccles avatar Apr 04 '25 02:04 JJEccles