[Doc]: Clarify QLoRA (Quantized Model + LoRA) Support in Documentation
📚 The doc issue
Two parts of the documentation appear to contradict each other, at least at first glance.
Here, it is explicitly stated that LoRA inference with a quantized model is not supported: https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L59-L61
However, here, an example is provided for running offline inference with a quantized model and a LoRA adapter: https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/examples/offline_inference/lora_with_quantization_inference.py#L3-L4
To resolve this confusion, it would be very helpful to clarify the following points directly (please correct me if I am mistaken):
- QLoRA is supported, but only for offline inference. This means you cannot dynamically load LoRA adapters after loading the quantized base model (see the sketch after this list).
- QLoRA is not supported with the OpenAI-compatible server, even for a single LoRA-base model pair.
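For context, here is a minimal sketch of the offline path being discussed, loosely following the linked `lora_with_quantization_inference.py` example. The model name and adapter path are placeholders, not values taken from this thread:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Hypothetical AWQ-quantized base model and locally saved LoRA adapter.
BASE_MODEL = "your-org/your-awq-quantized-model"  # placeholder
LORA_PATH = "/path/to/your/lora_adapter"          # placeholder

# Load the quantized base model with LoRA support enabled up front;
# adapters are not attached dynamically after the engine is built.
llm = LLM(model=BASE_MODEL, quantization="awq", enable_lora=True)

sampling = SamplingParams(temperature=0.0, max_tokens=64)

# The adapter is passed per request via LoRARequest(name, int_id, local_path).
outputs = llm.generate(
    ["Give me a short introduction to large language models."],
    sampling,
    lora_request=LoRARequest("my_adapter", 1, LORA_PATH),
)
print(outputs[0].outputs[0].text)
```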
Edit:
It's easy to miss on the docs site that `##### LoRA and quantization` is a subsection of `### Transformers fallback`; that's why I was confused.
https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L43 https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L57-L59
I think this means that the Transformers fallback doesn't support these two features. For models integrated with vLLM, we support QLoRA.
BTW, after https://github.com/vllm-project/vllm/pull/13166 landed, I think the Transformers fallback can support LoRA directly. cc @Isotr0py @hmellor
For models integrated with vLLM, we support QLoRA.
It would be great if you could point me to a more specific example; my understanding of vLLM/Transformers isn't too deep.
Take qwen2 for example: if I understand correctly, it is integrated here: `vllm/model_executor/models/qwen2.py`.
However, running a quantized qwen2 model with `vllm serve` is not supported.
Could you please provide more detailed information, such as log output and errors?
That's partly why I created the issue: it does load, so why does the documentation state otherwise? Did it just not get updated? Are there issues we need to be aware of when running QLoRA currently?
The documentation does not state otherwise.
The documentation explicitly states that quantisation and LoRA are not compatible together with the Transformers fallback.
I see now, thanks for clarifying. It's easy to miss on the docs site that `##### LoRA and quantization` is a subsection of `### Transformers fallback`.
https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L43 https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L57-L59
Ok, we should make that clearer. Thank you for the feedback!
The documentation change in https://github.com/vllm-project/vllm/pull/12960 should help with this
I wanted to ask: how did you manage to fine-tune a LoRA adapter on an AWQ model? I thought this was not possible and PEFT was only compatible with GPTQ.
The LoRA I tested with was tuned with qwen loaded in bf16. To my knowledge, you can still use LoRA/PEFT adapters with the same model, even with a different quantization, but the quality of the answers may suffer.
I'm currently serving a LoRA on vLLM as a merge with the base model. Those screenshots are from tests I did to see which combinations of quantization, model architecture and LoRA could be run at all with vLLM. I didn't really analyze the quality of the answers too much, but at a glance they seemed to be in the ballpark I'm expecting.
Yeah, I tried, and while it ran, the quality of the responses was way off.
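For anyone landing here, a rough sketch of the "merge the LoRA into the base model" approach mentioned a couple of comments up, using PEFT's `merge_and_unload`. The model ID and paths are assumptions for illustration; the setup actually used in this thread may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Hypothetical names; substitute your own bf16 base model and LoRA adapter.
BASE_MODEL = "Qwen/Qwen2-7B-Instruct"   # assumption, not confirmed in this thread
ADAPTER_PATH = "/path/to/lora_adapter"  # placeholder
OUTPUT_DIR = "/path/to/merged_model"

# Load the base model in bf16, attach the adapter, then fold the LoRA
# weights into the base weights so the result is a plain checkpoint.
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, ADAPTER_PATH).merge_and_unload()

# Save the merged model; it can then be quantized and/or served with
# `vllm serve` like any ordinary model, with no LoRA support needed at runtime.
merged.save_pretrained(OUTPUT_DIR)
AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained(OUTPUT_DIR)
```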