[Doc]: Clarify QLoRA (Quantized Model + LoRA) Support in Documentation
📚 The doc issue
Two parts of the documentation appear to contradict each other, at least at first glance.
Here, it is explicitly stated that LoRA inference with a quantized model is not supported: https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L59-L61
However, here, an example is provided for running offline inference with a quantized model and a LoRA adapter: https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/examples/offline_inference/lora_with_quantization_inference.py#L3-L4
To resolve this confusion, it would be very helpful to clarify the following points directly (please correct me if I am mistaken):
- QLoRA is supported, but only for offline inference. This means you cannot dynamically load LoRA adapters after loading the quantized base model (see the sketch after this list).
- QLoRA is not supported with the OpenAI-compatible server, even for a single LoRA-base model pair.
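For context, here is a minimal sketch of the offline path being discussed, loosely following the linked `lora_with_quantization_inference.py` example. The model name and adapter path are placeholders, not values taken from this thread:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Hypothetical AWQ-quantized base model and locally saved LoRA adapter.
BASE_MODEL = "your-org/your-awq-quantized-model"  # placeholder
LORA_PATH = "/path/to/your/lora_adapter"          # placeholder

# Load the quantized base model with LoRA support enabled up front;
# adapters are not attached dynamically after the engine is built.
llm = LLM(model=BASE_MODEL, quantization="awq", enable_lora=True)

sampling = SamplingParams(temperature=0.0, max_tokens=64)

# The adapter is passed per request via LoRARequest(name, int_id, local_path).
outputs = llm.generate(
    ["Give me a short introduction to large language models."],
    sampling,
    lora_request=LoRARequest("my_adapter", 1, LORA_PATH),
)
print(outputs[0].outputs[0].text)
```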
Edit:
It's easy to miss on the docs site that `##### LoRA and quantization` is a subsection of `### Transformers fallback`; that's why I was confused.
https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L43 https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L57-L59
I think this means that the Transformers fallback doesn't support these two features. For models integrated with vLLM, we support QLoRA.
BTW, after https://github.com/vllm-project/vllm/pull/13166 landed, I think the Transformers fallback can support LoRA directly. cc @Isotr0py @hmellor
For models integrated with vLLM, we support QLoRA.
It would be great if you could point me to a more specific example; my understanding of vLLM/Transformers isn't too deep.
Take qwen2 for example: if I understand correctly, it is integrated here: `vllm/model_executor/models/qwen2.py`.
However, running a quantized qwen2 model with `vllm serve` is not supported.
Could you please provide more detailed information, such as log output and errors?
That's partly why I created the issue: it does load, so why does the documentation state otherwise? Did it just not get updated? Are there issues we need to be aware of when running QLoRA currently?
The documentation does not state otherwise.
The documentation explicitly states that quantisation and LoRA are not compatible together with the Transformers fallback.
I see now, thanks for clarifying. It's easy to miss on the docs site that `##### LoRA and quantization` is a subsection of `### Transformers fallback`.
https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L43 https://github.com/vllm-project/vllm/blob/4c0d93f4b2de241336f4732cb5799cee8fedcb52/docs/source/models/supported_models.md?plain=1#L57-L59
Ok, we should make that clearer. Thank you for the feedback!
The documentation change in https://github.com/vllm-project/vllm/pull/12960 should help with this
I wanted to ask: how did you manage to fine-tune a LoRA adapter on an AWQ model? I thought this was not possible and PEFT was only compatible with GPTQ.
The LoRA I tested with was tuned with qwen loaded in bf16. To my knowledge, you can still use LoRA/PEFT adapters with the same model, even with a different quantization, but the quality of the answers may suffer.
I'm currently serving a LoRA on vLLM as a merge with the base model. Those screenshots are from tests I did to see which combinations of quantization, model architecture and LoRA could be run at all with vLLM. I didn't really analyze the quality of the answers too much, but at a glance they seemed to be in the ballpark I'm expecting.
Yeah, I tried, and while it ran, the quality of the responses was way off.
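For anyone landing here, a rough sketch of the "merge the LoRA into the base model" approach mentioned a couple of comments up, using PEFT's `merge_and_unload`. The model ID and paths are assumptions for illustration; the setup actually used in this thread may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Hypothetical names; substitute your own bf16 base model and LoRA adapter.
BASE_MODEL = "Qwen/Qwen2-7B-Instruct"   # assumption, not confirmed in this thread
ADAPTER_PATH = "/path/to/lora_adapter"  # placeholder
OUTPUT_DIR = "/path/to/merged_model"

# Load the base model in bf16, attach the adapter, then fold the LoRA
# weights into the base weights so the result is a plain checkpoint.
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, ADAPTER_PATH).merge_and_unload()

# Save the merged model; it can then be quantized and/or served with
# `vllm serve` like any ordinary model, with no LoRA support needed at runtime.
merged.save_pretrained(OUTPUT_DIR)
AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained(OUTPUT_DIR)
```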