vllm
support QLoRA
QLoRA (https://arxiv.org/abs/2305.14314) cuts memory consumption in LLM weight loading without degrading performance. The weights of the base model, which are quantized to 4 bits, are paired with low-rank, higher-precision LoRA weight matrices to generate the output.
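As a rough illustration of this idea (a sketch, not code from this PR; tensor names and shapes are assumptions), a QLoRA linear layer combines the dequantized 4-bit base weight with the higher-precision low-rank update:

```python
import torch

def qlora_linear_forward(x, w_base_dequant, lora_A, lora_B, scaling):
    """Sketch of a QLoRA linear layer.

    x:              [batch, in_features] activations
    w_base_dequant: [out_features, in_features] 4-bit base weight after dequantization
    lora_A:         [rank, in_features] low-rank down-projection (higher precision)
    lora_B:         [out_features, rank] low-rank up-projection (higher precision)
    scaling:        lora_alpha / rank
    """
    base_out = x @ w_base_dequant.t()          # output of the quantized base model
    lora_out = (x @ lora_A.t()) @ lora_B.t()   # low-rank correction term
    return base_out + scaling * lora_out

if __name__ == "__main__":
    x = torch.randn(2, 64)
    w = torch.randn(128, 64)                 # stands in for the dequantized 4-bit weight
    A, B = torch.randn(8, 64), torch.randn(128, 8)
    print(qlora_linear_forward(x, w, A, B, scaling=2.0).shape)  # torch.Size([2, 128])
```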
This MR is the first step in supporting QLoRA in vLLM. With this PR, the QLoRA author's open models on Hugging Face, such as the following, are supported:
- https://huggingface.co/timdettmers/qlora-flan-7b (its corresponding base model is "huggyllama/llama-7b")
Users can run with or without a QLoRA adapter.
So far, only Llama is supported as the base model; more will come in the future. As explained below, special consideration is given to extensibility for future changes and other models. Also, TP and PP are not yet supported with QLoRA; they will be the immediate next effort.
Explanation of Changes
The modified files mainly include:
- Modify vllm/config.py and vllm/engine/arg_utils.py: Add new CLI parameters for QLoRA. Two new parameters are added:
- qlora_adapter_name_or_path: the path to the adapter repo. It can be empty.
- qlora_refresh_quant_cache: whether to reuse the cache of the base model's existing quantized weights.
- Modify vllm/model_loader/weight_utils.py: Add logic to read the adapter configuration from the LoRA Hugging Face repo.
- Modify vllm/model_executor/layers/linear.py: Add the QLoRA tensor-concatenation logic to the weight_loader() function of the QKVParallelLinear and MergedColumnParallelLinear classes (see the sketch after this list).
- Modify vllm/model_loader/loader.py: In the QLoRA case, dispatch to the model's QLoRA version of weight loading, qlora_load_weights, instead of the original load_weights.
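As a simplified illustration of the concatenation mentioned above (a sketch of the general idea only, not the vLLM code; the real QLoRA path also has to handle the packed 4-bit layout and its quantization state):

```python
import torch

def fuse_qkv_weight(q_w: torch.Tensor, k_w: torch.Tensor, v_w: torch.Tensor) -> torch.Tensor:
    # A checkpoint stores separate q/k/v projection weights, while the fused
    # QKVParallelLinear layer expects a single tensor stacked along the output dim.
    return torch.cat([q_w, k_w, v_w], dim=0)

def fuse_gate_up_weight(gate_w: torch.Tensor, up_w: torch.Tensor) -> torch.Tensor:
    # Same idea for MergedColumnParallelLinear (e.g. gate_proj / up_proj in Llama).
    return torch.cat([gate_w, up_w], dim=0)
```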
The newly added files are:
- vllm/model_executor/layers/quantization/qlora.py: Here, similar to other quantization methods, we define two classes, QLoRAConfig(QuantizationConfig) and QLoRALinearMethod(LinearMethodBase). This is the core change of the entire PR.
- vllm/model_executor/layers/quantization/qlora_utils.py: This file includes the functions necessary for the QLoRA implementation:
- A decorator class to mark a model as QLoRA-supported. So far, only Llama is marked. When running QLoRA with unverified models, a warning is given.
- The qlora_load_weights function. It is designed so that the original load_weights function of the model stays untouched, and it is as agnostic of the model architecture as possible. qlora_load_weights is a wrapper around the existing model.load_weights() to ensure extensibility to models other than Llama as well as to future changes in models that already support QLoRA.
- examples/qlora_offline_inference.py: Demonstration of the use of QLoRA, both with and without an adapter (a usage sketch follows this list).
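For reference, a minimal usage sketch in the spirit of that example; the constructor keyword arguments here mirror the new CLI parameters described above but are assumptions, not the final API:

```python
from vllm import LLM, SamplingParams

# With a QLoRA adapter. The keyword arguments mirror the new CLI parameters
# described above; the exact constructor signature is an assumption.
llm = LLM(
    model="huggyllama/llama-7b",
    quantization="qlora",
    qlora_adapter_name_or_path="timdettmers/qlora-flan-7b",
)

# Without an adapter, simply leave qlora_adapter_name_or_path empty.

outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=16))
print(outputs[0].outputs[0].text)
```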
ping @Yard1
This completely bypasses the existing LoRA logic and implements its own. I don't think this is a good design and it clashes with already existing code. We should instead modify the LoRA support already present in vLLM to support QLoRA - it should also allow us to reuse a lot of existing code.
Thanks for your reply. You are not the first one to raise this concern. Actually, I asked myself the same question. :-)
I considered reusing the LoRA code in the first place. I had to start a new set of code because:
- The existing LoRA support in vLLM implements punica (https://github.com/punica-ai/punica), a multi-tenant LoRA scenario. A lot of effort has gone into the LoRA manager, which handles the case where different sets of fine-tuned weights share the same base model. QLoRA, though it carries a very similar name, works in a totally different scenario, so the existing LoRA code in vLLM cannot be reused.
- punica is based on the CUDA code of BGMV, and BGMV does not support any quantization. But in QLoRA, quantizing the base model is the key point in saving memory. This is another reason I had to deviate from reusing LoRA.
- On the other hand, QLoRA uses a different set of CUDA code. The QLoRA author provides the CUDA implementation, packed in the bitsandbytes Python package, which is what the QLoRA implementation in the Hugging Face transformers package uses. So I moved away from reusing the LoRA code.
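For context, a minimal sketch of the bitsandbytes 4-bit path being referred to (requires a CUDA GPU; exact API details may vary across bitsandbytes versions):

```python
import torch
import bitsandbytes as bnb

# A 4-bit NF4 linear layer backed by the bitsandbytes CUDA kernels -- the same
# kernels the Hugging Face transformers QLoRA integration relies on.
layer = bnb.nn.Linear4bit(4096, 4096, bias=False,
                          compute_dtype=torch.float16,
                          quant_type="nf4")
layer = layer.cuda()  # moving to GPU quantizes the weights to 4 bit

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = layer(x)  # dequantize-and-matmul happens inside the bitsandbytes kernel
print(y.shape)
```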
How about I add some comments somewhere to address your concern?
Is it theoretically possible for the QLoRA adapter to be loaded and unloaded at will?
I am not sure what you mean by "at will". Do you mean load/unload during runtime?
In this implementation, users can load an adapter by specifying "qlora_adapter_name_or_path" as a parameter when starting inference. Users can also run without an adapter by leaving this parameter empty.
However, users cannot switch the adapter at runtime. Switching adapters is not a scenario supported in the QLoRA design.
The main goal of QLoRA is to use the LoRA weights to compensate for the loss caused by the 4-bit quantization of the base model, so it is a quantization technique. Switching LoRA adapters to support different fine-tuning scenarios, as in punica, is not among its design goals.
Ok, that's what I wanted to confirm. Thanks for clearing it up. In that case:
- for consistency, I would suggest ditching the `qlora_supported` decorator and just specify the class attribute directly on the class
- we should avoid the `if model_config.quantization == "qlora":` pattern in linear layer and weight loading code - instead we should use abstractions (and add them if they are missing). For example, we should add a `QLoRAModelLoader` which can subclass/compose `DefaultModelLoader`. Same for linear layer - we should avoid adding special cases to generic implementations (I understand this pattern is not always followed in the codebase, but we should hold new code to a higher standard - happy to discuss what sort of API we need to add to get rid of the `Special case for Quantized Weights.` in the linear layer implementation)
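A rough sketch of the kind of abstraction suggested here (the import path is approximate; the class, method, and helper names beyond `DefaultModelLoader` are hypothetical):

```python
# Import path is approximate; method and helper names below are hypothetical.
from vllm.model_executor.model_loader.loader import DefaultModelLoader


class QLoRAModelLoader(DefaultModelLoader):
    """Hypothetical loader that keeps QLoRA-specific handling out of the generic
    loading path instead of branching on `model_config.quantization == "qlora"`
    inside shared loader and linear-layer code."""

    def load_model(self, *args, **kwargs):
        model = super().load_model(*args, **kwargs)
        self._apply_qlora_weights(model)
        return model

    def _apply_qlora_weights(self, model):
        # Quantize the base weights to 4 bit and attach the LoRA adapter here.
        pass
```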
Thanks for the suggestion. I will make the changes as suggested.
Cheers!
Thank you for your excellent work. Here are some personal opinions:
- vLLM already supports quantized models with LoRA, refer to quant model+lora. These can be generalized as QLoRA (e.g., GPTQ+LoRA), and all of them support switching adapters (a usage sketch follows this comment).
- For the original QLoRA (https://arxiv.org/abs/2305.14314), I think we should add a new quantization method named `bitsandbytes` (e.g., BAB+LoRA), refer to https://github.com/vllm-project/vllm/issues/4033, and then we can reuse the current LoRA logic.
- Regardless of LoRA or QLoRA, Punica can support these.
If I am wrong, please correct me directly. Thanks again.
Cheers!
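For reference, a usage sketch of the existing quantized-model + LoRA path mentioned above (the model and adapter paths are placeholders):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder model and adapter paths -- substitute a real GPTQ checkpoint and
# a LoRA adapter trained for it.
llm = LLM(model="some-org/llama-7b-gptq", quantization="gptq", enable_lora=True)

prompts = ["The capital of France is"]
params = SamplingParams(temperature=0.0, max_tokens=16)

# Each request can carry a different adapter, which is what makes adapter
# switching possible on top of a quantized base model.
outputs = llm.generate(
    prompts,
    params,
    lora_request=LoRARequest("my_adapter", 1, "/path/to/adapter"),
)
print(outputs[0].outputs[0].text)
```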
I re-read the LoRA code carefully and saw that quantization is now supported in LoRA. It was not supported when I started my design and coding. Sorry for missing that.
I will rethink my design based on this change, as well as Yard1's suggestions.
Thanks & Happy Coding!
@Yard1 @jeejeelee
I just updated the MR of QLoRA/BitsAndBytes with the suggested changes. Could you please take another look?
Thanks for the great advice from all of you. Learned a lot and improved a lot. :-)
BTW, I hit a lot of yapf errors in CI/CD. I found that the yapf errors are not from me. Should I just ignore them?
@chenqianfzh We cannot ignore format errors; you can run bash format.sh to check for format errors.
We should also add a test for this - it's ok if it's just an end to end one (load a small model from huggingface hub and see if it works and gives good outputs)
@mgoin @Yard1 @jeejeelee
Thanks for the feedback. Working on the changes now.
The newly added file examples/qlora_inference.py was created for this purpose. In this file, bitsandbytes quantization both with and without LoRA adapters is tested.
Here is the output I got in my local test (of the four prompts, the last is without a LoRA adapter; the other three are with adapters):
--------------------------------------------------------------------------
Prompt: The capital of France is
Output: Paris.
--------------------------------------------------------------------------
Prompt: The capital of USA is
Output: Washington DC.
--------------------------------------------------------------------------
Prompt: my name is
Output: john and i am a 20 year old male. i am a student at the university of maryland. i am a sophomore and i am majoring in business. i am a very outgoing person and i love to meet new people. i am a very social person and i love to party. i am a very outgoing person and i love to meet new people. i am a very social person and i love to party.
--------------------------------------------------------------------------
Prompt: My name is
Output: Kyle and I am a 20 year old college student. I am a huge fan of the outdoors and love to hike, camp, and fish. I am a very active person and love to stay busy. I am a very outgoing person and love to meet new people. I am a very easy going person and love to have fun. I am a very hard worker and love to work. I am a very trustworthy person and love to help people. I am a very caring person and love to help people. I am a very respectful person and love to respect others. I am a
@chenqianfzh example is fine, but we need an automated pytest test to run in CI to prevent regressions.
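A minimal sketch of what such an end-to-end pytest could look like (the model name, quantization string, and expected substring are placeholder assumptions, not the actual test added to the PR):

```python
import pytest
from vllm import LLM, SamplingParams


@pytest.mark.parametrize("prompt,expected", [
    ("The capital of France is", "Paris"),
])
def test_qlora_end_to_end(prompt: str, expected: str):
    # Placeholder model/quantization settings: a real CI test would load a small
    # 4-bit-quantizable model from the Hugging Face hub and an adapter for it.
    llm = LLM(model="huggyllama/llama-7b", quantization="qlora")
    outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=16))
    assert expected in outputs[0].outputs[0].text
```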
@chenqianfzh Can we add more quantization-type examples in qlora_example.py, such as GPT+LoRA, so that users can refer to this script to learn how to utilize LoRA on quantized models? Thanks.
@jeejeelee @Yard1 @mgoin
I have updated the PR, addressing and resolving all the comments. Additionally, I have added the necessary unit tests. Could you please review it again?
However, I was unable to "add more quantization type examples in qlora_example.py" at this time. Currently, Llama is the only model supported in this MR. Expanding support to more models is my next task. As suggested by @jeejeelee, GPT will likely be the next model to support, given the availability of several GPT-based QLoRA models, such as https://huggingface.co/vineetsharma/qlora-gpt-neox-20b-english_quotes/
It's important to note that adding support for GPT may require additional effort because vLLM does not currently support LoRA in GPT-NeoX. Therefore, I believe it would be more appropriate to address this in a separate PR.
Thank you again for all the great suggestions.
My apologies, there was an issue with my spelling. What I actually meant to say was GPTQ+LORA. I'm very sorry.
@jeejeelee I see. Yeah, I can do that.
I see you are the one who made LoRA support quant methods. Could you let me know your suggestions on which models to use to show the usage of GPTQ+LoRA?
Thanks, you can refer to test_quant_model.py
Thanks! In the future, please avoid force pushing as it makes reviews harder.
I will try to. However, as the main branch is moving really fast, I need to rebase from time to time, and I have to force push after rebasing.
Do you mean that I should keep the changes based on comments as a separate commit?
I recommend merging the main branch instead of rebasing. That way we have the entire commit history on the PR. You don't need to make any changes here, just an ask for the future!
@jeejeelee @Yard1
The PR was updated per your comments. Thanks for reviewing it. Could you take another look?
There are some CI failures, but it looks like someone is fixing them.
I tried to merge instead of rebasing. However, I see an extra merge commit in my private branch. Any idea what went wrong?
Thanks.
@chenqianfzh the merge commit is expected, that's just how git works
I did something wrong when squashing commits before merging, so the commits are mixed up. Sorry to make your review more difficult. :-(
Thanks, left two last nits! We can merge after those are resolved.
I've updated the code based on your feedback and have omitted one comment, for which I've provided an explanation. Could you please take a look?
Thanks.
@Yard1 I kept trying the CI tests over the past two days but hit all kinds of weird errors; for example, the latest failure is due to a missing container in the AMD tests.
I did not find a way to restart the specific tests. Could you let me know what to do? Thanks.
It's OK, we'll just have a maintainer force merge it. Can you resolve https://github.com/vllm-project/vllm/pull/4776#discussion_r1619635289 and I will accept
@mgoin Thanks for reviewing the PR!
I updated the code per your comments. Could you take another look?
Hey, thanks for the feature.
Why do you make sure 'lm_head' is not quantized in your tests, while peft accepts 'lm_head' among the target_modules? I was trying to run inference on a model fine-tuned with QLoRA and I get the following error:
File "/opt/conda/envs/llm/lib/python3.9/site-packages/vllm/model_executor/models/llama.py", line 436, in load_weights
param = params_dict[name]
KeyError: 'lm_head.qweight'
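For context, one way to avoid this error is to train the adapter without 'lm_head' in its target modules. A typical PEFT configuration along those lines might look like this (the module names are the usual Llama projection names, shown purely as an example):

```python
from peft import LoraConfig

# Adapters trained with a config like this (no "lm_head" in target_modules)
# should avoid the lm_head.qweight KeyError shown above, since the
# language-model head is left unquantized and unadapted.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```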