support QLoRA

Open chenqianfzh opened this issue 1 year ago • 8 comments

QLoRA (https://arxiv.org/abs/2305.14314) cuts memory consumption when loading LLM weights without degrading performance. The weights of the base model are quantized to 4 bits and paired with a higher-precision low-rank weight matrix to produce the output.

This PR is the first step toward supporting QLoRA in vLLM. With this PR, the QLoRA authors' open models on Hugging Face are supported, for example:

  • https://huggingface.co/timdettmers/qlora-flan-7b (its corresponding base model is "huggyllama/llama-7b")

Users can run with or without a QLoRA adapter.

So far, only Llama is supported as a base model; more will come in the future. As explained below, special consideration was given to extensibility toward future changes and other models. TP and PP are not yet supported with QLoRA; they will be the immediate next effort.
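For orientation, a minimal sketch of the QLoRA compute pattern described above (tensor names and shapes are illustrative; this is not the code added by this PR):

```python
import torch
import torch.nn.functional as F

def qlora_linear(x: torch.Tensor,
                 w_dequant: torch.Tensor,  # base weight after 4-bit dequantization
                 lora_a: torch.Tensor,     # [r, in_features], higher precision
                 lora_b: torch.Tensor,     # [out_features, r], higher precision
                 scaling: float) -> torch.Tensor:
    # Base projection using the (dequantized) 4-bit weights.
    base = F.linear(x, w_dequant)
    # Low-rank correction that compensates for quantization loss.
    delta = F.linear(F.linear(x, lora_a), lora_b)
    return base + scaling * delta
```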

Explanation on Changes

Modified files mainly include

  • Modify vllm/config.py and vllm/engine/arg_utils.py: add new CLI parameters for QLoRA. Two new parameters are added (see the usage sketch after this list):
    • qlora_adapter_name_or_path: the path to the adapter repo; may be empty.
    • qlora_refresh_quant_cache: whether to refresh or re-use the cached quantized weights of the base model.
  • Modify vllm/model_loader/weight_utils.py: add logic to read the adapter configuration from the LoRA Hugging Face repo.
  • Modify vllm/model_executor/layers/linear.py: add the QLoRA tensor-concatenation logic to the weight_loader() functions of the QKVParallelLinear and MergedColumnParallelLinear classes.
  • Modify vllm/model_loader/loader.py: in the QLoRA case, call the model's qlora_load_weights instead of the original load_weights.
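As a rough illustration of how the two new parameters are meant to be used from the offline API (the exact way they surface on the LLM constructor is an assumption, not taken from this PR):

```python
from vllm import LLM, SamplingParams

# Hypothetical usage of the two new engine arguments described above;
# the exact keyword names on the LLM constructor may differ in the PR.
llm = LLM(
    model="huggyllama/llama-7b",
    quantization="qlora",
    qlora_adapter_name_or_path="timdettmers/qlora-flan-7b",  # may be left empty
    qlora_refresh_quant_cache=False,  # re-use cached quantized base weights
)

outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```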

The newly added files are:

  • vllm/model_executor/layers/quantization/qlora.py: here, similar to other quantization methods, we define two classes, QLoRAConfig(QuantizationConfig) and QLoRALinearMethod(LinearMethodBase). This is the core change of the entire PR.
  • vllm/model_executor/layers/quantization/qlora_utils.py: this file contains the functions the QLoRA implementation needs (a rough sketch of the pattern follows this list):
    • A decorator to mark a model as QLoRA-supported. So far, only Llama is marked; running QLoRA with unverified models produces a warning.
    • The qlora_load_weights function. It is designed so that the model's original load_weights function stays untouched and so that it is as agnostic of the model architecture as possible: it wraps the existing model.load_weights() to keep extensibility to models beyond Llama as well as to future changes in already-supported models.
  • examples/qlora_offline_inference.py: demonstrates the use of QLoRA, both with and without an adapter.
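For illustration, the decorator-plus-wrapper pattern described above might look roughly like this (sketched as a plain function decorator; the exact names and signatures in qlora_utils.py are assumptions):

```python
import warnings

def qlora_supported(cls):
    """Mark a model class as verified to work with QLoRA."""
    cls.qlora_supported = True
    return cls

def qlora_load_weights(model, weights):
    """Wrap model.load_weights() with QLoRA-specific handling.

    Illustrative only: the real implementation prepares the 4-bit base
    weights and the LoRA tensors, then delegates to the model's own
    load_weights(), which stays untouched.
    """
    if not getattr(model, "qlora_supported", False):
        warnings.warn("This model has not been verified with QLoRA.")
    # ... prepare/transform the QLoRA weight tensors here ...
    model.load_weights(weights)
```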

chenqianfzh avatar May 12 '24 23:05 chenqianfzh

ping @Yard1

jeejeelee avatar May 13 '24 02:05 jeejeelee

This completely bypasses the existing LoRA logic and implements its own. I don't think this is a good design and it clashes with already existing code. We should instead modify the LoRA support already present in vLLM to support QLoRA - it should also allow us to reuse a lot of existing code.

Thanks for your reply. You are not the first one to raise this concern. Actually, I asked myself the same question. :-)

I did consider re-using LoRA in the first place. I had to start a new set of code because:

  1. The existing LoRA support in vLLM implements punica (https://github.com/punica-ai/punica), a multi-tenant LoRA scenario. A lot of effort went into the LoRA manager, which handles the case where different sets of fine-tuned weights share the same base model. But QLoRA, though it carries a very similar name, targets a totally different scenario, so the existing LoRA code in vLLM cannot be re-used.

  2. punica is based on the CUDA code of BGMV, and BGMV does not support any quantization. In QLoRA, however, quantizing the base model is the key to saving memory. This is another reason I had to deviate from reusing LoRA.

  3. QLoRA uses a different set of CUDA kernels. The QLoRA authors provide their CUDA implementation in the bitsandbytes Python package, which is what the QLoRA implementation in the Hugging Face transformers package uses (a rough sketch of that API follows). So I moved away from re-using the LoRA code.
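For context, this is roughly how the bitsandbytes 4-bit kernels are exposed in Python on the Hugging Face transformers QLoRA path (a minimal sketch assuming bitsandbytes is installed and a CUDA GPU is available; not the code in this PR):

```python
import torch
import bitsandbytes as bnb

# A 4-bit (NF4) linear layer backed by the bitsandbytes CUDA kernels.
# Moving the layer to the GPU quantizes its weights to 4 bits.
layer = bnb.nn.Linear4bit(
    4096, 4096,
    bias=False,
    compute_dtype=torch.float16,
    quant_type="nf4",
).cuda()

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = layer(x)  # matmul with on-the-fly dequantization via bitsandbytes
```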

How about I add some comments somewhere to address this concern?

chenqianfzh avatar May 13 '24 17:05 chenqianfzh

Is it theoretically possible for the QLoRA adapter to be loaded and unloaded at will?

Yard1 avatar May 13 '24 17:05 Yard1

Is it theoretically possible for the QLoRA adapter to be loaded and unloaded at will?

I am not sure what you mean by "at will". Do you mean load/unload during runtime?

In this implementation, the user can load an adapter by specifying "qlora_adapter_name_or_path" when starting inference. The user can also run without an adapter by leaving that parameter empty.

However, the user cannot switch adapters at runtime. Switching adapters is not a scenario supported by the QLoRA design.

The main goal of QLoRA is to use the LoRA weights to compensate for the loss caused by the 4-bit quantization of the base model, so it is essentially a quantization technique. Switching LoRA adapters to support different fine-tuning scenarios, as in punica, is not among its design goals.

chenqianfzh avatar May 13 '24 21:05 chenqianfzh

Ok, that's what I wanted to confirm. Thanks for clearing it up. In that case:

  1. for consistency, I would suggest ditching the qlora_supported decorator and just specifying the class attribute directly on the class
  2. we should avoid the if model_config.quantization == "qlora": pattern in the linear-layer and weight-loading code; instead we should use abstractions (and add them if they are missing). For example, we should add a QLoRAModelLoader which can subclass/compose DefaultModelLoader. The same goes for the linear layer: we should avoid adding special cases to generic implementations (I understand this pattern is not always followed in the codebase, but we should hold new code to a higher standard). Happy to discuss what sort of API we need to add to get rid of the "Special case for Quantized Weights." in the linear-layer implementation.

Yard1 avatar May 13 '24 21:05 Yard1
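For illustration, a minimal sketch of the loader abstraction suggested above might look like this (the class name, module path, and load_model signature are assumptions, not code from the PR):

```python
# Hypothetical sketch; the actual loader module path and load_model()
# signature in vLLM may differ from what is shown here.
from vllm.model_executor.model_loader.loader import DefaultModelLoader

class QLoRAModelLoader(DefaultModelLoader):
    """Compose the default loading path, then apply QLoRA-specific handling."""

    def load_model(self, **kwargs):
        model = super().load_model(**kwargs)
        # Quantize the base weights to 4 bits and attach the LoRA tensors here,
        # instead of special-casing "qlora" inside the generic loader and
        # linear-layer code.
        return model
```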

Ok, that's what I wanted to confirm. Thanks for clearing it up. In that case:

  1. for consistency, I would suggest ditching the qlora_supported decorator and just specifying the class attribute directly on the class
  2. we should avoid the if model_config.quantization == "qlora": pattern in the linear-layer and weight-loading code; instead we should use abstractions (and add them if they are missing). For example, we should add a QLoRAModelLoader which can subclass/compose DefaultModelLoader. The same goes for the linear layer: we should avoid adding special cases to generic implementations (I understand this pattern is not always followed in the codebase, but we should hold new code to a higher standard). Happy to discuss what sort of API we need to add to get rid of the "Special case for Quantized Weights." in the linear-layer implementation.

Thanks for the suggestion. I will make the changes as suggested.

Cheers!

chenqianfzh avatar May 14 '24 00:05 chenqianfzh

Thank you for your excellent work. Here are some personal opinions:

  • vLLM already supports quantized models with LoRA; refer to quant model+lora. These can be generalized as QLoRA (e.g., GPTQ+LoRA), and all of them support switching adapters.
  • For the original QLoRA (https://arxiv.org/abs/2305.14314), I think we should add a new quantization method named bitsandbytes (e.g., BAB+LoRA); refer to https://github.com/vllm-project/vllm/issues/4033, and then we can reuse the current LoRA logic.
  • Regardless of LoRA or QLoRA, Punica can support both.

If I am wrong, please correct me directly. Thanks again.

Cheers!

jeejeelee avatar May 14 '24 02:05 jeejeelee
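For reference, a rough sketch of the existing quantized-model-plus-LoRA path mentioned above (the model and adapter names are placeholders; LoRARequest is the existing vLLM LoRA interface):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder names: a GPTQ-quantized base model served through the existing
# vLLM LoRA support, which also allows switching adapters per request.
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",  # placeholder GPTQ model
    quantization="gptq",
    enable_lora=True,
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=16),
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora"),  # placeholder adapter
)
print(outputs[0].outputs[0].text)
```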

Thank you for your excellent work. Here are some personal opinions:

  • vLLM already supports quantized models with LoRA; refer to quant model+lora. These can be generalized as QLoRA (e.g., GPTQ+LoRA), and all of them support switching adapters.
  • For the original QLoRA (https://arxiv.org/abs/2305.14314), I think we should add a new quantization method named bitsandbytes (e.g., BAB+LoRA); refer to [Feature]: bitsandbytes support #4033, and then we can reuse the current LoRA logic.
  • Regardless of LoRA or QLoRA, Punica can support both.

If I am wrong, please correct me directly. Thanks again.

Cheers!

I re-read the LoRA code carefully and saw that quantization is now supported in LoRA. It was not yet supported when I started my design and coding. Sorry for missing that.

I will re-think my design again based on this change, as well as Yard1's suggestions.

Thanks & Happy Coding!

chenqianfzh avatar May 14 '24 20:05 chenqianfzh

@Yard1 @jeejeelee

I just updated the QLoRA/BitsAndBytes PR with the suggested changes. Could you please take another look?

Thanks for the great advice from both of you. I learned a lot and improved a lot. :-)

BTW, I hit a lot of yapf errors in CI/CD, but I found that the yapf errors are not from my changes. Should I just ignore them?

chenqianfzh avatar May 23 '24 01:05 chenqianfzh

@chenqianfzh We cannot ignore format errors; you can run bash format.sh to check for them

jeejeelee avatar May 23 '24 02:05 jeejeelee

We should also add a test for this - it's ok if it's just an end to end one (load a small model from huggingface hub and see if it works and gives good outputs)

Yard1 avatar May 23 '24 07:05 Yard1

@mgoin @Yard1 @jeejeelee

Thanks for the feedback. Working on the changes now.

chenqianfzh avatar May 23 '24 16:05 chenqianfzh

We should also add a test for this - it's ok if it's just an end to end one (load a small model from huggingface hub and see if it works and gives good outputs)

The newly added file examples/qlora_inference.py was created for this purpose. In it, bitsandbytes quantization is tested both with and without LoRA adapters.

Here is the output I got in my local test (of the four runs, the last is without a LoRA adapter; the other three are with adapters):

--------------------------------------------------------------------------
Prompt: The capital of France is 
Output:  Paris.
--------------------------------------------------------------------------
Prompt: The capital of USA is 
Output:  Washington DC.
--------------------------------------------------------------------------
Prompt: my name is 
Output:  john and i am a 20 year old male. i am a student at the university of maryland. i am a sophomore and i am majoring in business. i am a very outgoing person and i love to meet new people. i am a very social person and i love to party. i am a very outgoing person and i love to meet new people. i am a very social person and i love to party.
--------------------------------------------------------------------------
Prompt: My name is 
Output:  Kyle and I am a 20 year old college student. I am a huge fan of the outdoors and love to hike, camp, and fish. I am a very active person and love to stay busy. I am a very outgoing person and love to meet new people. I am a very easy going person and love to have fun. I am a very hard worker and love to work. I am a very trustworthy person and love to help people. I am a very caring person and love to help people. I am a very respectful person and love to respect others. I am a

chenqianfzh avatar May 24 '24 06:05 chenqianfzh

@chenqianfzh example is fine, but we need an automated pytest test to run in CI to prevent regressions.

Yard1 avatar May 24 '24 17:05 Yard1
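For reference, a minimal end-to-end pytest along these lines might look like the following (the model name, keyword arguments, and expected output are placeholders, not the test that was actually added to the PR):

```python
import pytest
from vllm import LLM, SamplingParams

@pytest.mark.parametrize("adapter", [None, "timdettmers/qlora-flan-7b"])
def test_qlora_end_to_end(adapter):
    # Placeholder sketch: load a small bitsandbytes-quantized model from the
    # Hugging Face Hub, optionally with a QLoRA adapter, and check that
    # greedy generation produces a sane answer.
    llm = LLM(
        model="huggyllama/llama-7b",
        quantization="bitsandbytes",
        qlora_adapter_name_or_path=adapter or "",
    )
    params = SamplingParams(temperature=0.0, max_tokens=8)
    text = llm.generate(["The capital of France is"], params)[0].outputs[0].text
    assert "Paris" in text
```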

@chenqianfzh Can we add more quantization type examples in qlora_example.py, such as GPT+LoRA, so that users can refer to this script to learn how to utilize LoRA on quantized models? Thanks

jeejeelee avatar May 25 '24 02:05 jeejeelee

@jeejeelee @Yard1 @mgoin

I have updated the PR, addressing and resolving all the comments. Additionally, I have added the necessary unit tests. Could you please review it again?

However, I was unable to "add more quantization type examples in qlora_example.py" at this time. Currently, Llama is the only model supported in this PR. Expanding support to more models is my next task. As suggested by @jeejeelee, GPT will likely be the next model to support, given the availability of several GPT-based QLoRA models, such as https://huggingface.co/vineetsharma/qlora-gpt-neox-20b-english_quotes/

It's important to note that adding support for GPT may require additional effort because vLLM does not currently support LoRA for GPT-NeoX. Therefore, I believe it would be more appropriate to address this in a separate PR.

Thank you again for all the great suggestions.

chenqianfzh avatar May 28 '24 07:05 chenqianfzh

@jeejeelee @Yard1 @mgoin

I have updated the PR, addressing and resolving all the comments. Additionally, I have added the necessary unit tests. Could you please review it again?

However, I was unable to "add more quantization type examples in qlora_example.py" at this time. Currently, Llama is the only model supported in this PR. Expanding support to more models is my next task. As suggested by @jeejeelee, GPT will likely be the next model to support, given the availability of several GPT-based QLoRA models, such as https://huggingface.co/vineetsharma/qlora-gpt-neox-20b-english_quotes/

It's important to note that adding support for GPT may require additional effort because vLLM does not currently support LoRA for GPT-NeoX. Therefore, I believe it would be more appropriate to address this in a separate PR.

Thank you again for all the great suggestions.

My apologies, there was an issue with my spelling. What I actually meant to say was GPTQ+LORA. I'm very sorry.

jeejeelee avatar May 28 '24 15:05 jeejeelee

@jeejeelee @Yard1 @mgoin I have updated the PR, addressing and resolving all the comments. Additionally, I have added the necessary unit tests. Could u please review it again? However, I was unable to "add more quantization type examples in qlora_example.py" at this time. Currently, Llama is the only model supported in this MR. Expanding support to more models is my next task. As suggested by @jeejeelee, GPT will likely be the next model to support, given the availability of several GPT-based Qlora models, such as https://huggingface.co/vineetsharma/qlora-gpt-neox-20b-english_quotes/ It's important to note that adding support for GPT may require additional effort because VLLM does not currently support LoRA in GPT-Neox. Therefore, I believe it would be more appropriate to address this in a separate PR. Thank you again for all the great suggestions.

My apologies, there was an issue with my spelling. What I actually meant to say was GPTQ+LORA. I'm very sorry.

@jeejeelee I see. Yeah, I can do that.

I see you are the one who added quantization support to LoRA. Could you suggest which models I should use to demonstrate GPTQ+LoRA?

chenqianfzh avatar May 28 '24 17:05 chenqianfzh

@jeejeelee @Yard1 @mgoin I have updated the PR, addressing and resolving all the comments. Additionally, I have added the necessary unit tests. Could u please review it again? However, I was unable to "add more quantization type examples in qlora_example.py" at this time. Currently, Llama is the only model supported in this MR. Expanding support to more models is my next task. As suggested by @jeejeelee, GPT will likely be the next model to support, given the availability of several GPT-based Qlora models, such as https://huggingface.co/vineetsharma/qlora-gpt-neox-20b-english_quotes/ It's important to note that adding support for GPT may require additional effort because VLLM does not currently support LoRA in GPT-Neox. Therefore, I believe it would be more appropriate to address this in a separate PR. Thank you again for all the great suggestions.

My apologies, there was an issue with my spelling. What I actually meant to say was GPTQ+LORA. I'm very sorry.

@jeejeelee I see. Yeah, I can do that.

I see you are the one who added quantization support to LoRA. Could you suggest which models I should use to demonstrate GPTQ+LoRA?

Thanks, you can refer to test_quant_model.py

jeejeelee avatar May 28 '24 17:05 jeejeelee

Thanks! In the future, please avoid force pushing as it makes reviews harder.

Yard1 avatar May 28 '24 17:05 Yard1

Thanks! In the future, please avoid force pushing as it makes reviews harder.

I will try to. However, as the main branch is moving really fast, I need to rebase from time to time, and I have to force-push after rebasing.

Do you mean that I should keep the changes based on comments as a separate commit?

chenqianfzh avatar May 28 '24 18:05 chenqianfzh

I recommend merging the main branch instead of rebasing. That way we have the entire commit history on the PR. You don't need to make any changes here, just an ask for the future!

Yard1 avatar May 28 '24 19:05 Yard1

@jeejeelee @Yard1

The PR was updated per your comments. Thanks for reviewing it. Could you take another look?

There are some CI failures, but it looks like someone is already fixing them.

I tried to merge instead of rebase. However, I see an extra merge commit in my private branch. Any idea what has gone wrong?

Thanks.

chenqianfzh avatar May 29 '24 06:05 chenqianfzh

@chenqianfzh the merge commit is expected, that's just how git works

Yard1 avatar May 29 '24 17:05 Yard1

@chenqianfzh the merge commit is expected, that's just how git works

I did something wrong when squashing commits before merging, so the commits are mixed up. Sorry for making your review more difficult. :-(

chenqianfzh avatar May 29 '24 17:05 chenqianfzh

Thanks, left two last nits! We can merge after those are resolved.

I've updated the code based on your feedback, except for one comment, for which I've provided an explanation instead. Could you please take a look?

thanks.

chenqianfzh avatar May 30 '24 06:05 chenqianfzh

@Yard1 I kept trying the CI tests over the past two days but hit all kinds of weird errors; the latest failure is due to a missing container in the AMD tests.

I did not find a way to restart the specific tests. Could you let me know what to do? Thanks.

chenqianfzh avatar May 31 '24 20:05 chenqianfzh

It's OK, we'll just have a maintainer force merge it. Can you resolve https://github.com/vllm-project/vllm/pull/4776#discussion_r1619635289 and I will accept

Yard1 avatar May 31 '24 22:05 Yard1

@mgoin Thanks for reviewing the PR!

I updated the code per your comments. Could you have another look?

chenqianfzh avatar Jun 01 '24 06:06 chenqianfzh

Hey, thanks for the feature.

Why do you make sure 'lm_head' is not quantized in your tests, while peft accepts 'lm_head' among the target_modules? I was trying to run inference for a model fine-tuned with QLoRA and I got the following error:

File "/opt/conda/envs/llm/lib/python3.9/site-packages/vllm/model_executor/models/llama.py", line 436, in load_weights
    param = params_dict[name]
KeyError: 'lm_head.qweight'

sajadn avatar Jun 13 '24 21:06 sajadn