[Feature][kernel] tensor parallelism with bitsandbytes quantization
This PR adds tensor parallelism support for bitsandbytes quantization.
It is verified on Llama 2 and Llama 3 models that the generated texts are identical to those produced without TP.
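For illustration, after this change a BnB-quantized model can be loaded with tensor_parallel_size > 1. A minimal sketch of the kind of usage that was verified (the model name and prompt are illustrative, not the exact test setup):

```python
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
params = SamplingParams(temperature=0.0, max_tokens=32)

# In-flight bitsandbytes 4-bit quantization with tensor parallelism enabled.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # illustrative model
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    tensor_parallel_size=2,
    enforce_eager=True,
)
outputs = llm.generate(prompts, params)
# The claim in this PR is that these texts match the TP=1 output under the same settings.
print(outputs[0].outputs[0].text)
```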
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can do one of these:
- Add the ready label to the PR
- Enable auto-merge
🚀
@mgoin @jeejeelee Could you help take a look at this PR, which adds TP to BnB?
I wonder whether you can give me a hand with the test "test_load_tp_4bit_bnb_model" that I added in test_bitsandbytes.py. I have been working on it for several days and it consistently times out with the following error:
FAILED tests/quantization/test_bitsandbytes.py::test_load_tp_4bit_bnb_model[huggyllama/llama-7b-quantize model inflight] - torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/2 clients joined.
Could you shed some light on this? Thanks!
@youkaichao could you please look at this error?
Where is the error? Does it show up in the CI?
Thanks for the help!
The error is:
FAILED tests/quantization/test_bitsandbytes.py::test_load_tp_4bit_bnb_model[huggyllama/llama-7b-quantize model inflight] - torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/2 clients joined.
The test case test_load_tp_4bit_bnb_model is in test_bitsandbytes.py in this PR; I have commented it out for now.
I think you need https://github.com/vllm-project/vllm/pull/8449.
After #8449 is merged, please merge main and also add @fork_new_process_for_each_test to your test function.
@youkaichao Woohoo! Thanks a million!
Could you educate me on why @fork_new_process_for_each_test made a difference?
If you want to use TP with vLLM, you need a clean process in which CUDA has not yet been initialized.
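For context, a rough sketch of how the decorator is applied (the actual test in this PR differs; the import path, parametrization, and model here are assumptions):

```python
import pytest
from vllm import LLM

# Assumed import path; in the vLLM repo the helper lives in tests/utils.py.
from tests.utils import fork_new_process_for_each_test


@pytest.mark.parametrize("model_name", ["huggyllama/llama-7b"])
@fork_new_process_for_each_test
def test_load_tp_4bit_bnb_model(model_name):
    # The decorator forks a fresh process per test, so CUDA has not been
    # initialized yet when vLLM sets up tensor parallelism.
    llm = LLM(
        model=model_name,
        quantization="bitsandbytes",
        load_format="bitsandbytes",
        tensor_parallel_size=2,
        enforce_eager=True,
    )
    assert llm.generate("Hello, my name is")
```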
@jeejeelee @mgoin I updated the PR based on youkaichao's fix. Could you take another look?
@mgoin Updated per your comments. Please review again. Thanks.
I tested this PR with a trained QLoRA adapter and I am getting this error:
KeyError: 'lm_head.qweight'
Might this be due to only checking for certain adapter weights? EDIT: no ;) it loads adapter_config.json and checks the target modules. It may be that 'lm_head' is simply no longer a weight that gets saved.
Thanks for the hard work, this feature is really important! What is the current status of this feature? I have a 405B model, merged with LoRA and quantized to INT8 using BnB. Is there a PR that enables TP=8 inference for this model?
@junzhang-zj lol I have exactly the same use case ;p
@jvlinsta Yes, the hf version is slow to evaluate (/(ㄒoㄒ)/~~
Hi @junzhang-zj, this PR will not support TP for prequantized models; see this comment: https://github.com/vllm-project/vllm/pull/8434#discussion_r1759977041
You should be able to use PP=8, though, to distribute the model layers across all your GPUs. Alternatively, you could quantize your model with llmcompressor to FP8 or INT8, for which we support performant activation quantization with TP: https://docs.vllm.ai/en/latest/quantization/fp8.html
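For reference, the FP8 dynamic quantization flow from the linked docs looks roughly like this (adapted sketch; the llmcompressor API may change over time, so treat names such as SparseAutoModelForCausalLM as assumptions and check the docs):

```python
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-405B-Instruct"  # placeholder for your merged model

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights with dynamic per-token activation scales; lm_head stays unquantized.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```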
@mgoin Thanks, LGTM! I will try it. The only devices available to me are A100s, so it seems I can only try PP=8.
Yeah, so it seems (re: the 'lm_head.qweight' KeyError above, #8434 (comment)):
```python
# change so that it works with other models as well
default_target_modules = [
    "gate_proj", "down_proj", "up_proj",
    "q_proj", "k_proj", "v_proj", "o_proj"
]
```
@jvlinsta I suspect it is an issue specific to some models, as I did not repro this issue with any of my pre-quantized test models.
Could you share your model and the params of your command line? I will take a close look.
@chenqianfzh I am stuck loading the BnB-quantized 405B INT8 model with PP=8 via vLLM:
lm_eval --model vllm \
--model_args pretrained=$merge_path,pipeline_parallel_size=$parallelize,dtype=bfloat16,gpu_memory_utilization=0.80,data_parallel_size=1,quantization='bitsandbytes',load_format='bitsandbytes' \
--tasks arc_challenge,piqa,hellaswag,arc_easy,winogrande,openbookqa \
--batch_size auto --num_fewshot 1 --log_samples \
--output_path $path/CSR \
--use_cache $path/CSR > $path/CSR/eval_1shot.log 2>&1
Loading safetensors checkpoint shards: 100% Completed | 86/86 [10:26<00:00, 7.28s/it]
(VllmWorkerProcess pid=95795) ERROR 09-17 16:34:25 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: Parameter model.layers.0.self_attn.qkv_proj.qweight not found in the model., Traceback (most recent call last):
from "pretrained=$merge_path", I cannot figure out the model that you are using. Is it a public model? please share the model name.
Could you take a look at the weight param names? In vanilla llama models, there are weights of 'q_proj', 'k_proj', 'v_proj', but there is no 'qkv_proj'. 'qkv_proj' is created by vllm internally via combining the loaded weights of 'q_proj', 'k_proj', 'v_proj'.
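For illustration, the fused-projection mapping in vLLM's Llama implementation looks roughly like this (paraphrased from vllm/model_executor/models/llama.py; the exact code may differ):

```python
# Checkpoint tensors named q_proj/k_proj/v_proj (and gate_proj/up_proj) are
# copied into the fused qkv_proj (and gate_up_proj) parameters at load time,
# so the vLLM model itself only exposes the fused names.
stacked_params_mapping = [
    # (fused param name, checkpoint weight name, shard id)
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]
```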
@chenqianfzh The model is llama-3.1-405b int8.
Could you point me to the model path? Searching "llama-3.1-405b int8" did not pinpoint a specific model.
@chenqianfzh Sorry for my imprecise wording. I used load_in_8bit to quantize llama-3.1-405b in HF and saved it, then used vLLM to run inference on it. Do I have any alternatives, such as using the original weights and then quantizing and running inference directly in vLLM?
Target: infer a 405B model merged with LoRA on 8x A100-80G.
Accessible models: merged 405B BF16; merged 405B INT8 (saved from HF using load_in_8bit).
Error:
[rank0]: File "/home/notebook/code/personal/vllm/vllm/worker/worker.py", line 183, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/home/notebook/code/personal/vllm/vllm/worker/model_runner.py", line 1016, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: File "/home/notebook/code/personal/vllm/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: File "/home/notebook/code/personal/vllm/vllm/model_executor/model_loader/loader.py", line 1077, in load_model
[rank0]: self._load_weights(model_config, model)
[rank0]: File "/home/notebook/code/personal/vllm/vllm/model_executor/model_loader/loader.py", line 1031, in _load_weights
[rank0]: raise ValueError(
[rank0]: ValueError: Parameter model.layers.15.self_attn.qkv_proj.qweight not found in the model.
INT8 405 Model Config:
{
"_name_or_path": "",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128001,
"hidden_act": "silu",
"hidden_size": 16384,
"initializer_range": 0.02,
"intermediate_size": 53248,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 128,
"num_hidden_layers": 126,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"quantization_config": {
"_load_in_4bit": false,
"_load_in_8bit": true,
"bnb_4bit_compute_dtype": "float32",
"bnb_4bit_quant_storage": "uint8",
"bnb_4bit_quant_type": "fp4",
"bnb_4bit_use_double_quant": false,
"llm_int8_enable_fp32_cpu_offload": false,
"llm_int8_has_fp16_weight": false,
"llm_int8_skip_modules": null,
"llm_int8_threshold": 6.0,
"load_in_4bit": false,
"load_in_8bit": true,
"quant_method": "bitsandbytes"
},
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 8.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.44.2",
"use_cache": true,
"vocab_size": 128256
}
@chenqianfzh I suspect there is an error in processing the weight names. Here are the weight names as loaded in HF; I am still checking the names in vLLM.
'model.layers.0.self_attn.q_proj.weight', 'model.layers.0.self_attn.q_proj.SCB', 'model.layers.0.self_attn.q_proj.weight_format', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.0.self_attn.k_proj.SCB', 'model.layers.0.self_attn.k_proj.weight_format', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.0.self_attn.v_proj.SCB', 'model.layers.0.self_attn.v_proj.weight_format', 'model.layers.0.self_attn.o_proj.weight', 'model.layers.0.self_attn.o_proj.SCB', 'model.layers.0.self_attn.o_proj.weight_format', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.0.mlp.gate_proj.SCB', 'model.layers.0.mlp.gate_proj.weight_format', 'model.layers.0.mlp.up_proj.weight', 'model.layers.0.mlp.up_proj.SCB', 'model.layers.0.mlp.up_proj.weight_format', 'model.layers.0.mlp.down_proj.weight', 'model.layers.0.mlp.down_proj.SCB', 'model.layers.0.mlp.down_proj.weight_format', 'model.layers.0.input_layernorm.weight', 'model.layers.0.post_attention_layernorm.weight'
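If it helps with the comparison, here is a quick way to dump the parameter names stored in a checkpoint shard (the file name is a placeholder for one of your saved safetensors shards):

```python
from safetensors import safe_open

# List every tensor name in one shard; for a bnb INT8 export you should see
# entries like '...q_proj.weight', '...q_proj.SCB', and '...q_proj.weight_format'.
with safe_open("model-00001-of-00086.safetensors", framework="pt") as f:
    for name in sorted(f.keys()):
        print(name)
```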
Do you have to use 8-bit? If you are OK with 4-bit quantization, in-flight quantization is supported. You can try with these params:
load_format="bitsandbytes", quantization="bitsandbytes", enforce_eager=True, tensor_parallel_size=8
The downside is that model loading will be slow, and the in-flight quantized weights are gone when you shut down.
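A minimal sketch of that setup with the offline LLM API (the model name is a placeholder for your merged BF16 checkpoint):

```python
from vllm import LLM, SamplingParams

# In-flight 4-bit bitsandbytes quantization of the BF16 base weights, TP=8.
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # placeholder
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    enforce_eager=True,
    tensor_parallel_size=8,
)
out = llm.generate(["Hello"], SamplingParams(temperature=0.0, max_tokens=16))
print(out[0].outputs[0].text)
```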
If you need to stick with the 8-bit prequantized model, I wonder whether you could share it (for example, upload it to your HF repo)? Alternatively, you can compare your weight modules with meta-llama/Llama-Guard-3-8B-INT8, which works fine in my tests.
Please let me know what the best choice is for you. I will try my very best to help. :-)
@chenqianfzh I should stick with INT8, because the performance of the 405B model is very sensitive to quantization.
I checked the loaded weights (Figure 1, quant_state_dict) and they are all normal, but the result of model initialization (model.load_weights(qweight_iterator)) looks very strange, as shown in Figure 2.
I will try to upload my pre-quantized model to HF, but it may take time because of the large number of parameters.
Hi @chenqianfzh, those are LoRA adapters trained on proprietary data, so I cannot share them :/ In any case, I resolved it by removing 'lm_head' from my adapter_config.json, and the adapters seem to be working fine now :)
Possibly a related thread: https://github.com/artidoro/qlora/issues/13. I also train with FSDP (with DeepSpeed under the hood) via accelerate, so maybe that is why.
Anyway, no issue with your PR ;) It is more likely something other folks will run into, and then they can find the resolution here ^^
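A small sketch of that workaround for anyone hitting the same KeyError (the adapter path is a placeholder):

```python
import json

ADAPTER_CONFIG = "my-qlora-adapter/adapter_config.json"  # placeholder path

with open(ADAPTER_CONFIG) as f:
    cfg = json.load(f)

# Drop 'lm_head' so vLLM does not look for an 'lm_head.qweight' tensor that
# was never saved with the adapter.
cfg["target_modules"] = [m for m in cfg.get("target_modules", []) if m != "lm_head"]

with open(ADAPTER_CONFIG, "w") as f:
    json.dump(cfg, f, indent=2)
```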
@chenqianfzh I have pushed it to HF; the model name is zayjean/405B-INT8-LoRAM. Can you test it in your environment with your code? I really hope to debug this problem with your help.
Is there any update on this issue? :'(
Thanks for sharing the model.
It is such a huge model that I could not finish weight tensor initialization even on my most powerful machine, with maximum weight CPU offload.
I took a look at the config files in your shared repo; they look fine.
From the screenshot you shared, your model also looks fine: there is no 'qkv_proj' weight module, only separate 'q_proj', 'k_proj', and 'v_proj', which is correct.
Did you try some smaller Llama 3 models? If yes, did you see the same error?
@chenqianfzh Yes, it seems that at least 8 A100-80G are needed to evaluate this model. I have not tested a smaller Llama 3 model yet, because my devices are busy running the HF version of this evaluation, which is a higher-priority experiment for me. When I have free time, I will try Llama 3 70B. I am not sure whether your environment is enough for that evaluation.