[Feature][kernel] tensor parallelism with bitsandbytes quantization
This PR adds tensor parallelism support for bitsandbytes quantization.
It is verified on Llama 2 and Llama 3 models that the generated texts are identical to those produced without TP.
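For illustration, after this change a BnB-quantized model can be loaded with tensor_parallel_size > 1. A minimal sketch of the kind of usage that was verified (the model name and prompt are illustrative, not the exact test setup):

```python
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
params = SamplingParams(temperature=0.0, max_tokens=32)

# In-flight bitsandbytes 4-bit quantization with tensor parallelism enabled.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # illustrative model
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    tensor_parallel_size=2,
    enforce_eager=True,
)
outputs = llm.generate(prompts, params)
# The claim in this PR is that these texts match the TP=1 output under the same settings.
print(outputs[0].outputs[0].text)
```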
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can do one of these:
- Add the ready label to the PR
- Enable auto-merge
🚀
@mgoin @jeejeelee Could you help take a look at this PR, which adds TP to BnB?
I wonder whether you can give me a hand with the test "test_load_tp_4bit_bnb_model" that I added in test_bitsandbytes.py. I have been working on it for several days and it consistently times out with the following error:
FAILED tests/quantization/test_bitsandbytes.py::test_load_tp_4bit_bnb_model[huggyllama/llama-7b-quantize model inflight] - torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/2 clients joined.
Could you shed some light on this? Thanks!
@youkaichao could you please look at this error?
Where is the error? Does it show up in the CI?
Thanks for the help!
The error is:
FAILED tests/quantization/test_bitsandbytes.py::test_load_tp_4bit_bnb_model[huggyllama/llama-7b-quantize model inflight] - torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/2 clients joined.
The test case test_load_tp_4bit_bnb_model is in test_bitsandbytes.py in this PR; I have commented it out for now.
I think you need https://github.com/vllm-project/vllm/pull/8449.
After #8449 is merged, please merge main and also add @fork_new_process_for_each_test to your test function.
@youkaichao Woohoo! Thanks a million!
Could you educate me on why @fork_new_process_for_each_test made a difference?
If you want to use TP with vLLM, you need a clean process in which CUDA has not yet been initialized.
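For context, a rough sketch of how the decorator is applied (the actual test in this PR differs; the import path, parametrization, and model here are assumptions):

```python
import pytest
from vllm import LLM

# Assumed import path; in the vLLM repo the helper lives in tests/utils.py.
from tests.utils import fork_new_process_for_each_test


@pytest.mark.parametrize("model_name", ["huggyllama/llama-7b"])
@fork_new_process_for_each_test
def test_load_tp_4bit_bnb_model(model_name):
    # The decorator forks a fresh process per test, so CUDA has not been
    # initialized yet when vLLM sets up tensor parallelism.
    llm = LLM(
        model=model_name,
        quantization="bitsandbytes",
        load_format="bitsandbytes",
        tensor_parallel_size=2,
        enforce_eager=True,
    )
    assert llm.generate("Hello, my name is")
```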
@jeejeelee @mgoin I updated the PR based on youkaichao's fix. Could you take another look?
@mgoin Updated per your comments. Please review again. Thanks.
I tested this PR with a trained QLoRA adapter and I am getting this error:
KeyError: 'lm_head.qweight'
Might this be due to only checking for certain adapter weights? EDIT: no ;) it loads adapter_config.json and checks the target modules. It may be that 'lm_head' is simply no longer a weight that gets saved.
Thanks for the hard work, this feature is really important! What is the current status of this feature? I have a 405B model, merged with LoRA and quantized to INT8 using BnB. Is there a PR that enables TP=8 inference for this model?
@junzhang-zj lol I have exactly the same use case ;p
@jvlinsta Yes, the hf version is slow to evaluate (/(ㄒoㄒ)/~~
Hi @junzhang-zj, this PR will not support TP for prequantized models; see this comment: https://github.com/vllm-project/vllm/pull/8434#discussion_r1759977041
You should be able to use PP=8, though, to distribute the model layers across all your GPUs. Alternatively, you could quantize your model with llmcompressor to FP8 or INT8, for which we support performant activation quantization with TP: https://docs.vllm.ai/en/latest/quantization/fp8.html
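For reference, the FP8 dynamic quantization flow from the linked docs looks roughly like this (adapted sketch; the llmcompressor API may change over time, so treat names such as SparseAutoModelForCausalLM as assumptions and check the docs):

```python
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-405B-Instruct"  # placeholder for your merged model

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights with dynamic per-token activation scales; lm_head stays unquantized.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```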
@mgoin Thanks, LGTM! I will try it. The only devices available to me are A100s, so it seems I can only try PP=8.
Yeah, so it seems (re: the 'lm_head.qweight' KeyError above, #8434 (comment)):
```python
# change so that it works with other models as well
default_target_modules = [
    "gate_proj", "down_proj", "up_proj",
    "q_proj", "k_proj", "v_proj", "o_proj"
]
```
@jvlinsta I suspect it is an issue specific to some models, as I did not repro this issue with any of my pre-quantized test models.
Could you share your model and the params of your command line? I will take a close look.
@chenqianfzh I am stuck loading the BnB-quantized 405B INT8 model with PP=8 via vLLM:
lm_eval --model vllm \
--model_args pretrained=$merge_path,pipeline_parallel_size=$parallelize,dtype=bfloat16,gpu_memory_utilization=0.80,data_parallel_size=1,quantization='bitsandbytes',load_format='bitsandbytes' \
--tasks arc_challenge,piqa,hellaswag,arc_easy,winogrande,openbookqa \
--batch_size auto --num_fewshot 1 --log_samples \
--output_path $path/CSR \
--use_cache $path/CSR > $path/CSR/eval_1shot.log 2>&1
Loading safetensors checkpoint shards: 100% Completed | 86/86 [10:26<00:00, 7.28s/it]
(VllmWorkerProcess pid=95795) ERROR 09-17 16:34:25 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: Parameter model.layers.0.self_attn.qkv_proj.qweight not found in the model., Traceback (most recent call last):
from "pretrained=$merge_path", I cannot figure out the model that you are using. Is it a public model? please share the model name.
Could you take a look at the weight param names? In vanilla llama models, there are weights of 'q_proj', 'k_proj', 'v_proj', but there is no 'qkv_proj'. 'qkv_proj' is created by vllm internally via combining the loaded weights of 'q_proj', 'k_proj', 'v_proj'.
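For illustration, the fused-projection mapping in vLLM's Llama implementation looks roughly like this (paraphrased from vllm/model_executor/models/llama.py; the exact code may differ):

```python
# Checkpoint tensors named q_proj/k_proj/v_proj (and gate_proj/up_proj) are
# copied into the fused qkv_proj (and gate_up_proj) parameters at load time,
# so the vLLM model itself only exposes the fused names.
stacked_params_mapping = [
    # (fused param name, checkpoint weight name, shard id)
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]
```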
@chenqianfzh The model is llama-3.1-405b int8.
Could you point me to the model path? Searching "llama-3.1-405b int8" did not pinpoint a specific model.
@chenqianfzh Sorry for my imprecise wording. I used load_in_8bit to quantize llama-3.1-405b in HF and saved it, then used vLLM to run inference on it. Do I have any alternatives, such as using the original weights and then quantizing and running inference directly in vLLM?
Target: infer a 405B model merged with LoRA on 8x A100-80G.
Accessible models: merged 405B BF16; merged 405B INT8 (saved from HF using load_in_8bit).
Error:
[rank0]: File "/home/notebook/code/personal/vllm/vllm/worker/worker.py", line 183, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/home/notebook/code/personal/vllm/vllm/worker/model_runner.py", line 1016, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: File "/home/notebook/code/personal/vllm/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: File "/home/notebook/code/personal/vllm/vllm/model_executor/model_loader/loader.py", line 1077, in load_model
[rank0]: self._load_weights(model_config, model)
[rank0]: File "/home/notebook/code/personal/vllm/vllm/model_executor/model_loader/loader.py", line 1031, in _load_weights
[rank0]: raise ValueError(
[rank0]: ValueError: Parameter model.layers.15.self_attn.qkv_proj.qweight not found in the model.
INT8 405 Model Config:
{
"_name_or_path": "",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128001,
"hidden_act": "silu",
"hidden_size": 16384,
"initializer_range": 0.02,
"intermediate_size": 53248,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 128,
"num_hidden_layers": 126,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"quantization_config": {
"_load_in_4bit": false,
"_load_in_8bit": true,
"bnb_4bit_compute_dtype": "float32",
"bnb_4bit_quant_storage": "uint8",
"bnb_4bit_quant_type": "fp4",
"bnb_4bit_use_double_quant": false,
"llm_int8_enable_fp32_cpu_offload": false,
"llm_int8_has_fp16_weight": false,
"llm_int8_skip_modules": null,
"llm_int8_threshold": 6.0,
"load_in_4bit": false,
"load_in_8bit": true,
"quant_method": "bitsandbytes"
},
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 8.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.44.2",
"use_cache": true,
"vocab_size": 128256
}
@chenqianfzh I suspect there is an error in processing the weight names. Here are the weight names as loaded in HF; I am still checking the names in vLLM.
'model.layers.0.self_attn.q_proj.weight', 'model.layers.0.self_attn.q_proj.SCB', 'model.layers.0.self_attn.q_proj.weight_format', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.0.self_attn.k_proj.SCB', 'model.layers.0.self_attn.k_proj.weight_format', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.0.self_attn.v_proj.SCB', 'model.layers.0.self_attn.v_proj.weight_format', 'model.layers.0.self_attn.o_proj.weight', 'model.layers.0.self_attn.o_proj.SCB', 'model.layers.0.self_attn.o_proj.weight_format', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.0.mlp.gate_proj.SCB', 'model.layers.0.mlp.gate_proj.weight_format', 'model.layers.0.mlp.up_proj.weight', 'model.layers.0.mlp.up_proj.SCB', 'model.layers.0.mlp.up_proj.weight_format', 'model.layers.0.mlp.down_proj.weight', 'model.layers.0.mlp.down_proj.SCB', 'model.layers.0.mlp.down_proj.weight_format', 'model.layers.0.input_layernorm.weight', 'model.layers.0.post_attention_layernorm.weight'
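If it helps with the comparison, here is a quick way to dump the parameter names stored in a checkpoint shard (the file name is a placeholder for one of your saved safetensors shards):

```python
from safetensors import safe_open

# List every tensor name in one shard; for a bnb INT8 export you should see
# entries like '...q_proj.weight', '...q_proj.SCB', and '...q_proj.weight_format'.
with safe_open("model-00001-of-00086.safetensors", framework="pt") as f:
    for name in sorted(f.keys()):
        print(name)
```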
Do you have to use 8-bit? If you are OK with 4-bit quantization, in-flight quantization is supported. You can try with these params:
load_format="bitsandbytes", quantization="bitsandbytes", enforce_eager=True, tensor_parallel_size=8
The downside is that model loading will be slow, and the in-flight quantized weights are gone when you shut down.
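A minimal sketch of that setup with the offline LLM API (the model name is a placeholder for your merged BF16 checkpoint):

```python
from vllm import LLM, SamplingParams

# In-flight 4-bit bitsandbytes quantization of the BF16 base weights, TP=8.
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # placeholder
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    enforce_eager=True,
    tensor_parallel_size=8,
)
out = llm.generate(["Hello"], SamplingParams(temperature=0.0, max_tokens=16))
print(out[0].outputs[0].text)
```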
If you need to stick with the 8-bit prequantized model, I wonder whether you could share it (for example, upload it to your HF repo)? Alternatively, you can compare your weight modules with meta-llama/Llama-Guard-3-8B-INT8, which works fine in my tests.
Please let me know what the best choice is for you. I will try my very best to help. :-)
@chenqianfzh I should stick with INT8, because the performance of the 405B model is very sensitive to quantization.
I checked the loaded weights (Figure 1, quant_state_dict) and they are all normal, but the result of model initialization (model.load_weights(qweight_iterator)) looks very strange, as shown in Figure 2.
I will try to upload my pre-quantized model to HF, but it may take time because of the large number of parameters.
Hi @chenqianfzh, those are LoRA adapters trained on proprietary data, so I cannot share them :/ In any case, I resolved it by removing 'lm_head' from my adapter_config.json, and the adapters seem to be working fine now :)
Possibly a related thread: https://github.com/artidoro/qlora/issues/13. I also train with FSDP (with DeepSpeed under the hood) via accelerate, so maybe that is why.
Anyway, no issue with your PR ;) It is more likely something other folks will run into, and then they can find the resolution here ^^
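A small sketch of that workaround for anyone hitting the same KeyError (the adapter path is a placeholder):

```python
import json

ADAPTER_CONFIG = "my-qlora-adapter/adapter_config.json"  # placeholder path

with open(ADAPTER_CONFIG) as f:
    cfg = json.load(f)

# Drop 'lm_head' so vLLM does not look for an 'lm_head.qweight' tensor that
# was never saved with the adapter.
cfg["target_modules"] = [m for m in cfg.get("target_modules", []) if m != "lm_head"]

with open(ADAPTER_CONFIG, "w") as f:
    json.dump(cfg, f, indent=2)
```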
@chenqianfzh I have pushed it to HF; the model name is zayjean/405B-INT8-LoRAM. Can you test it in your environment with your code? I really hope to debug this problem with your help.
Is there any update on this issue? :'(
Thanks for sharing the model.
It is such a huge model that I could not finish weight tensor initialization even on my most powerful machine, with maximum weight CPU offload.
I took a look at the config files in your shared repo; they look fine.
From the screenshot you shared, your model also looks fine: there is no 'qkv_proj' weight module, only separate 'q_proj', 'k_proj', and 'v_proj', which is correct.
Did you try some smaller Llama 3 models? If yes, did you see the same error?
@chenqianfzh Yes, it seems that at least 8 A100-80G are needed to evaluate this model. I have not tested a smaller Llama 3 model yet, because my devices are busy running the HF version of this evaluation, which is a higher-priority experiment for me. When I have free time, I will try Llama 3 70B. I am not sure whether your environment is enough for that evaluation.