
Support W8A8 inference in vllm

Open AniZpZ opened this issue 2 years ago • 53 comments

We have implemented W8A8 inference in vLLM, which can achieve a 30% improvement in throughput. W4A16 quantization methods require weights to be dequantized into fp16 before computation, which leads to a throughput drop under heavier load. This PR is part of https://github.com/vllm-project/vllm/pull/1112. We have split the huge PR into two independent parts for easier review. The usage of W8A8 inference is simple (only Llama is supported for now):

python ./vllm/entrypoints/api_server.py --model=/path/to/quantized/model --tokenizer=/path/to/tokenizer --max-num-batched-tokens=70000 --block-size=16 --swap-space=20 --quantization smoothquant

Updates!!! We have updated the quantization method to per-token quant for o_proj and down_proj of Llama. Please use the latest llama-dev branch of smoothquant and the per_token_quant branch of torch-int to generate the int8 model!!!

You can find more details, such as how to generate int8 weights, in the original PR https://github.com/vllm-project/vllm/pull/1112. You can combine this method with int8 KV cache quant https://github.com/vllm-project/vllm/pull/1507 for the best throughput.
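
Once the server is up you can query it from a client. Below is a minimal sketch of such a call, assuming the demo api_server exposes its usual /generate endpoint on the default port; the host, port, prompt, and sampling parameters are placeholders you should adapt to your setup:

import requests

# Placeholder host/port; point this at wherever api_server.py is listening.
API_URL = "http://localhost:8000/generate"

payload = {
    "prompt": "San Francisco is a",  # example prompt
    "max_tokens": 64,                # sampling parameters are forwarded to vLLM
    "temperature": 0.0,
    "stream": False,
}

response = requests.post(API_URL, json=payload)
response.raise_for_status()
print(response.json())  # the demo server returns the generated text as JSON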

AniZpZ avatar Oct 30 '23 13:10 AniZpZ

We have implemented W8A8 inference in vLLM, which can achieve a 30% improvement in throughput. W4A16 quantization methods require weights to be dequantized into fp16 before computation, which leads to a throughput drop under heavier load. This PR is part of #1112. We have split the huge PR into two independent parts for easier review. The usage of W8A8 inference is simple (only Llama is supported for now):

python ./vllm/entrypoints/api_server.py --model=/path/to/quantized/model --tokenizer=/path/to/tokenizer --max-num-batched-tokens=70000 --block-size=16 --swap-space=20 --quantization smoothquant

You can find more details, such as how to generate int8 weights, in the original PR #1112

Does this PR only support per-tensor quantization? Do you plan to support per-token and per-channel?

pangr avatar Oct 31 '23 03:10 pangr

We have implemented W8A8 inference in vLLM, which can achieve a 30% improvement in throughput. W4A16 quantization methods require weights to be dequantized into fp16 before computation, which leads to a throughput drop under heavier load. This PR is part of #1112. We have split the huge PR into two independent parts for easier review. The usage of W8A8 inference is simple (only Llama is supported for now):

python ./vllm/entrypoints/api_server.py --model=/path/to/quantized/model --tokenizer=/path/to/tokenizer --max-num-batched-tokens=70000 --block-size=16 --swap-space=20 --quantization smoothquant

You can find more details, such as how to generate int8 weights, in the original PR #1112

Does this PR only support per-tensor quantization? Do you plan to support per-token and per-channel?

Only per-tensor quantization is supported for now. We may support per-channel if we find it makes a big difference in model performance.
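
For readers unfamiliar with the terms, the difference between these granularities is just how many scale factors are kept. A minimal illustrative sketch (shapes and the symmetric int8 scheme are assumptions for illustration, not the exact code in this PR):

import torch

w = torch.randn(4096, 11008)   # weight [in_features, out_features], illustrative shape
x = torch.randn(16, 4096)      # activations [num_tokens, hidden_size]

# per-tensor: a single scale for the whole tensor (what this PR uses initially)
w_scale_per_tensor = w.abs().max() / 127

# per-channel: one scale per output channel of the weight (static, computed offline)
w_scale_per_channel = w.abs().amax(dim=0, keepdim=True) / 127   # shape [1, out_features]

# per-token: one scale per activation row, computed on the fly at inference time
x_scale_per_token = x.abs().amax(dim=1, keepdim=True) / 127     # shape [num_tokens, 1]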

AniZpZ avatar Oct 31 '23 09:10 AniZpZ

I downloaded your vllm w8a8 branch, but I ran into the error below. Should I add Int8LlamaForCausalLM in smoothquant?

ValueError: Model architectures ['Int8LlamaForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'FalconForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MPTForCausalLM', 'OPTForCausalLM', 'QWenLMHeadModel', 'RWForCausalLM']

chisnova avatar Nov 01 '23 05:11 chisnova

I downloaded your vllm w8a8 branch, but I ran into the error below. Should I add Int8LlamaForCausalLM in smoothquant?

ValueError: Model architectures ['Int8LlamaForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'FalconForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MPTForCausalLM', 'OPTForCausalLM', 'QWenLMHeadModel', 'RWForCausalLM']

Sorry, this is an oversight on our part: there is a naming difference between the smoothquant repo and the vllm repo. Please change the 'architectures' field in the quantized model's config.json from 'Int8LlamaForCausalLM' to 'LlamaForCausalLM'. We will fix this problem soon.
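
If you want to script that one-line rename instead of editing by hand, a minimal sketch (the model path is a placeholder):

import json
from pathlib import Path

config_path = Path("/path/to/quantized/model/config.json")  # placeholder path
config = json.loads(config_path.read_text())

# Rename the architecture so vLLM's model registry recognizes the checkpoint.
if config.get("architectures") == ["Int8LlamaForCausalLM"]:
    config["architectures"] = ["LlamaForCausalLM"]
    config_path.write_text(json.dumps(config, indent=2))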

AniZpZ avatar Nov 02 '23 03:11 AniZpZ

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this.

I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

chisnova avatar Nov 02 '23 05:11 chisnova

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this.

I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant https://github.com/vllm-project/vllm/pull/1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

AniZpZ avatar Nov 02 '23 12:11 AniZpZ

@AniZpZ Existing methods (AWQ, GPTQ) go down to 4-bit quantization, saving lots of memory. The speed improvements of 8-bit inference come during inference, which theoretically could be combined with AWQ.

Would it be possible to integrate AWQ or GPTQ to go from 4-bit quantization to 8-bit on-the-fly?

truenorth8 avatar Nov 02 '23 20:11 truenorth8

@AniZpZ hi, thanks for this. I want to try it, but could you give some simple quantization guides? I looked into the smoothquant repo, and it's hard for me to quantize a new model since I don't know how to use my own datasets. Previously I used AutoAWQ with a JSON-format dataset for quantization. Also, there are not many smoothquant models on Hugging Face (e.g. CodeLlama).

esmeetu avatar Nov 03 '23 00:11 esmeetu

@AniZpZ Existing methods (AWQ, GPTQ) go down to 4-bit quantization, saving lots of memory. The speed improvements of 8-bit inference come during inference, which theoretically could be combined with AWQ.

Would it be possible to integrate AWQ or GPTQ to go from 4-bit quantization to 8-bit on-the-fly?

The reason why GPTQ and AWQ cannot achieve a throughput improvement is that they are weight-only quantization. Weight-only quantization requires dequantizing the weights to fp16 (or something like bf16) and performing a 16-bit (or higher) GEMM during inference, so they cannot benefit from reduced computation.
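
To illustrate the difference, here is a rough sketch of the two dataflows in plain PyTorch (float32 stands in for fp16 and a plain matmul stands in for the real int8 GEMM kernel; shapes and scale layouts are assumptions for illustration only):

import torch

x = torch.randn(16, 4096)                                   # activations [tokens, hidden]
w_int8 = torch.randint(-128, 128, (4096, 4096), dtype=torch.int8)
w_scale = torch.rand(4096)                                  # per-output-channel weight scales

# Weight-only (W4A16/W8A16) path: dequantize the weight, then run a full-precision GEMM.
w_fp = w_int8.float() * w_scale                             # extra dequant work every forward pass
y_weight_only = x @ w_fp                                    # GEMM still runs at 16/32-bit precision

# W8A8 path: quantize the activations too, so the GEMM itself runs in int8.
x_scale = x.abs().amax(dim=-1, keepdim=True) / 127          # per-token activation scales
x_int8 = (x / x_scale).round().clamp(-128, 127).to(torch.int8)
acc = x_int8.to(torch.int32) @ w_int8.to(torch.int32)       # stand-in for an int8 GEMM (int32 accum)
y_w8a8 = acc.float() * x_scale * w_scale                    # dequantize the output once

Only the second path reduces the arithmetic inside the GEMM, which is where the throughput gain comes from.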

AniZpZ avatar Nov 03 '23 02:11 AniZpZ

@AniZpZ hi, thanks for this. I want to try it, but could you give some simple quantization guides? I looked into the smoothquant repo, and it's hard for me to quantize a new model since I don't know how to use my own datasets. Previously I used AutoAWQ with a JSON-format dataset for quantization. Also, there are not many smoothquant models on Hugging Face (e.g. CodeLlama).

First get a normal Llama-13B model. Then install smoothquant and torch-int for Llama. Use "examples/generate_act_scales.py" to generate the activation scales, and then use "examples/export_int8_llama.py" to export the int8 model. Please remember to check and change the 'architectures' field in the model's config.json from 'Int8LlamaForCausalLM' to 'LlamaForCausalLM'.

AniZpZ avatar Nov 03 '23 02:11 AniZpZ

@AniZpZ Thanks for your quick reply! generate_act_scales.py needs a datasets parameter, and the default dataset is not suitable for code models. How do I load a custom JSON-format dataset?

esmeetu avatar Nov 03 '23 04:11 esmeetu

@AniZpZ Thanks for your quick reply! generate_act_scales.py needs a datasets parameter, and the default dataset is not suitable for code models. How do I load a custom JSON-format dataset?

We use the HF Datasets library for loading calibration data. You can check the HF Datasets documentation to see how to load your custom data.
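
For example, a custom JSON/JSONL calibration file can be loaded like this (the path and field name are placeholders; the script's own dataset-loading code may differ):

from datasets import load_dataset

# Load a local JSON or JSONL file as the calibration set.
dataset = load_dataset("json", data_files="/path/to/calibration.json", split="train")

# Calibration only needs raw text, so extract whichever field your file uses
# (assumed to be called "text" here).
texts = [example["text"] for example in dataset]
print(len(texts), texts[0][:80])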

AniZpZ avatar Nov 03 '23 08:11 AniZpZ

Thanks for the PR! I tried to run this with a quantized Llama 2 70B model on an A100, but loading seems to be extremely slow (the server appears to hang, though GPU memory usage does grow very slowly). Do you happen to have some insights?

yunfeng-scale avatar Nov 06 '23 23:11 yunfeng-scale

Were you able to run a Llama 2 70B model with the current branch? I can get https://github.com/vllm-project/vllm/pull/1112 running with it, but I think there are some merge fixes to be done in the current PR.

yunfeng-scale avatar Nov 07 '23 05:11 yunfeng-scale

Were you able to run a Llama 2 70B model with the current branch? I can get #1112 running with it, but I think there are some merge fixes to be done in the current PR.

The branch hasn't been tested with Llama 2 70B yet. I will try to run Llama 2 70B with the current branch and fix any problems. I will update the code and notify you ASAP.

AniZpZ avatar Nov 07 '23 05:11 AniZpZ

I tried to run an int8 model with w8a8 via api_server.py; when I call api_client.py I get this error:

File "/vllm/vllm/model_executor/layers/attention.py", line 255, in forward
    cache_ops.reshape_and_cache(
RuntimeError: expected scalar type Long but found Int

The server starts fine but can't handle calls from the client.

dengzheng-cloud avatar Nov 09 '23 05:11 dengzheng-cloud

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

gesanqiu avatar Nov 09 '23 09:11 gesanqiu

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

We are running experiments comparing partial and per-token activation quantization. We will release the experiment results this Friday and release either partial quant or per-token quant next Monday, depending on those results.

AniZpZ avatar Nov 09 '23 10:11 AniZpZ

I tried to run an int8 model with w8a8 via api_server.py; when I call api_client.py I get this error:

File "/vllm/vllm/model_executor/layers/attention.py", line 255, in forward
    cache_ops.reshape_and_cache(
RuntimeError: expected scalar type Long but found Int

The server starts fine but can't handle calls from the client.

I am fixing the problem; you can test with the branch at https://github.com/vllm-project/vllm/pull/1112 first. There is still some work to do to adapt to the latest version.

AniZpZ avatar Nov 09 '23 10:11 AniZpZ

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

If first token generation time is your primary consideration, you can test our kv cache quant branch first. It can significantly reduce the first token generation time.

AniZpZ avatar Nov 09 '23 10:11 AniZpZ

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

@yunfeng-scale @gesanqiu Hello! Here, we provide the experiment results based on vllm 0.1.7 for llama-13b, including w8a8, two partial quant methods, and two per-token quant methods.

w8a8: both activation and weight use per-tensor quant
partial quant 1: only down_proj uses fp16
partial quant 2: both o_proj and down_proj use fp16
per-token quant 1: only down_proj uses per-token quant
per-token quant 2: both o_proj and down_proj use per-token quant

Throughput

Model Throughput (tokens/s) Percentage Increase
fp16 528.2081 +0%
w8a8 + kv-quant 797.8306 +50.95%
partial quant 1 + kv-quant 740.8924 +40.32%
partial quant 2 + kv-quant 733.9158 +38.82%
per-token quant 1 + kv-quant 793.8667 +50.19%
per-token quant 2 + kv-quant 775.3904 +46.78%

MMLU scores

model STEM Social Sciences Humanities Other Average
fp16 0.4056 0.4965 0.4750 0.4855 0.4657
w8a8 + kv-quant 0.2584 0.2559 0.2509 0.2589 0.2570
partial quant 1 + kv-quant 0.3955 0.4806 0.4666 0.4805 0.4558
partial quant 2 + kv-quant 0.3987 0.4810 0.4712 0.4765 0.4568
per-token quant 1 + kv-quant 0.3827 0.4663 0.4369 0.4618 0.4369
per-token quant 2 + kv-quant 0.3728 0.4714 0.4500 0.4833 0.4443

As you can see, per-token quant achieves better model performance. Compared with fp16, per-token quant 2 + kv-quant decreases the average MMLU score by only about 0.02, while throughput increases by about 50%. We will release the per-token quant code and merge it with the latest vllm version soon.

HandH1998 avatar Nov 10 '23 09:11 HandH1998

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

Hello! Here, we provide the test results of two partial quant methods for llama-13b: partial quant 1 (only down_proj in fp16) and partial quant 2 (o_proj and down_proj in fp16). Note that both methods also use int8 kv-quant.

Throughput: compared to fp16, partial quant 1 lifts the throughput by about 40%, and partial quant 2 lifts it by about 30%.

MMLU scores

model STEM Social Sciences Humanities Other Average
fp16 0.4056 0.4965 0.4750 0.4855 0.4657
partial quant 1 0.3937 0.4080 0.4163 0.4191 0.4093
partial quant 2 0.4009 0.4796 0.4575 0.4766 0.4537

We will release the partial quant 2 code next week.

As for dynamic activation per-token quant, we applied it to down_proj, but it helps only a little, with a 1.8% throughput increment in our experiments.

We have figured out a method whereby per-token quantization is faster than partial quantization and has similar model performance.

AniZpZ avatar Nov 13 '23 13:11 AniZpZ

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

@yunfeng-scale @gesanqiu Hello! Here, we provide the experiment results based on vllm 0.1.7 for llama-13b including: w8a8, two partial quant methods and two per-token quant methods.

w8a8: both activation and weight use per-tensor quant partial quant 1: only down_proj uses fp16 partial quant 2: both o_proj and down_proj use fp16 per-token quant 1: only down_proj uses per-token quant per-token quant 2: both o_proj and down_proj use quant

Throughput

Model Throughput (tokens/s) Percentage Increase fp16 528.2081 +0% w8a8 + kv-quant 797.8306 +50.95% partial quant 1 + kv-quant 740.8924 +40.32% partial quant 2 + kv-quant 733.9158 +38.82% per-token quant 1 + kv-quant 793.8667 +50.19% per-token quant 2 + kv-quant 775.3904 +46.78% MMLU scores

model STEM Social Sciences Humanities Other Average fp16 0.4056 0.4965 0.4750 0.4855 0.4657 w8a8 + kv-quant 0.2584 0.2559 0.2509 0.2589 0.2570 partial quant 1 + kv-quant 0.3955 0.4806 0.4666 0.4805 0.4558 partial quant 2 + kv-quant 0.3987 0.4810 0.4712 0.4765 0.4568 per-token quant 1 + kv-quant 0.3827 0.4663 0.4369 0.4618 0.4369 per-token quant 2 + kv-quant 0.3728 0.4714 0.4500 0.4833 0.4443 As you see, per-token quant can realize better performance. Compared with fp16, per-token quant 2 + kv-quant only decreases about 0.02 MMLU score on average, but throughput increases about 50%. We will release per-token quant code and merge it with the latest vllm version soon.

Hi, when will this be released? Can we have partial quant 1 and partial quant 2 first?

gesanqiu avatar Nov 21 '23 05:11 gesanqiu

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

@yunfeng-scale @gesanqiu Hello! Here, we provide the experiment results based on vllm 0.1.7 for llama-13b including: w8a8, two partial quant methods and two per-token quant methods. w8a8: both activation and weight use per-tensor quant partial quant 1: only down_proj uses fp16 partial quant 2: both o_proj and down_proj use fp16 per-token quant 1: only down_proj uses per-token quant per-token quant 2: both o_proj and down_proj use quant Throughput Model Throughput (tokens/s) Percentage Increase fp16 528.2081 +0% w8a8 + kv-quant 797.8306 +50.95% partial quant 1 + kv-quant 740.8924 +40.32% partial quant 2 + kv-quant 733.9158 +38.82% per-token quant 1 + kv-quant 793.8667 +50.19% per-token quant 2 + kv-quant 775.3904 +46.78% MMLU scores model STEM Social Sciences Humanities Other Average fp16 0.4056 0.4965 0.4750 0.4855 0.4657 w8a8 + kv-quant 0.2584 0.2559 0.2509 0.2589 0.2570 partial quant 1 + kv-quant 0.3955 0.4806 0.4666 0.4805 0.4558 partial quant 2 + kv-quant 0.3987 0.4810 0.4712 0.4765 0.4568 per-token quant 1 + kv-quant 0.3827 0.4663 0.4369 0.4618 0.4369 per-token quant 2 + kv-quant 0.3728 0.4714 0.4500 0.4833 0.4443 As you see, per-token quant can realize better performance. Compared with fp16, per-token quant 2 + kv-quant only decreases about 0.02 MMLU score on average, but throughput increases about 50%. We will release per-token quant code and merge it with the latest vllm version soon.

Hi, when will this be released? Can we have partial quant 1 and partial quant 2 first?

We are trying to release w8a8 per-token quant this evening; there are still many merge conflicts to resolve.

HandH1998 avatar Nov 21 '23 06:11 HandH1998

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

@yunfeng-scale @gesanqiu Hello! Here, we provide the experiment results based on vllm 0.1.7 for llama-13b including: w8a8, two partial quant methods and two per-token quant methods. w8a8: both activation and weight use per-tensor quant partial quant 1: only down_proj uses fp16 partial quant 2: both o_proj and down_proj use fp16 per-token quant 1: only down_proj uses per-token quant per-token quant 2: both o_proj and down_proj use quant Throughput Model Throughput (tokens/s) Percentage Increase fp16 528.2081 +0% w8a8 + kv-quant 797.8306 +50.95% partial quant 1 + kv-quant 740.8924 +40.32% partial quant 2 + kv-quant 733.9158 +38.82% per-token quant 1 + kv-quant 793.8667 +50.19% per-token quant 2 + kv-quant 775.3904 +46.78% MMLU scores model STEM Social Sciences Humanities Other Average fp16 0.4056 0.4965 0.4750 0.4855 0.4657 w8a8 + kv-quant 0.2584 0.2559 0.2509 0.2589 0.2570 partial quant 1 + kv-quant 0.3955 0.4806 0.4666 0.4805 0.4558 partial quant 2 + kv-quant 0.3987 0.4810 0.4712 0.4765 0.4568 per-token quant 1 + kv-quant 0.3827 0.4663 0.4369 0.4618 0.4369 per-token quant 2 + kv-quant 0.3728 0.4714 0.4500 0.4833 0.4443 As you see, per-token quant can realize better performance. Compared with fp16, per-token quant 2 + kv-quant only decreases about 0.02 MMLU score on average, but throughput increases about 50%. We will release per-token quant code and merge it with the latest vllm version soon.

Hi, when will this be released? Can we have partial quant 1 and partial quant 2 first?

We have released the per-token quant code based on vllm 0.2.1. Welcome to try it!

HandH1998 avatar Nov 22 '23 12:11 HandH1998

@yunfeng-scale Hello there~ We have fixed the loading and running problems in v0.2.1, and applied per-token quantization on o_proj and down_proj, which leads to better model performance. Could you please take some time out of your busy schedule to review our code?

AniZpZ avatar Nov 22 '23 12:11 AniZpZ

Hi there, I tested this patch on an A40 card with CodeLlama-13B, and the benchmark results are as follows:

  1. benchmark_throughput (includes results from #1112)
Model Throughput (tokens/s) Percentage Increase
FP16 742.00 +0%
KV_INT8 1146.75 +54.5%
W8A8 1219.20 +64.3%
W8A8 + KV_INT8 1648.25 +122.1%
W8A8 per-token 1218.42 +64.2%
W4A16 (AWQ) 857.16 +15.5%
  • Note that the throughput of the AWQ model is tested with max_num_seqs=64; in my test case it may be higher when max_num_seqs is in [64, 96].
  2. benchmark_latency
Batch size Input Length Output Length FP16(s) W8A8 per-token(s) AWQ(s)
1 128 32 1.5795 0.9617 0.6867
8 128 32 1.9444 1.1977 1.4441
16 128 32 2.3071 1.4768 2.311
32 128 32 3.2188 2.0462 4.2129
64 128 32 6.828 3.1846 -
32 256 32 5.2961 2.9724 7.5002
32 512 64 16.3069 7.7758 18.1441
32 1024 128 49.6551 22.9284 44.2433
  • Tested lots of cases; only part of them are shown here.
  • Results show that W8A8 per-token quantization performs better under large batch sizes and long contexts.
  • vLLM has a memory management issue with the AWQ model; it hits OOM when batch_size=64.
  3. Prefill latency and decode latency results with batch_size=1
Input Length FP16 prefill(ms/token) FP16 decode(ms/token) W8A8 per-token prefill(ms/token) W8A8 per-token decode(ms/token) AWQ prefill(ms/token) AWQ decode(ms/token)
128 58 28 36 29 92 19
256 79 49 47 29 180 19
512 134 49 84 30 357 19
1024 274 50 166 30 708 19
2048 531 51 333 32 1436 20
  • smoothquant performs better in the prefill phase.

HumanEval results are as follows:

model MBPP(Code) HumanEval_x HumanEval_CPP HumanEval_Java HumanEval_Python Human_Eval_Avg Multiple_Humaneval_Python Multiple_Humaneval_CPP Multiple_Humaneval_Java Multiple_Humaneval_Js Multiple_Humaneval_Avg
FP16 0.619 0.416 0.512 0.5 0.512 0.518 0.54 0.404 0.392 0.509 0.376
W8A8-per-token 0.556 0.369 0.384 0.451 0.5 0.476 0.54 0.342 0.348 0.441 0.328
W4A16G128 0.545 0.361 0.39 0.445 0.5 0.476 0.466 0.348 0.354 0.416 0.326
  • smoothquant decreases the HumanEval score by about 0.05 compared to fp16, but is still better than AWQ.

gesanqiu avatar Nov 23 '23 10:11 gesanqiu

Hi, does this per-token quantization patch only support a single card?

I tested this patch on an A10 with Llama2-7B; there is no problem running on a single card, but if I run with -tp 2, the results seem strange.

Maybe the per-token scale factor should be applied before all_reduce, and a new per-token scale factor re-calculated after all_reduce?

def forward(self, x):
    gate_up, _ = self.gate_up_proj(x)
    scale = None
    if self.use_int8:
        # TODO: currently gate up share same scale, use seperate scales
        x, *scale = self.act_fn(gate_up)
    else:
        x = self.act_fn(gate_up)
    ## apply scale before all_reduce ? 
    x, _ = self.down_proj(x)
    ## update scale after all_reduce ? 
    return x, scale

qiaoxj07 avatar Nov 30 '23 07:11 qiaoxj07

Hi, does this per-token quantization patch only support a single card?

I tested this patch on an A10 with Llama2-7B; there is no problem running on a single card, but if I run with -tp 2, the results seem strange.

Maybe the per-token scale factor should be applied before all_reduce, and a new per-token scale factor re-calculated after all_reduce?

def forward(self, x):
    gate_up, _ = self.gate_up_proj(x)
    scale = None
    if self.use_int8:
        # TODO: currently gate up share same scale, use seperate scales
        x, *scale = self.act_fn(gate_up)
    else:
        x = self.act_fn(gate_up)
    ## apply scale before all_reduce ? 
    x, _ = self.down_proj(x)
    ## update scale after all_reduce ? 
    return x, scale

Thanks for your issue! We have confirmed it is a bug, as we didn't take TP into account when designing per-token quant. The correct way is to do per-token dequant before all_reduce. We will fix it soon!
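
For illustration, here is a minimal sketch of why the dequantization has to happen before the reduction in a row-parallel layer (function and variable names are placeholders, not the actual implementation in this branch):

import torch
import torch.distributed as dist

def row_parallel_w8a8(x_int8, x_scale, w_int8, w_scale):
    # x_int8:  this rank's shard of the int8 activations, [num_tokens, in_features // tp]
    # x_scale: this rank's per-token activation scales,   [num_tokens, 1]
    # w_int8:  this rank's int8 weight shard,             [in_features // tp, out_features]
    # w_scale: weight scale(s), per-tensor or per-channel

    # Stand-in for the real int8 GEMM kernel (which accumulates in int32).
    acc = x_int8.float() @ w_int8.float()

    # Dequantize with this rank's own per-token scales BEFORE all_reduce: each rank
    # computed different activation scales for its shard, so summing the raw
    # quantized partial results across ranks would mix incompatible scales.
    partial = acc * x_scale * w_scale

    # After dequantization, all partial outputs live in the same floating-point
    # domain and can simply be summed across tensor-parallel ranks.
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(partial)
    return partial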

HandH1998 avatar Nov 30 '23 09:11 HandH1998

Hi, does this per-token quantization patch only support a single card?

I tested this patch on an A10 with Llama2-7B; there is no problem running on a single card, but if I run with -tp 2, the results seem strange.

Maybe the per-token scale factor should be applied before all_reduce, and a new per-token scale factor re-calculated after all_reduce?

def forward(self, x):
    gate_up, _ = self.gate_up_proj(x)
    scale = None
    if self.use_int8:
        # TODO: currently gate up share same scale, use seperate scales
        x, *scale = self.act_fn(gate_up)
    else:
        x = self.act_fn(gate_up)
    ## apply scale before all_reduce ? 
    x, _ = self.down_proj(x)
    ## update scale after all_reduce ? 
    return x, scale

We have updated the code and fixed the TP problem when per-token quant is enabled. Please pull the code and try again.

AniZpZ avatar Dec 01 '23 11:12 AniZpZ