
Support W8A8 inference in vllm

Open AniZpZ opened this issue 2 years ago • 53 comments

We have implemented W8A8 inference in vLLM, which can achieve a 30% improvement in throughput. W4A16 quantization methods require weights to be dequantized into fp16 before computation, which leads to a throughput drop under heavier load. This PR is part of https://github.com/vllm-project/vllm/pull/1112. We have split the huge PR into two independent parts for easier review. The usage of W8A8 inference is simple (only Llama is supported for now):

python ./vllm/entrypoints/api_server.py --model=/path/to/quantized/model --tokenizer=/path/to/tokenizer --max-num-batched-tokens=70000 --block-size=16 --swap-space=20 --quantization smoothquant

Updates!!! We have updated the quantization method to per-token quant for o_proj and down_proj of Llama. Please use the latest llama-dev branch of smoothquant and the per_token_quant branch of torch-int to generate the int8 model!!!

You can find more details, such as how to generate int8 weights, in the original PR https://github.com/vllm-project/vllm/pull/1112. You can combine this method with int8 KV cache quant https://github.com/vllm-project/vllm/pull/1507 for the best throughput.
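
Once the server is up you can query it from a client. Below is a minimal sketch of such a call, assuming the demo api_server exposes its usual /generate endpoint on the default port; the host, port, prompt, and sampling parameters are placeholders you should adapt to your setup:

import requests

# Placeholder host/port; point this at wherever api_server.py is listening.
API_URL = "http://localhost:8000/generate"

payload = {
    "prompt": "San Francisco is a",  # example prompt
    "max_tokens": 64,                # sampling parameters are forwarded to vLLM
    "temperature": 0.0,
    "stream": False,
}

response = requests.post(API_URL, json=payload)
response.raise_for_status()
print(response.json())  # the demo server returns the generated text as JSON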

AniZpZ avatar Oct 30 '23 13:10 AniZpZ

We have implemented W8A8 inference in vLLM, which can achieve a 30% improvement in throughput. W4A16 quantization methods require weights to be dequantized into fp16 before computation, which leads to a throughput drop under heavier load. This PR is part of #1112. We have split the huge PR into two independent parts for easier review. The usage of W8A8 inference is simple (only Llama is supported for now):

python ./vllm/entrypoints/api_server.py --model=/path/to/quantized/model --tokenizer=/path/to/tokenizer --max-num-batched-tokens=70000 --block-size=16 --swap-space=20 --quantization smoothquant

You can find more details, such as how to generate int8 weights, in the original PR #1112

Does this PR only support per-tensor quantization? Do you plan to support per-token and per-channel?

pangr avatar Oct 31 '23 03:10 pangr

We have implemented W8A8 inference in vLLM, which can achieve a 30% improvement in throughput. W4A16 quantization methods require weights to be dequantized into fp16 before computation, which leads to a throughput drop under heavier load. This PR is part of #1112. We have split the huge PR into two independent parts for easier review. The usage of W8A8 inference is simple (only Llama is supported for now):

python ./vllm/entrypoints/api_server.py --model=/path/to/quantized/model --tokenizer=/path/to/tokenizer --max-num-batched-tokens=70000 --block-size=16 --swap-space=20 --quantization smoothquant

You can find more details, such as how to generate int8 weights, in the original PR #1112

Does this PR only support per-tensor quantization? Do you plan to support per-token and per-channel?

Only per-tensor quantization is supported for now. We may support per-channel if we find it makes a big difference in model performance.
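
For readers unfamiliar with the terms, the difference between these granularities is just how many scale factors are kept. A minimal illustrative sketch (shapes and the symmetric int8 scheme are assumptions for illustration, not the exact code in this PR):

import torch

w = torch.randn(4096, 11008)   # weight [in_features, out_features], illustrative shape
x = torch.randn(16, 4096)      # activations [num_tokens, hidden_size]

# per-tensor: a single scale for the whole tensor (what this PR uses initially)
w_scale_per_tensor = w.abs().max() / 127

# per-channel: one scale per output channel of the weight (static, computed offline)
w_scale_per_channel = w.abs().amax(dim=0, keepdim=True) / 127   # shape [1, out_features]

# per-token: one scale per activation row, computed on the fly at inference time
x_scale_per_token = x.abs().amax(dim=1, keepdim=True) / 127     # shape [num_tokens, 1]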

AniZpZ avatar Oct 31 '23 09:10 AniZpZ

I downloaded your vllm w8a8 branch, but I ran into the error below. Should I add Int8LlamaForCausalLM in smoothquant?

ValueError: Model architectures ['Int8LlamaForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'FalconForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MPTForCausalLM', 'OPTForCausalLM', 'QWenLMHeadModel', 'RWForCausalLM']

chisnova avatar Nov 01 '23 05:11 chisnova

I downloaded your vllm w8a8 branch, but I ran into the error below. Should I add Int8LlamaForCausalLM in smoothquant?

ValueError: Model architectures ['Int8LlamaForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'FalconForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MPTForCausalLM', 'OPTForCausalLM', 'QWenLMHeadModel', 'RWForCausalLM']

Sorry, this is an oversight on our part: there is a naming difference between the smoothquant repo and the vllm repo. Please change the 'architectures' field in the quantized model's config.json from 'Int8LlamaForCausalLM' to 'LlamaForCausalLM'. We will fix this problem soon.
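
If you want to script that one-line rename instead of editing by hand, a minimal sketch (the model path is a placeholder):

import json
from pathlib import Path

config_path = Path("/path/to/quantized/model/config.json")  # placeholder path
config = json.loads(config_path.read_text())

# Rename the architecture so vLLM's model registry recognizes the checkpoint.
if config.get("architectures") == ["Int8LlamaForCausalLM"]:
    config["architectures"] = ["LlamaForCausalLM"]
    config_path.write_text(json.dumps(config, indent=2))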

AniZpZ avatar Nov 02 '23 03:11 AniZpZ

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this.

I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

chisnova avatar Nov 02 '23 05:11 chisnova

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this.

I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant https://github.com/vllm-project/vllm/pull/1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

AniZpZ avatar Nov 02 '23 12:11 AniZpZ

@AniZpZ Existing methods (AWQ, GPTQ) go down to 4-bit quantization, saving lots of memory. The speed improvements of 8-bit inference come during inference, which theoretically could be combined with AWQ.

Would it be possible to integrate AWQ or GPTQ to go from 4-bit quantization to 8-bit on-the-fly?

truenorth8 avatar Nov 02 '23 20:11 truenorth8

@AniZpZ hi, thanks for this. I want to try it, but could you give some simple quantization guides? I looked into the smoothquant repo, and it's hard for me to quantize a new model since I don't know how to use my own datasets. Previously I used AutoAWQ with a JSON-format dataset for quantization. Also, there are not many smoothquant models on Hugging Face (e.g. CodeLlama).

esmeetu avatar Nov 03 '23 00:11 esmeetu

@AniZpZ Existing methods (AWQ, GPTQ) go down to 4-bit quantization, saving lots of memory. The speed improvements of 8-bit inference come during inference, which theoretically could be combined with AWQ.

Would it be possible to integrate AWQ or GPTQ to go from 4-bit quantization to 8-bit on-the-fly?

The reason why GPTQ and AWQ cannot achieve a throughput improvement is that they are weight-only quantization. Weight-only quantization requires dequantizing the weights to fp16 (or something like bf16) and performing a 16-bit (or higher) GEMM during inference, so they cannot benefit from reduced computation.
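
To illustrate the difference, here is a rough sketch of the two dataflows in plain PyTorch (float32 stands in for fp16 and a plain matmul stands in for the real int8 GEMM kernel; shapes and scale layouts are assumptions for illustration only):

import torch

x = torch.randn(16, 4096)                                   # activations [tokens, hidden]
w_int8 = torch.randint(-128, 128, (4096, 4096), dtype=torch.int8)
w_scale = torch.rand(4096)                                  # per-output-channel weight scales

# Weight-only (W4A16/W8A16) path: dequantize the weight, then run a full-precision GEMM.
w_fp = w_int8.float() * w_scale                             # extra dequant work every forward pass
y_weight_only = x @ w_fp                                    # GEMM still runs at 16/32-bit precision

# W8A8 path: quantize the activations too, so the GEMM itself runs in int8.
x_scale = x.abs().amax(dim=-1, keepdim=True) / 127          # per-token activation scales
x_int8 = (x / x_scale).round().clamp(-128, 127).to(torch.int8)
acc = x_int8.to(torch.int32) @ w_int8.to(torch.int32)       # stand-in for an int8 GEMM (int32 accum)
y_w8a8 = acc.float() * x_scale * w_scale                    # dequantize the output once

Only the second path reduces the arithmetic inside the GEMM, which is where the throughput gain comes from.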

AniZpZ avatar Nov 03 '23 02:11 AniZpZ

@AniZpZ hi, thanks for this. I want to try it, but could you give some simple quantization guides? I looked into the smoothquant repo, and it's hard for me to quantize a new model since I don't know how to use my own datasets. Previously I used AutoAWQ with a JSON-format dataset for quantization. Also, there are not many smoothquant models on Hugging Face (e.g. CodeLlama).

First get a normal Llama-13B model. Then install smoothquant and torch-int for Llama. Use "examples/generate_act_scales.py" to generate the activation scales, and then use "examples/export_int8_llama.py" to export the int8 model. Please remember to check and change the 'architectures' field in the model's config.json from 'Int8LlamaForCausalLM' to 'LlamaForCausalLM'.

AniZpZ avatar Nov 03 '23 02:11 AniZpZ

@AniZpZ Thanks for your quick reply! generate_act_scales.py needs a datasets parameter, and the default dataset is not suitable for code models. How do I load a custom JSON-format dataset?

esmeetu avatar Nov 03 '23 04:11 esmeetu

@AniZpZ Thanks for your quick reply! generate_act_scales.py needs a datasets parameter, and the default dataset is not suitable for code models. How do I load a custom JSON-format dataset?

We use the HF Datasets library for loading calibration data. You can check the HF Datasets documentation to see how to load your custom data.
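
For example, a custom JSON/JSONL calibration file can be loaded like this (the path and field name are placeholders; the script's own dataset-loading code may differ):

from datasets import load_dataset

# Load a local JSON or JSONL file as the calibration set.
dataset = load_dataset("json", data_files="/path/to/calibration.json", split="train")

# Calibration only needs raw text, so extract whichever field your file uses
# (assumed to be called "text" here).
texts = [example["text"] for example in dataset]
print(len(texts), texts[0][:80])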

AniZpZ avatar Nov 03 '23 08:11 AniZpZ

Thanks for the PR! I tried to run this with a quantized Llama 2 70B model on an A100, but loading seems to be extremely slow (the server appears to hang, though GPU memory usage does grow very slowly). Do you happen to have some insights?

yunfeng-scale avatar Nov 06 '23 23:11 yunfeng-scale

Were you able to run a Llama 2 70B model with the current branch? I can get https://github.com/vllm-project/vllm/pull/1112 running with it, but I think there are some merge fixes to be done in the current PR.

yunfeng-scale avatar Nov 07 '23 05:11 yunfeng-scale

Were you able to run a Llama 2 70B model with the current branch? I can get #1112 running with it, but I think there are some merge fixes to be done in the current PR.

The branch hasn't been tested with Llama 2 70B yet. I will try to run Llama 2 70B with the current branch and fix any problems. I will update the code and notify you ASAP.

AniZpZ avatar Nov 07 '23 05:11 AniZpZ

I tried to run an int8 model with w8a8 via api_server.py; when I call api_client.py I get this error:

File "/vllm/vllm/model_executor/layers/attention.py", line 255, in forward
    cache_ops.reshape_and_cache(
RuntimeError: expected scalar type Long but found Int

The server starts fine but can't handle calls from the client.

dengzheng-cloud avatar Nov 09 '23 05:11 dengzheng-cloud

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

gesanqiu avatar Nov 09 '23 09:11 gesanqiu

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

We are running experiments comparing partial and per-token activation quantization. We will release the experiment results this Friday and release either partial quant or per-token quant next Monday, depending on those results.

AniZpZ avatar Nov 09 '23 10:11 AniZpZ

I tried to run an int8 model with w8a8 via api_server.py; when I call api_client.py I get this error:

File "/vllm/vllm/model_executor/layers/attention.py", line 255, in forward
    cache_ops.reshape_and_cache(
RuntimeError: expected scalar type Long but found Int

The server starts fine but can't handle calls from the client.

I am fixing the problem; you can test with the branch at https://github.com/vllm-project/vllm/pull/1112 first. There is still some work to do to adapt to the latest version.

AniZpZ avatar Nov 09 '23 10:11 AniZpZ

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

If first token generation time is your primary consideration, you can test our kv cache quant branch first. It can significantly reduce the first token generation time.

AniZpZ avatar Nov 09 '23 10:11 AniZpZ

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

@yunfeng-scale @gesanqiu Hello! Here, we provide the experiment results based on vllm 0.1.7 for llama-13b, including w8a8, two partial quant methods, and two per-token quant methods.

w8a8: both activation and weight use per-tensor quant
partial quant 1: only down_proj uses fp16
partial quant 2: both o_proj and down_proj use fp16
per-token quant 1: only down_proj uses per-token quant
per-token quant 2: both o_proj and down_proj use per-token quant

Throughput

Model Throughput (tokens/s) Percentage Increase
fp16 528.2081 +0%
w8a8 + kv-quant 797.8306 +50.95%
partial quant 1 + kv-quant 740.8924 +40.32%
partial quant 2 + kv-quant 733.9158 +38.82%
per-token quant 1 + kv-quant 793.8667 +50.19%
per-token quant 2 + kv-quant 775.3904 +46.78%

MMLU scores

model STEM Social Sciences Humanities Other Average
fp16 0.4056 0.4965 0.4750 0.4855 0.4657
w8a8 + kv-quant 0.2584 0.2559 0.2509 0.2589 0.2570
partial quant 1 + kv-quant 0.3955 0.4806 0.4666 0.4805 0.4558
partial quant 2 + kv-quant 0.3987 0.4810 0.4712 0.4765 0.4568
per-token quant 1 + kv-quant 0.3827 0.4663 0.4369 0.4618 0.4369
per-token quant 2 + kv-quant 0.3728 0.4714 0.4500 0.4833 0.4443

As you can see, per-token quant achieves better model performance. Compared with fp16, per-token quant 2 + kv-quant decreases the average MMLU score by only about 0.02, while throughput increases by about 50%. We will release the per-token quant code and merge it with the latest vllm version soon.

HandH1998 avatar Nov 10 '23 09:11 HandH1998

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

Hello! Here, we provide the test results of two partial quant methods for llama-13b: partial quant 1 (only down_proj in fp16) and partial quant 2 (o_proj and down_proj in fp16). Note that both methods also use int8 kv-quant.

Throughput: compared to fp16, partial quant 1 lifts the throughput by about 40%, and partial quant 2 lifts it by about 30%.

MMLU scores

model STEM Social Sciences Humanities Other Average
fp16 0.4056 0.4965 0.4750 0.4855 0.4657
partial quant 1 0.3937 0.4080 0.4163 0.4191 0.4093
partial quant 2 0.4009 0.4796 0.4575 0.4766 0.4537

We will release the partial quant 2 code next week.

As for dynamic activation per-token quant, we applied it to down_proj, but it helps only a little, with a 1.8% throughput increment in our experiments.

We have figured out a method whereby per-token quantization is faster than partial quantization and has similar model performance.

AniZpZ avatar Nov 13 '23 13:11 AniZpZ

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

@yunfeng-scale @gesanqiu Hello! Here, we provide the experiment results based on vllm 0.1.7 for llama-13b including: w8a8, two partial quant methods and two per-token quant methods.

w8a8: both activation and weight use per-tensor quant partial quant 1: only down_proj uses fp16 partial quant 2: both o_proj and down_proj use fp16 per-token quant 1: only down_proj uses per-token quant per-token quant 2: both o_proj and down_proj use quant

Throughput

Model Throughput (tokens/s) Percentage Increase fp16 528.2081 +0% w8a8 + kv-quant 797.8306 +50.95% partial quant 1 + kv-quant 740.8924 +40.32% partial quant 2 + kv-quant 733.9158 +38.82% per-token quant 1 + kv-quant 793.8667 +50.19% per-token quant 2 + kv-quant 775.3904 +46.78% MMLU scores

model STEM Social Sciences Humanities Other Average fp16 0.4056 0.4965 0.4750 0.4855 0.4657 w8a8 + kv-quant 0.2584 0.2559 0.2509 0.2589 0.2570 partial quant 1 + kv-quant 0.3955 0.4806 0.4666 0.4805 0.4558 partial quant 2 + kv-quant 0.3987 0.4810 0.4712 0.4765 0.4568 per-token quant 1 + kv-quant 0.3827 0.4663 0.4369 0.4618 0.4369 per-token quant 2 + kv-quant 0.3728 0.4714 0.4500 0.4833 0.4443 As you see, per-token quant can realize better performance. Compared with fp16, per-token quant 2 + kv-quant only decreases about 0.02 MMLU score on average, but throughput increases about 50%. We will release per-token quant code and merge it with the latest vllm version soon.

Hi, when will this be released? Can we have partial quant 1 and partial quant 2 first?

gesanqiu avatar Nov 21 '23 05:11 gesanqiu

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

@yunfeng-scale @gesanqiu Hello! Here, we provide the experiment results based on vllm 0.1.7 for llama-13b including: w8a8, two partial quant methods and two per-token quant methods. w8a8: both activation and weight use per-tensor quant partial quant 1: only down_proj uses fp16 partial quant 2: both o_proj and down_proj use fp16 per-token quant 1: only down_proj uses per-token quant per-token quant 2: both o_proj and down_proj use quant Throughput Model Throughput (tokens/s) Percentage Increase fp16 528.2081 +0% w8a8 + kv-quant 797.8306 +50.95% partial quant 1 + kv-quant 740.8924 +40.32% partial quant 2 + kv-quant 733.9158 +38.82% per-token quant 1 + kv-quant 793.8667 +50.19% per-token quant 2 + kv-quant 775.3904 +46.78% MMLU scores model STEM Social Sciences Humanities Other Average fp16 0.4056 0.4965 0.4750 0.4855 0.4657 w8a8 + kv-quant 0.2584 0.2559 0.2509 0.2589 0.2570 partial quant 1 + kv-quant 0.3955 0.4806 0.4666 0.4805 0.4558 partial quant 2 + kv-quant 0.3987 0.4810 0.4712 0.4765 0.4568 per-token quant 1 + kv-quant 0.3827 0.4663 0.4369 0.4618 0.4369 per-token quant 2 + kv-quant 0.3728 0.4714 0.4500 0.4833 0.4443 As you see, per-token quant can realize better performance. Compared with fp16, per-token quant 2 + kv-quant only decreases about 0.02 MMLU score on average, but throughput increases about 50%. We will release per-token quant code and merge it with the latest vllm version soon.

Hi, when will this be released? Can we have partial quant 1 and partial quant 2 first?

We are trying to release w8a8 per-token quant this evening; there are still many merge conflicts to resolve.

HandH1998 avatar Nov 21 '23 06:11 HandH1998

Thanks for your answer. And did you apply partial quantization, meaning the down_proj layer remains in fp16 because of its large activation range? As you know, there is a comment in #1112 about this. I was running our Llama 2 model with smoothquant but saw a big performance drop, so if you haven't merged yet, it would be better to include this in your PR.

We are addressing the performance drop with a more solid method. We have analyzed the activation distribution and are trying to fix the problem with per-channel quantization. The work should be done next week. You can try out KV cache quant #1507 if you want to avoid the problem for now; it can lift throughput by 15% with minimal performance loss.

Will the w8a8 partial quant be released this week?

@yunfeng-scale @gesanqiu Hello! Here, we provide the experiment results based on vllm 0.1.7 for llama-13b including: w8a8, two partial quant methods and two per-token quant methods. w8a8: both activation and weight use per-tensor quant partial quant 1: only down_proj uses fp16 partial quant 2: both o_proj and down_proj use fp16 per-token quant 1: only down_proj uses per-token quant per-token quant 2: both o_proj and down_proj use quant Throughput Model Throughput (tokens/s) Percentage Increase fp16 528.2081 +0% w8a8 + kv-quant 797.8306 +50.95% partial quant 1 + kv-quant 740.8924 +40.32% partial quant 2 + kv-quant 733.9158 +38.82% per-token quant 1 + kv-quant 793.8667 +50.19% per-token quant 2 + kv-quant 775.3904 +46.78% MMLU scores model STEM Social Sciences Humanities Other Average fp16 0.4056 0.4965 0.4750 0.4855 0.4657 w8a8 + kv-quant 0.2584 0.2559 0.2509 0.2589 0.2570 partial quant 1 + kv-quant 0.3955 0.4806 0.4666 0.4805 0.4558 partial quant 2 + kv-quant 0.3987 0.4810 0.4712 0.4765 0.4568 per-token quant 1 + kv-quant 0.3827 0.4663 0.4369 0.4618 0.4369 per-token quant 2 + kv-quant 0.3728 0.4714 0.4500 0.4833 0.4443 As you see, per-token quant can realize better performance. Compared with fp16, per-token quant 2 + kv-quant only decreases about 0.02 MMLU score on average, but throughput increases about 50%. We will release per-token quant code and merge it with the latest vllm version soon.

Hi, when will this be released? Can we have partial quant 1 and partial quant 2 first?

We have released the per-token quant code based on vllm 0.2.1. Welcome to try it!

HandH1998 avatar Nov 22 '23 12:11 HandH1998

@yunfeng-scale Hello there~ We have fixed the loading and running problems in v0.2.1, and applied per-token quantization on o_proj and down_proj, which leads to better model performance. Could you please take some time out of your busy schedule to review our code?

AniZpZ avatar Nov 22 '23 12:11 AniZpZ

Hi there, I tested this patch on an A40 card with CodeLlama-13B, and the benchmark results are as follows:

  1. benchmark_throughput (includes results from #1112)
Model Throughput (tokens/s) Percentage Increase
FP16 742.00 +0%
KV_INT8 1146.75 +54.5%
W8A8 1219.20 +64.3%
W8A8 + KV_INT8 1648.25 +122.1%
W8A8 per-token 1218.42 +64.2%
W4A16 (AWQ) 857.16 +15.5%
  • Note that the throughput of the AWQ model is tested with max_num_seqs=64; in my test case it may be higher when max_num_seqs is in [64, 96].
  2. benchmark_latency
Batch size Input Length Output Length FP16(s) W8A8 per-token(s) AWQ(s)
1 128 32 1.5795 0.9617 0.6867
8 128 32 1.9444 1.1977 1.4441
16 128 32 2.3071 1.4768 2.311
32 128 32 3.2188 2.0462 4.2129
64 128 32 6.828 3.1846 -
32 256 32 5.2961 2.9724 7.5002
32 512 64 16.3069 7.7758 18.1441
32 1024 128 49.6551 22.9284 44.2433
  • Tested lots of cases; only part of them are shown here.
  • Results show that W8A8 per-token quantization performs better under large batch sizes and long contexts.
  • vLLM has a memory management issue with the AWQ model; it hits OOM when batch_size=64.
  3. Prefill latency and decode latency results with batch_size=1
Input Length FP16 prefill(ms/token) FP16 decode(ms/token) W8A8 per-token prefill(ms/token) W8A8 per-token decode(ms/token) AWQ prefill(ms/token) AWQ decode(ms/token)
128 58 28 36 29 92 19
256 79 49 47 29 180 19
512 134 49 84 30 357 19
1024 274 50 166 30 708 19
2048 531 51 333 32 1436 20
  • smoothquant performs better in the prefill phase.

HumanEval results are as follows:

model MBPP(Code) HumanEval_x HumanEval_CPP HumanEval_Java HumanEval_Python Human_Eval_Avg Multiple_Humaneval_Python Multiple_Humaneval_CPP Multiple_Humaneval_Java Multiple_Humaneval_Js Multiple_Humaneval_Avg
FP16 0.619 0.416 0.512 0.5 0.512 0.518 0.54 0.404 0.392 0.509 0.376
W8A8-per-token 0.556 0.369 0.384 0.451 0.5 0.476 0.54 0.342 0.348 0.441 0.328
W4A16G128 0.545 0.361 0.39 0.445 0.5 0.476 0.466 0.348 0.354 0.416 0.326
  • smoothquant decreases the HumanEval score by about 0.05 compared to fp16, but is still better than AWQ.

gesanqiu avatar Nov 23 '23 10:11 gesanqiu

Hi, does this per-token quantization patch only support a single card?

I tested this patch on an A10 with Llama2-7B; there is no problem running on a single card, but if I run with -tp 2, the results seem strange.

Maybe the per-token scale factor should be applied before all_reduce, and a new per-token scale factor re-calculated after all_reduce?

def forward(self, x):
    gate_up, _ = self.gate_up_proj(x)
    scale = None
    if self.use_int8:
        # TODO: currently gate up share same scale, use seperate scales
        x, *scale = self.act_fn(gate_up)
    else:
        x = self.act_fn(gate_up)
    ## apply scale before all_reduce ? 
    x, _ = self.down_proj(x)
    ## update scale after all_reduce ? 
    return x, scale

qiaoxj07 avatar Nov 30 '23 07:11 qiaoxj07

Hi, does this per-token quantization patch only support a single card?

I tested this patch on an A10 with Llama2-7B; there is no problem running on a single card, but if I run with -tp 2, the results seem strange.

Maybe the per-token scale factor should be applied before all_reduce, and a new per-token scale factor re-calculated after all_reduce?

def forward(self, x):
    gate_up, _ = self.gate_up_proj(x)
    scale = None
    if self.use_int8:
        # TODO: currently gate up share same scale, use seperate scales
        x, *scale = self.act_fn(gate_up)
    else:
        x = self.act_fn(gate_up)
    ## apply scale before all_reduce ? 
    x, _ = self.down_proj(x)
    ## update scale after all_reduce ? 
    return x, scale

Thanks for your issue! We have confirmed it is a bug, as we didn't take TP into account when designing per-token quant. The correct way is to do per-token dequant before all_reduce. We will fix it soon!
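
For illustration, here is a minimal sketch of why the dequantization has to happen before the reduction in a row-parallel layer (function and variable names are placeholders, not the actual implementation in this branch):

import torch
import torch.distributed as dist

def row_parallel_w8a8(x_int8, x_scale, w_int8, w_scale):
    # x_int8:  this rank's shard of the int8 activations, [num_tokens, in_features // tp]
    # x_scale: this rank's per-token activation scales,   [num_tokens, 1]
    # w_int8:  this rank's int8 weight shard,             [in_features // tp, out_features]
    # w_scale: weight scale(s), per-tensor or per-channel

    # Stand-in for the real int8 GEMM kernel (which accumulates in int32).
    acc = x_int8.float() @ w_int8.float()

    # Dequantize with this rank's own per-token scales BEFORE all_reduce: each rank
    # computed different activation scales for its shard, so summing the raw
    # quantized partial results across ranks would mix incompatible scales.
    partial = acc * x_scale * w_scale

    # After dequantization, all partial outputs live in the same floating-point
    # domain and can simply be summed across tensor-parallel ranks.
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(partial)
    return partial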

HandH1998 avatar Nov 30 '23 09:11 HandH1998

Hi, does this per-token quantization patch only support a single card?

I tested this patch on an A10 with Llama2-7B; there is no problem running on a single card, but if I run with -tp 2, the results seem strange.

Maybe the per-token scale factor should be applied before all_reduce, and a new per-token scale factor re-calculated after all_reduce?

def forward(self, x):
    gate_up, _ = self.gate_up_proj(x)
    scale = None
    if self.use_int8:
        # TODO: currently gate up share same scale, use seperate scales
        x, *scale = self.act_fn(gate_up)
    else:
        x = self.act_fn(gate_up)
    ## apply scale before all_reduce ? 
    x, _ = self.down_proj(x)
    ## update scale after all_reduce ? 
    return x, scale

We have updated the code and fixed the TP problem when per-token quant is enabled. Please pull the code and try again.

AniZpZ avatar Dec 01 '23 11:12 AniZpZ