
[Feature] support qqq(w4a8) for lmdeploy

Open · HandH1998 opened this issue 1 year ago · 14 comments

Motivation

We have implemented W4A8 quantization for the lmdeploy turbomind backend using our quantization algorithm QQQ to enhance inference throughput. We hope that lmdeploy users will find this beneficial. Additionally, we have submitted a PR to vLLM, which has been incorporated into vLLM v0.5.4.

Modification

We have completed the following tasks to enable the w4a8 pipeline:

  • [x] Converted our QQQ quantized model weights to lmdeploy format.
  • [x] Enabled the turbomind backend to load quantized model weights.
  • [x] Added the Marlin QQQ w4a8 GEMM kernel.
  • [x] Fused online quantization with element-wise operations such as RMSNorm and SiLU (see the sketch after this list).
  • [x] Modified the inference pipeline to accommodate online activation quantization.
  • [x] Fused gate and up weights into one weight.
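
For reference, the sketch below illustrates the idea behind the fused online activation quantization: after RMSNorm, each token row is quantized to int8 with a per-token scale that is then consumed by the W4A8 GEMM. This is a minimal PyTorch sketch of the concept only, not the fused CUDA kernel in this PR; the function and tensor names are illustrative.

import torch

def rmsnorm_quant(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # Conceptual only: RMSNorm followed by per-token symmetric int8 quantization.
    # RMSNorm: scale each token row by the inverse RMS of its elements.
    y = x.float() * torch.rsqrt(x.float().pow(2).mean(dim=-1, keepdim=True) + eps)
    y = y * weight.float()
    # Per-token symmetric int8 quantization: one scale per token row.
    scale = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(y / scale), -128, 127).to(torch.int8)
    return q, scale  # int8 activations + fp32 scales for the W4A8 GEMM

x = torch.randn(4, 4096, dtype=torch.float16)  # [num_tokens, hidden_size]
w = torch.ones(4096, dtype=torch.float16)      # RMSNorm weight
q, s = rmsnorm_quant(x, w)
print(q.dtype, s.shape)  # torch.int8, torch.Size([4, 1])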

Use cases

First, export the quantized model weights using our repo. Then you can enable QQQ in the same manner as you would enable AWQ. Below we provide two examples, one for inference and one for serving.

Inference

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig, ChatTemplateConfig

# Here we use the completion template. You can modify `capability` following the official guidance.
chat_config = ChatTemplateConfig(model_name='llama2', capability='completion')

backend_config = TurbomindEngineConfig(model_format='qqq')
model_path = your_quantized_model_path  # path to the exported QQQ model weights

pipe = pipeline(model_path=model_path,
                chat_template_config=chat_config,
                backend_config=backend_config,
                log_level='INFO')

gen_config = GenerationConfig(top_p=0.95,
                              temperature=0.8,
                              repetition_penalty=1.0,
                              random_seed=0,
                              max_new_tokens=512)
prompts = ["Hi, pls intro yourself", "Shanghai is"]
response = pipe(prompts, gen_config=gen_config)
print(response)

Service

lmdeploy serve api_server your_quantized_model_path --backend turbomind --model-format qqq
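
Once the server is running, you can query it like any other lmdeploy api_server deployment. The snippet below is a minimal client sketch that assumes the default port 23333 and the OpenAI-compatible /v1/chat/completions route; adjust the host, port, and model name (see GET /v1/models) for your setup.

import requests

resp = requests.post(
    "http://localhost:23333/v1/chat/completions",  # default api_server port is assumed
    json={
        "model": "your_quantized_model_path",      # placeholder; check GET /v1/models
        "messages": [{"role": "user", "content": "Hi, pls intro yourself"}],
        "max_tokens": 128,
        "temperature": 0.8,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])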

Benchmark

Accuracy

We employ OpenCompass to evaluate the quantized model. Here we provide the evaluation results for llama2-13b-base.

                   ceval   mmlu    triviaqa   gsm8k
FP16               38.46   41.35   67.36      29.72
AWQ-g128 (W4A16)   36.00   41.48   66.53      29.87
QQQ (W4A8)         32.93   41.09   64.35      25.70
QQQ-g128 (W4A8)    36.16   40.94   65.85      28.51

You can add the following script to the OpenCompass configs to reproduce our results.

from mmengine.config import read_base
from opencompass.models.turbomind import TurboMindModel

with read_base():
    # choose a list of datasets
    from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
    from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
    from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
    from .datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
    from .datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
    # and output the results in a chosen format
    from .summarizers.medium import summarizer

datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])

llama2_13b_base = dict(
        type=TurboMindModel,
        abbr='llama2-13b-base-qqq-g128',
        path=your_quantized_model_path,
        engine_config=dict(session_len=2048,
                           max_batch_size=8,
                           rope_scaling_factor=1.0,
                           model_format="qqq"),
        gen_config=dict(top_k=1, top_p=0.8,
                        temperature=1.0,
                        max_new_tokens=100),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        concurrency=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    )

models = [llama2_13b_base]

Throughput

We use the script profile_restful_api.py and the ShareGPT dataset to benchmark throughput. Here we provide the results for llama2-13b-base on one A100-80G. Settings:

  • concurrency: 128
  • num_prompts: 1000
  • number of prompt tokens: 248339
  • number of completion tokens: 240582
                   RPS (requests/s)   completion tok/s   prompt + completion tok/s
FP16               7.300              1756.165           3568.954
AWQ-g128 (W4A16)   8.272              1990.156           4044.479
QQQ (W4A8)         9.454              2296.056           4666.144
QQQ-g128 (W4A8)    8.484              2041.167           4148.146
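
As a rough cross-check, the columns above are tied together by the elapsed benchmark time: completion token throughput ≈ completion tokens / elapsed seconds and RPS ≈ num_prompts / elapsed seconds. A quick check on the FP16 row (the numbers below come straight from the table and settings above):

# Consistency check for the FP16 row.
num_prompts = 1000
prompt_tokens = 248339
completion_tokens = 240582
completion_tps = 1756.165                      # completion tok/s from the table

elapsed = completion_tokens / completion_tps   # ~137.0 s
print(num_prompts / elapsed)                   # ~7.30 RPS
print((prompt_tokens + completion_tokens) / elapsed)  # ~3569 tok/s (prompt + completion)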

HandH1998 · Aug 09 '24 06:08

Hi @HandH1998, nice work! Could you merge the latest main branch and fix the conflicts?

zhyncs · Aug 09 '24 06:08

We might wait for the merge of https://github.com/InternLM/lmdeploy/pull/2090.

zhyncs · Aug 09 '24 06:08

Brilliant!

lvhan028 · Aug 09 '24 06:08

When implementing W8A8 in the future, some components may be reused.

zhyncs · Aug 09 '24 06:08

Hi @HandH1998, nice work! Could you merge the latest main branch and fix the conflicts?

Done

HandH1998 · Aug 09 '24 06:08

clang-format 11 works; see https://github.com/muttleyxd/clang-tools-static-binaries/releases/download/master-609f1513/clang-format-11_linux-amd64 @HandH1998

zhyncs · Aug 09 '24 07:08

@HandH1998 The Windows build is still failing.

zhyncs · Aug 09 '24 11:08

@HandH1998 It seems that, if w4a8 is per-channel quantized without groups (group_size=-1), the w8a8 Triton kernel in the lmdeploy repo can easily be modified into a w4a8 one. Will the speedup of the QQQ CUDA implementation be similar to that of the Triton version, since things get a lot simpler without group quantization?

brisker · Aug 13 '24 12:08
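
For context on the group_size question above, here is a small sketch contrasting per-channel (group_size=-1) and group-wise (e.g. group_size=128) symmetric int4 weight quantization. It only illustrates the scale granularity being discussed; it is not the QQQ/Marlin weight packing or a Triton kernel.

import torch

def quantize_int4(w: torch.Tensor, group_size: int = -1):
    # w: [out_features, in_features] weight matrix.
    # group_size=-1 -> one scale per output channel (per-channel).
    # group_size=g  -> one scale per g consecutive input elements (group-wise).
    out_f, in_f = w.shape
    g = in_f if group_size == -1 else group_size
    wg = w.reshape(out_f, in_f // g, g)
    scale = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0  # int4 range [-8, 7]
    q = torch.clamp(torch.round(wg / scale), -8, 7).to(torch.int8)
    return q.reshape(out_f, in_f), scale.squeeze(-1)

w = torch.randn(4096, 4096)
q_pc, s_pc = quantize_int4(w, group_size=-1)   # scales: [4096, 1]
q_g, s_g = quantize_int4(w, group_size=128)    # scales: [4096, 32]
print(s_pc.shape, s_g.shape)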

@HandH1998 Marlin W4A16 is mainly optimized for A100, but its performance is still worse than TurboMind AWQ. Marlin's performance on H100 is mediocre, and the gap compared to #2090 is very large. After #2090 merges next week, this PR will be reviewed. There are probably two strategies: one is to review the current implementation first (assuming you still merge the latest main and resolve the remaining conflicts) and then reimplement it later following the optimized implementation in TurboMind. The other is to reimplement it directly (which can build on some existing components); we'll discuss this at that time. @lzhangzz cc @irexyc @lvhan028

zhyncs · Aug 17 '24 18:08

And the difference should not be significant on A100. I have roughly verified this using SGLang's Marlin AWQ and LMDeploy TurboMind's AWQ on Llama 3.1 8B Instruct, and their performance is basically close (though I don't remember whether LMDeploy had already fixed that chunked prefill bug at the time).

zhyncs · Aug 17 '24 18:08

And the difference should not be significant on A100. I have roughly verified this using SGLang's Marlin AWQ and LMDeploy TurboMind's AWQ on Llama 3.1 8B Instruct, and their performance is basically close (though I don't remember whether LMDeploy had already fixed that chunked prefill bug at the time).

Those are the old AWQ kernels. The new kernels achieve 26+ RPS on A100 with Llama 3.1 8B.

lzhangzz · Aug 19 '24 07:08

@HandH1998 Could you resolve the conflicts in the next few days? After that, @lzhangzz will help rewrite it in TurboMind's style. Let's move forward together.

zhyncs · Aug 26 '24 12:08

@HandH1998 Could you resolve the conflicts in the next few days? After that, @lzhangzz will help rewrite it in TurboMind's style. Let's move forward together.

I am working on it. Since the main branch has changed a lot, I still need time to resolve the conflicts and fix new bugs. I can probably finish it in two days.

HandH1998 · Aug 27 '24 09:08

@zhyncs @lzhangzz I have resolved the conflicts, and you can continue with the optimization work. Two checks failed, but I think they are unrelated to my code.

HandH1998 · Aug 30 '24 03:08