
GPTQ / Quantization support?


Will vLLM support 4-bit GPTQ models?

nikshepsvn avatar Jun 21 '23 02:06 nikshepsvn

Thanks for the feature request! Quantization is not currently supported, but it's definitely on our roadmap. Please stay tuned.

WoosukKwon avatar Jun 21 '23 03:06 WoosukKwon

How do I best go about tracking this? Is there a discord or public roadmap somewhere I can look at?

nikshepsvn avatar Jun 22 '23 01:06 nikshepsvn

How do I best go about tracking this? Is there a discord or public roadmap somewhere I can look at?

See Roadmap here: https://github.com/vllm-project/vllm/issues/244

Symbolk avatar Aug 15 '23 06:08 Symbolk

I looked into this a bit today and it seems straightforward to integrate AutoGPTQ into vLLM, so I implemented a preliminary version for LLaMA (see this commit) and ran a few benchmarks on a single A100-80G. I don't know why, but it's slower than expected.

python benchmark_throughput.py --model TheBloke/Llama-2-13B-chat-GPTQ --dataset ShareGPT_V3_unfiltered_cleaned_split.json
Model                           Throughput (requests/s)  Throughput (tokens/s)
meta-llama/Llama-2-13b-chat-hf  4.00                      1915
TheBloke/Llama-2-13B-chat-GPTQ  3.32                      1587
TheBloke/Llama-2-70B-chat-GPTQ  1.09                      519

chu-tianxiang avatar Aug 23 '23 14:08 chu-tianxiang
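
Side note for anyone reproducing these numbers: the GPTQ settings baked into the checkpoint (bits, group size, act_order) determine which kernel path gets used, so it is worth checking them before benchmarking. A minimal sketch, assuming the checkpoint ships the usual AutoGPTQ-style quantize_config.json (the field names below are the common ones, not anything vLLM-specific):

import json
from huggingface_hub import hf_hub_download

# Fetch only the quantization config from the benchmarked checkpoint.
path = hf_hub_download("TheBloke/Llama-2-13B-chat-GPTQ", "quantize_config.json")
with open(path) as f:
    cfg = json.load(f)

print({k: cfg.get(k) for k in ("bits", "group_size", "desc_act")})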

I looked into this a bit today and it seems straightforward to integrate AutoGPTQ into vLLM, so I implemented a preliminary version for LLaMA (see this commit) and ran a few benchmarks on a single A100-80G. I don't know why, but it's slower than expected.

python benchmark_throughput.py --model TheBloke/Llama-2-13B-chat-GPTQ --dataset ShareGPT_V3_unfiltered_cleaned_split.json

Model                           Throughput (requests/s)  Throughput (tokens/s)
meta-llama/Llama-2-13b-chat-hf  4.00                      1915
TheBloke/Llama-2-13B-chat-GPTQ  3.32                      1587
TheBloke/Llama-2-70B-chat-GPTQ  1.09                      519

What's the baseline with the normal (non-quantized) version?

osilverstein avatar Aug 25 '23 05:08 osilverstein

I looked into this a bit today and it seems straightforward to integrate AutoGPTQ into vLLM, so I implemented a preliminary version for LLaMA (see this commit) and ran a few benchmarks on a single A100-80G. I don't know why, but it's slower than expected.

python benchmark_throughput.py --model TheBloke/Llama-2-13B-chat-GPTQ --dataset ShareGPT_V3_unfiltered_cleaned_split.json

Model                           Throughput (requests/s)  Throughput (tokens/s)
meta-llama/Llama-2-13b-chat-hf  4.00                      1915
TheBloke/Llama-2-13B-chat-GPTQ  3.32                      1587
TheBloke/Llama-2-70B-chat-GPTQ  1.09                      519

What's the baseline with the normal (non-quantized) version?

If you mean the throughput: in the table above, TheBloke/Llama-2-13B-chat-GPTQ is quantized from meta-llama/Llama-2-13b-chat-hf, and its throughput is about 17% lower.

I dug into the kernel code of the quant linear layer and found that it falls back to dequantization followed by an fp16 matrix multiplication when the batch size is larger than 8, so the performance degradation is understandable.

chu-tianxiang avatar Aug 25 '23 11:08 chu-tianxiang
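
For readers unfamiliar with that fallback, it amounts to materializing an fp16 weight matrix and running an ordinary GEMM. The sketch below illustrates the dispatch in plain PyTorch; it is not the actual AutoGPTQ kernel, and the per-column quantization, unpacked int8 weights, and the batch-size-8 threshold are simplifications:

import torch

def quant_linear_forward(x, qweight, scales, zeros, batch_threshold=8):
    # Illustrative only: per-output-column scales/zeros (real GPTQ uses groups)
    # and unpacked int8 weights (real kernels keep 8 x int4 packed per int32).
    if x.shape[0] <= batch_threshold:
        # Small batches: this is where a fused int4 kernel would read the
        # packed weights directly (elided in this sketch).
        pass
    # Larger batches: dequantize once, then fall back to a plain matmul.
    # The extra dequantization pass is why throughput trails the fp16 baseline.
    w = ((qweight.to(torch.float32) - zeros) * scales).to(x.dtype)
    return x @ w

# Toy shapes only; on a GPU the activations and dequantized weights would be fp16.
x = torch.randn(16, 4096)                                       # batch of 16 > threshold
qweight = torch.randint(0, 16, (4096, 11008), dtype=torch.int8)
scales = torch.rand(11008)
zeros = torch.full((11008,), 8.0)
print(quant_linear_forward(x, qweight, scales, zeros).shape)    # torch.Size([16, 11008])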

As an update, I added a tensor-parallel QuantLinear layer and support for most AutoGPTQ-compatible models in this branch. The code has not been thoroughly tested yet because there are far too many combinations of model architectures and GPTQ settings.

chu-tianxiang avatar Aug 26 '23 15:08 chu-tianxiang
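
For context, making QuantLinear tensor-parallel mainly means sharding the quantized weight together with its scales and zero-points so every rank sees a consistent slice. A minimal column-parallel sketch, assuming unpacked weights and per-column parameters (real GPTQ tensors are bit-packed and grouped, which constrains where the splits may fall):

import torch

def shard_quant_linear(qweight, scales, zeros, tp_size, rank):
    # Column-parallel sharding sketch: split along the output dimension so each
    # rank holds a slice of the quantized weight plus the matching scales and
    # zero-points. Real GPTQ tensors are bit-packed and grouped, so splits must
    # also land on pack/group boundaries; that detail is ignored here.
    out_features = qweight.shape[1]
    assert out_features % tp_size == 0, "output dim must divide evenly across ranks"
    shard = out_features // tp_size
    cols = slice(rank * shard, (rank + 1) * shard)
    return qweight[:, cols], scales[cols], zeros[cols]

# Toy check that two "ranks" together cover the full output dimension.
qweight = torch.randint(0, 16, (4096, 11008), dtype=torch.int8)
scales, zeros = torch.rand(11008), torch.full((11008,), 8.0)
shards = [shard_quant_linear(qweight, scales, zeros, tp_size=2, rank=r) for r in range(2)]
assert sum(s[0].shape[1] for s in shards) == qweight.shape[1]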

@chu-tianxiang I tried forking your vllm-gptq branch and successfully deployed the TheBloke/Llama-2-13b-Chat-GPTQ model. However, when I tried the TheBloke/Llama-2-7b-Chat-GPTQ model, it threw the following exception whenever I made a query to the model. I wonder if the issue is with the model itself or something else. I'll dig further into this when I have the chance, but it looks like the Sampler produced a probability tensor with invalid values.

INFO:     13.229.18.8:52663 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/ubuntu/vllm-gptq/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
    task.result()
  File "/home/ubuntu/vllm-gptq/vllm/engine/async_llm_engine.py", line 351, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
                               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/vllm-gptq/vllm/engine/async_llm_engine.py", line 330, in engine_step
    request_outputs = await self.engine.step_async()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/vllm-gptq/vllm/engine/async_llm_engine.py", line 191, in step_async
    output = await self._run_workers_async(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/vllm-gptq/vllm/engine/async_llm_engine.py", line 220, in _run_workers_async
    all_outputs = await asyncio.gather(*all_outputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 684, in _wrap_awaitable
    return (yield from awaitable.__await__())
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(RuntimeError): ray::RayWorker.execute_method() (pid=68296, ip=172.31.40.30, actor_id=3b90ca9f90ebf20a67ae6c2c01000000, repr=<vllm.engine.ray_utils.RayWorker object at 0x7ef09122dad0>)
  File "/home/ubuntu/vllm-gptq/vllm/engine/ray_utils.py", line 32, in execute_method
    return executor(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.pyenv/versions/vllm-gptq/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/vllm-gptq/vllm/worker/worker.py", line 305, in execute_model
    output = self.model(
             ^^^^^^^^^^^
  File "/home/ubuntu/.pyenv/versions/vllm-gptq/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/vllm-gptq/vllm/model_executor/models/llama.py", line 296, in forward
    next_tokens = self.sampler(self.lm_head.weight, hidden_states,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.pyenv/versions/vllm-gptq/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/vllm-gptq/vllm/model_executor/layers/sampler.py", line 85, in forward
    return _sample(probs, logprobs, input_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/vllm-gptq/vllm/model_executor/layers/sampler.py", line 451, in _sample
    sample_results = _random_sample(seq_groups, is_prompts,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/vllm-gptq/vllm/model_executor/layers/sampler.py", line 342, in _random_sample
    random_samples = torch.multinomial(probs,
                     ^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

singularity-sg avatar Sep 29 '23 01:09 singularity-sg
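
The RuntimeError at the bottom means torch.multinomial received a probability row containing inf, NaN, or negative values, which usually traces back to non-finite logits from the quantized forward pass rather than to the sampler itself. A small debugging check one could drop in front of the sampling call (a sketch, not vLLM code):

import torch

def check_probs(probs):
    # Debugging aid (not part of vLLM): flag rows that would make
    # torch.multinomial raise "probability tensor contains either inf, nan
    # or element < 0", so the offending request can be identified.
    bad = ~torch.isfinite(probs).all(dim=-1) | (probs < 0).any(dim=-1)
    if bad.any():
        raise ValueError(f"invalid probability rows: {bad.nonzero().flatten().tolist()}")

# Example: a single NaN logit (e.g. from an overflowing quantized matmul)
# poisons the whole softmax row.
logits = torch.tensor([[1.0, 2.0, float("nan")], [1.0, 2.0, 3.0]])
check_probs(torch.softmax(logits, dim=-1))  # raises ValueError for row 0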

Hi, if anyone wants to try GPTQ quantization in vLLM, please use this repo, QLLM, to quantize a model (LLaMA), and it will be compatible with AWQ in vLLM. And of course you can select AWQ to quantize it as well.

wejoncy avatar Oct 25 '23 11:10 wejoncy
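
Since the exact QLLM command line may change between versions, here is a hedged alternative sketch that produces a 4-bit GPTQ checkpoint with AutoGPTQ (the library referenced earlier in this thread); the model name, output directory, and single calibration sample are placeholders:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained = "meta-llama/Llama-2-7b-chat-hf"   # placeholder model
out_dir = "llama-2-7b-chat-gptq"               # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(pretrained, use_fast=True)
quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # per-group scales/zero-points
    desc_act=False,  # act_order disabled
)
model = AutoGPTQForCausalLM.from_pretrained(pretrained, quantize_config)

# A single calibration sample keeps this sketch short; use a few hundred
# representative samples for a checkpoint you actually intend to serve.
examples = [tokenizer("vLLM is a high-throughput LLM serving engine.", return_tensors="pt")]
model.quantize(examples)

model.save_quantized(out_dir, use_safetensors=True)
tokenizer.save_pretrained(out_dir)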

Is Baichuan-GPTQ supported?

David-Lee-1990 avatar Dec 07 '23 09:12 David-Lee-1990

Is accelerated inference for Qwen-72B-Chat-Int4 supported?

sssuperrrr avatar Dec 08 '23 07:12 sssuperrrr

Hi, if anyone wants to try GPTQ quantization in vLLM, please use this repo, QLLM, to quantize a model (LLaMA), and it will be compatible with AWQ in vLLM. And of course you can select AWQ to quantize it as well.

Something is off with this QLLM GPTQ quantization @wejoncy ... not all of the dependencies are specified in the requirements file. Also, I tried quantizing three times and every time it breaks when it tries to save the file once quantization is done. Tried Llama-2-70B and Mistral-7B.

[Screenshot of the error, 2023-12-09 15:21:57]

uncensorie avatar Dec 09 '23 23:12 uncensorie

Hi, if anyone wants to try GPTQ quantization in vLLM, please use this repo, QLLM, to quantize a model (LLaMA), and it will be compatible with AWQ in vLLM. And of course you can select AWQ to quantize it as well.

Something is off with this QLLM GPTQ quantization @wejoncy ... not all of the dependencies are specified in the requirements file. Also, I tried quantizing three times and every time it breaks when it tries to save the file once quantization is done. Tried Llama-2-70B and Mistral-7B.

Hi, thanks for trying this out and sorry for the inconvenience. This bug has been fixed in the latest main. For now, you have two ways to use the GPTQ quant method in vLLM with the QLLM tool:

  1. For Llama-family models, convert to AWQ if you didn't enable act_order, set bits==4, and there are no mixed bits inside (a loading sketch follows below).
  2. Use GPTQ directly, but the GPTQ branch in vLLM is still on its way to being merged.

wejoncy avatar Dec 10 '23 02:12 wejoncy
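
Once a checkpoint has been converted to AWQ as in option 1, loading it in vLLM looks roughly like the sketch below; the model name is a placeholder:

from vllm import LLM, SamplingParams

# Placeholder AWQ checkpoint; any AWQ-format repo should load the same way.
llm = LLM(model="TheBloke/Llama-2-13B-chat-AWQ", quantization="awq")
outputs = llm.generate(
    ["What does AWQ quantization trade off against fp16?"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)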

Is there any update on 8-bit support? That would help Mixtral generate usable outputs on a single (non-overpriced) GPU.

jacobwarren avatar Feb 02 '24 05:02 jacobwarren

I have successfully used both GPTQ and AWQ models with vLLM.

Should this issue be considered solved @WoosukKwon?

hmellor avatar Feb 02 '24 17:02 hmellor

@hmellor it currently works with 4-bit, but not 8-bit. Currently you have to use chu-tianxiang/vllm-gptq to get 8-bit support.

jacobwarren avatar Feb 03 '24 22:02 jacobwarren

Closing as this was resolved by #2330

hmellor avatar Mar 06 '24 09:03 hmellor
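
For anyone landing on this issue later: with GPTQ support merged via #2330, loading a GPTQ checkpoint goes through the standard LLM entry point. A minimal usage sketch (the model name is just an example):

from vllm import LLM, SamplingParams

# With #2330 merged, a GPTQ checkpoint loads through the same entry point.
# vLLM can also auto-detect the quantization method from the checkpoint's
# config, so the explicit argument is mostly a safeguard.
llm = LLM(model="TheBloke/Llama-2-13B-chat-GPTQ", quantization="gptq")
outputs = llm.generate(
    ["Summarize the benefits of 4-bit GPTQ quantization."],
    SamplingParams(temperature=0.8, max_tokens=128),
)
print(outputs[0].outputs[0].text)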

Does vLLM fully support INT2 GPTQ models?

Thank you very much!

SuperBruceJia avatar Jun 27 '24 21:06 SuperBruceJia

@hmellor @singularity-sg @jacobwarren @wejoncy @Symbolk @WoosukKwon

SuperBruceJia avatar Jun 27 '24 21:06 SuperBruceJia