GPTQ / Quantization support?
Will vLLM support 4-bit GPTQ models?
Thanks for the feature request! Quantization is not currently supported, but it's definitely on our roadmap. Please stay tuned.
How do I best go about tracking this? Is there a discord or public roadmap somewhere I can look at?
See Roadmap here: https://github.com/vllm-project/vllm/issues/244
I looked into this a bit today and it seems straightforward to integrate AutoGPTQ into vLLM, so I implemented a preliminary version for LLaMA (see this commit) and did a few benchmarks on a single A100-80G. I don't know why, but it's slower than expected.
python benchmark_throughput.py --model TheBloke/Llama-2-13B-chat-GPTQ --dataset ShareGPT_V3_unfiltered_cleaned_split.json
Model | Throughput (requests/s) | Throughput (tokens/s)
---|---|---
meta-llama/Llama-2-13b-chat-hf | 4.00 | 1915
TheBloke/Llama-2-13B-chat-GPTQ | 3.32 | 1587
TheBloke/Llama-2-70B-chat-GPTQ | 1.09 | 519
What's the baseline with the normal version?
If you mean the throughput: in the above table, TheBloke/Llama-2-13B-chat-GPTQ is quantized from meta-llama/Llama-2-13b-chat-hf, and the throughput is about 17% lower.
I dug into the kernel code of the quant linear layer and found that it falls back to dequantization followed by an fp16 matrix multiplication when the batch size is larger than 8, so the performance degradation is understandable.
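For readers unfamiliar with that fallback, here is a minimal sketch of the idea. It is not the actual AutoGPTQ kernel: the weights are shown already unpacked and the scales/zeros are assumed per-channel for simplicity.

```python
import torch

# Rough sketch of the batch-size fallback described above (not the actual
# AutoGPTQ kernel): for large batches the 4-bit weights are dequantized to
# fp16 once and a plain fp16 GEMM is used. Shapes are simplified assumptions
# (unpacked weights, per-channel scales/zeros), not the real packed layout.
def quant_linear_forward(x, w_int, scales, zeros, bias=None,
                         fused_kernel=None, batch_threshold=8):
    # x:      [batch, in_features]          fp16 activations
    # w_int:  [in_features, out_features]   unpacked 4-bit weight values (0..15)
    # scales: [out_features]                fp16 per-channel scale
    # zeros:  [out_features]                fp16 per-channel zero point
    if fused_kernel is not None and x.shape[0] <= batch_threshold:
        # Small batches: a fused dequantize+matmul kernel wins.
        return fused_kernel(x, w_int, scales, zeros, bias)
    # Large batches: dequantize once, then fall back to an fp16 matmul.
    w_fp16 = (w_int.to(torch.float16) - zeros) * scales
    out = x @ w_fp16
    return out if bias is None else out + bias
```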
As an update, I added a tensor-parallel QuantLinear layer and supported most AutoGPTQ-compatible models in this branch. The code has not been thoroughly tested yet because there are far too many combinations of model architectures and GPTQ settings.
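To illustrate what a tensor-parallel QuantLinear involves with GPTQ checkpoints, here is a rough, hypothetical sketch of column-parallel sharding. It assumes the usual 4-bit GPTQ packing conventions and is only an illustration of the idea, not the code in that branch.

```python
import torch

# Sketch of column-parallel sharding for a GPTQ QuantLinear layer: the output
# features are split across tensor-parallel ranks. Shapes assume the typical
# 4-bit GPTQ checkpoint layout (qweight packed along the input dim, qzeros
# packed along the output dim). Illustration only, not the vllm-gptq code.
def shard_gptq_column_parallel(qweight, qzeros, scales, tp_rank, tp_size):
    # qweight: [in_features // 8, out_features]   int32, 8 x 4-bit per int32
    # qzeros:  [num_groups, out_features // 8]    int32, 8 x 4-bit per int32
    # scales:  [num_groups, out_features]         fp16
    out_features = scales.shape[1]
    shard = out_features // tp_size
    assert shard % 8 == 0, "each shard must stay 8-aligned for the packed zeros"
    start, end = tp_rank * shard, (tp_rank + 1) * shard
    return (
        qweight[:, start:end],            # packed weights split by output column
        qzeros[:, start // 8: end // 8],  # zeros are packed along the output dim
        scales[:, start:end],             # scales are unpacked along the output dim
    )
```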
@chu-tianxiang I tried forking your vllm-gptq branch and successfully deployed the TheBloke/Llama-2-13b-Chat-GPTQ model. However, when I tried the TheBloke/Llama-2-7b-Chat-GPTQ model, it threw the following exception whenever I made a query to the model. I wonder if the issue is with the model itself or something else. I'll dig further into this when I have the chance, but it looks like the Sampler was generating a probability tensor with invalid values:
INFO: 13.229.18.8:52663 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/ubuntu/vllm-gptq/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
task.result()
File "/home/ubuntu/vllm-gptq/vllm/engine/async_llm_engine.py", line 351, in run_engine_loop
has_requests_in_progress = await self.engine_step()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/vllm-gptq/vllm/engine/async_llm_engine.py", line 330, in engine_step
request_outputs = await self.engine.step_async()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/vllm-gptq/vllm/engine/async_llm_engine.py", line 191, in step_async
output = await self._run_workers_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/vllm-gptq/vllm/engine/async_llm_engine.py", line 220, in _run_workers_async
all_outputs = await asyncio.gather(*all_outputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 684, in _wrap_awaitable
return (yield from awaitable.__await__())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(RuntimeError): ray::RayWorker.execute_method() (pid=68296, ip=172.31.40.30, actor_id=3b90ca9f90ebf20a67ae6c2c01000000, repr=<vllm.engine.ray_utils.RayWorker object at 0x7ef09122dad0>)
File "/home/ubuntu/vllm-gptq/vllm/engine/ray_utils.py", line 32, in execute_method
return executor(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.pyenv/versions/vllm-gptq/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/vllm-gptq/vllm/worker/worker.py", line 305, in execute_model
output = self.model(
^^^^^^^^^^^
File "/home/ubuntu/.pyenv/versions/vllm-gptq/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/vllm-gptq/vllm/model_executor/models/llama.py", line 296, in forward
next_tokens = self.sampler(self.lm_head.weight, hidden_states,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.pyenv/versions/vllm-gptq/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/vllm-gptq/vllm/model_executor/layers/sampler.py", line 85, in forward
return _sample(probs, logprobs, input_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/vllm-gptq/vllm/model_executor/layers/sampler.py", line 451, in _sample
sample_results = _random_sample(seq_groups, is_prompts,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/vllm-gptq/vllm/model_executor/layers/sampler.py", line 342, in _random_sample
random_samples = torch.multinomial(probs,
^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
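In case it helps with debugging: a small, hypothetical check placed just before the torch.multinomial call can tell whether the invalid values already appear in the logits (which would point at the quantized weights) or only after the softmax/temperature step. This is purely a debugging aid, not part of vLLM.

```python
import torch

# Hypothetical debugging helper (not vLLM code): call right before
# torch.multinomial in the sampler to see whether the NaN/inf comes from the
# logits (e.g. broken dequantized weights) or only from the softmax step.
def check_probs(probs: torch.Tensor, logits: torch.Tensor) -> None:
    bad = ~torch.isfinite(probs) | (probs < 0)
    if bad.any():
        rows = bad.any(dim=-1).nonzero(as_tuple=True)[0].tolist()
        raise RuntimeError(
            f"invalid probabilities in rows {rows}; "
            f"all logits finite: {torch.isfinite(logits).all().item()}, "
            f"max |logit|: {logits.abs().max().item():.3e}"
        )
```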
Hi, if anyone wants to try GPTQ quantization in vLLM, please use this repo, QLLM, to quantize the model (LLaMA); the result will be compatible with AWQ in vLLM. And of course you can select AWQ to quantize it as well.
Is baichuan-gptq supported?
Is acceleration for Qwen-72B-Chat-Int4 supported?
Something is off with this QLLM GPTQ quantization @wejoncy ... not all the dependencies are specified in the requirements file. Also, I tried quantizing three times, and every time it breaks when it tries to save the file or when the quant is done. Tried Llama-2-70B and Mistral 7B.
Hi, thanks for trying this out and sorry for the inconvenience. This bug has been fixed in the latest main. For now, you have two ways to use the GPTQ quant method in vLLM with the qllm tool (a minimal loading sketch follows this list):
- For Llama-family models, convert to AWQ if you didn't enable act_order, bits == 4, and there are no mixed bits inside.
- Use GPTQ directly, but the GPTQ branch in vLLM is still on its way to being merged.
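For reference, loading an AWQ-format checkpoint produced this way should look roughly like the following with vLLM's Python API; the model name is only a placeholder.

```python
from vllm import LLM, SamplingParams

# Minimal sketch: serving an AWQ-quantized checkpoint with vLLM's Python API.
# The model name is a placeholder for whatever AWQ-format checkpoint you
# produced or downloaded.
llm = LLM(model="TheBloke/Llama-2-13B-chat-AWQ", quantization="awq")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```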
Is there any update on 8-bit support? That would help Mixtral generate usable outputs on a single (non-overpriced) GPU.
I have successfully used both GPTQ and AWQ models with vLLM.
Should this issue be considered solved @WoosukKwon?
@hmellor it currently works with 4-bit, but not 8-bit. For now you have to use chu-tianxiang/vllm-gptq to get 8-bit support.
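For completeness, a 4-bit GPTQ checkpoint loads on mainline vLLM roughly like this (sketch; the model name is just an example):

```python
from vllm import LLM, SamplingParams

# Minimal sketch: a 4-bit GPTQ checkpoint on mainline vLLM. 8-bit GPTQ
# checkpoints still require the chu-tianxiang/vllm-gptq fork mentioned above.
llm = LLM(model="TheBloke/Llama-2-13B-chat-GPTQ", quantization="gptq")
out = llm.generate(["What is GPTQ?"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```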
Closing as this was resolved by #2330
Does vLLM support int-2 GPTQ models well?
Thank you very much!
@hmellor @singularity-sg @jacobwarren @wejoncy @Symbolk @WoosukKwon