GGUF support
Motivation
AWQ is nice, but if you want more control over the bit depth (and thus VRAM usage), GGUF may be a better option. A wide range of models is available from TheBloke at various bit depths, so everyone can use the largest one that fits into their GPU.
I cannot find a high-throughput batch inference engine that can load GGUF; maybe there is none. (vLLM cannot load it either.)
Related resources
https://github.com/ggerganov/llama.cpp
https://huggingface.co/TheBloke
FYI, high throughput is hard with quantized models in general, regardless of framework. But if you can manage to run with a batch size (data parallelism) of less than 8, or ideally less than 4, it will increase throughput. At 16, there is no gain with quantized models because of how quantized inference works.
Does the AWQ implementation support higher than 4 bits per weight, for example 8 bits?
Not yet. It’s 4-bit only at the moment.
Thank you.
Could you please point me to some technical details on what makes it hard to implement high throughput (batching, caching) and quantization (unpacking quantized data on demand) at the same time? These seem pretty orthogonal to me.
I'm digging into LLM (transformer) implementations and have past coding experience, so I'm really interested in the details.
Thank you.
Yes. The way quantization works is that weights are quantized to 4 bits. At inference time, you then run dequantization back to FP16 in order to perform the matrix multiplication. This dequantization is the essence of any quantized model, and it is why quantized models generally struggle at large batch sizes: you become compute-bound doing dequantization rather than the actual matrix multiplication. At batch size 1, you are memory-bound, which speeds up inference a great deal.
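As a toy illustration of that extra step, here is a minimal sketch of unpacking 4-bit weights and dequantizing them to fp16 before the matmul. The packing layout (two nibbles per byte, one scale per group of 8 weights) is made up for brevity; real formats like GGUF or AWQ use more elaborate block layouts.

```python
import numpy as np

rng = np.random.default_rng(0)
out_features, in_features, group = 16, 32, 8

# Pretend these came from a quantized checkpoint: packed 4-bit weights
# (two per byte) plus one fp16 scale per group of 8 weights.
packed = rng.integers(0, 256, size=(out_features, in_features // 2), dtype=np.uint8)
scales = rng.random((out_features, in_features // group), dtype=np.float32).astype(np.float16)

def dequantize(packed, scales):
    lo = (packed & 0x0F).astype(np.int8) - 8      # low nibble  -> [-8, 7]
    hi = (packed >> 4).astype(np.int8) - 8        # high nibble -> [-8, 7]
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    q[:, 0::2], q[:, 1::2] = lo, hi
    # Broadcast one scale over each group of 8 weights.
    s = np.repeat(scales.astype(np.float32), group, axis=1)
    return (q.astype(np.float32) * s).astype(np.float16)

x = rng.random((4, in_features)).astype(np.float16)  # batch of 4 activations
w = dequantize(packed, scales)                       # int4 -> fp16 (the extra step)
y = x @ w.T                                          # matmul still runs in fp16
print(y.shape)  # (4, 16)
```

The dequantize call is the work that fp16 inference never has to do; at batch size 1 it is hidden behind memory traffic, but it does not disappear as the batch grows.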
On your documentation page there is an excellent high-level summary of how to add support for a new model.
Could you please write down a few bullet points on where to look in the code (at a high level) if I want to add a new input format (GGUF)?
I would start by searching for all the code involved in reading model configuration and parameters.
I'm also aware that, at least for 8-bit operation, the dequantization code will need to be extended as well.
Also, what do you think about AWQ? Should we expect 8-bit support to be added to AWQ in the near future? That would make any work on GGUF pointless.
If I were to work on GGUF support, it would certainly be based on the AWQ branch. Which is the correct branch to look at? (There seem to be multiple, and I'm a bit confused.)
Thank you!
This would be a good starting point: https://github.com/oobabooga/text-generation-webui/blob/9331ab4798f392ee0634a4194a6bae370afd435f/modules/metadata_gguf.py
https://github.com/oobabooga/text-generation-webui/commit/9331ab4798f392ee0634a4194a6bae370afd435f#diff-4f7989a17db10e843396d051093357685cd8b4f69e186734e5be93e293689ae8
8-bit GGUF could be a perfect input format for the upcoming W8A8 inference mode, see #1112
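For orientation, the fixed-size header that a GGUF reader like the linked metadata_gguf.py has to parse first is small. Here is a hedged minimal sketch (field layout as used by recent llama.cpp versions: 4-byte magic, uint32 version, then uint64 tensor and metadata counts; the metadata key/value parsing that follows is the bulk of the real work):

```python
import struct
from io import BytesIO

def read_gguf_header(f):
    """Read the fixed GGUF header fields from a binary stream."""
    magic = f.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    version, = struct.unpack("<I", f.read(4))            # little-endian uint32
    tensor_count, = struct.unpack("<Q", f.read(8))       # little-endian uint64
    metadata_kv_count, = struct.unpack("<Q", f.read(8))
    return version, tensor_count, metadata_kv_count

# Fabricated header bytes for demonstration only
# (version 3, 291 tensors, 24 metadata key/value pairs).
fake = BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 291, 24))
print(read_gguf_header(fake))  # (3, 291, 24)
```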
> Yes. The way quantization works is that weights are quantized to 4 bits. Then at inference time, you run dequantization to FP16 to be able to perform matrix multiplication. This dequantization is the essence of any quantized model and is why quantized models generally struggle with large batch sizes because it becomes compute-bound doing dequantization rather than the actual matrix multiplication. At batch size 1, you are memory bound, which speeds up your inference by a great deal.
Could you please elaborate on this or point me to some code? At first glance, the amount of work is proportional to batch size in both cases (dequantization and matrix multiplication). Or is matrix multiplication implemented more efficiently at larger batches (with some CUDA tricks) while dequantization stays the same?
The amount of work scales linearly. The problem is when you increase the batch size too much, because then your GPU will be 100% utilized just doing matrix multiplication. Once that happens, the dequantization overhead will start showing up in how fast you can run inference compared to FP16.
This is also why finding algorithms that can quantize to W4A4 (very hard) or W8A8 (hard) is essential for higher throughput: you remove the need for dequantization, since you can run natively on tensor cores.
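A minimal sketch of the W8A8 idea, assuming symmetric per-tensor quantization for simplicity: both operands stay int8, the accumulation happens in int32 (which is what INT8 tensor cores do natively), and a single rescale happens at the end, so no per-weight dequantization step is needed inside the matmul.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_sym(x):
    """Symmetric per-tensor int8 quantization; returns (int8 values, scale)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

w = rng.standard_normal((16, 32)).astype(np.float32)   # weights
x = rng.standard_normal((4, 32)).astype(np.float32)    # activations

qw, sw = quantize_sym(w)
qx, sx = quantize_sym(x)

# int8 x int8 -> int32 accumulation; no dequantization inside the loop.
acc = qx.astype(np.int32) @ qw.T.astype(np.int32)
y = acc.astype(np.float32) * (sx * sw)                 # one rescale at the end

ref = x @ w.T
print(np.abs(y - ref).max())                           # small quantization error
```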
I understand the motivation for W4A4 and W8A8, as everything can be done solely in INT4/INT8.
But what I don't fully understand is the following:
- load weights in fp16 -> matrix multiplication in fp16
- load weights in int4 or int8 -> dequantization to fp16 -> matrix multiplication in fp16
If all the above steps scale linearly (performance-wise) with batch size, the second scenario must always be faster for any batch size if it's faster for batch_size=1. So I expect some non-linearity in one of the steps.
This is the difference between memory-bound and compute-bound. At small batch sizes, you are memory-bound, meaning you are limited by how fast you can move the model's weights around. This makes quantized models faster.
However, at large batch sizes we move away from being memory-bound. It is no longer a question of how fast we can transport the weights, but of how much time we spend doing computation. You have to think of it as being stuck computing matmuls rather than waiting for weights to be transported through memory.
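The non-linearity can be seen with a back-of-envelope roofline calculation for one linear layer. The bandwidth and FLOP figures below are assumed for illustration, not measured: the weight-streaming cost is paid once per forward pass regardless of batch size, while the compute cost grows with the batch, so at some batch size compute overtakes memory.

```python
# Per-weight costs for a linear layer (illustrative numbers, not measured).
bytes_per_weight_int4 = 0.5     # 4-bit weights: half a byte each
flops_per_weight = 2            # one multiply + one add per weight per token

mem_bw = 2e12                   # assumed HBM bandwidth: 2 TB/s
compute = 150e12                # assumed fp16 throughput: 150 TFLOP/s

for batch in [1, 4, 16, 64, 256]:
    # Time to stream one weight (shared across the whole batch) vs time to
    # do that weight's matmul work for `batch` tokens.
    t_mem = bytes_per_weight_int4 / mem_bw
    t_compute = batch * flops_per_weight / compute
    bound = "memory" if t_mem > t_compute else "compute"
    print(f"batch={batch:4d}: {bound}-bound")
```

With these assumed numbers the crossover lands just above batch 16, which matches the earlier observation that the quantization speedup is gone by that point.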
@casper-hansen thanks for the clarification. I'm still trying to connect the dots.
Is loading -> dequantization -> multiplication fused into a single kernel? Could you point me to some source code?
Weight loading happens at startup, and the weights are then transported through registers. The process is not really transparent, but it all happens in the quantization kernel, which you can find in csrc.
no GGUF support?
+1 GGUF
+1 for gguf please
Slowly we should go for EXL2 instead :)
Over the past two weeks, while learning the llama.cpp code and simultaneously writing a small repo that runs GGUF inference in PyTorch, I thought it would be good to make vLLM work too. Through a process of trial and error, I've managed to develop a preliminary draft of GGUF support, which you can find in the gguf branch. As of now, it only works for llama and mixtral.
First, convert the gguf to a torch state dict and tokenizer file using the code in the examples folder:
PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python convert_gguf_to_torch.py --input mixtral-8x7b-instruct-v0.1.Q2_K.gguf --output mixtral-q2k
Then start up the vllm server as usual
python -m vllm.entrypoints.api_server --model mixtral-q2k --quantization gguf
Note:
- llama.cpp quantizes both the embedding layer and the output layer. For simplicity, I currently de-quantize them before loading them into the model, but a more decent solution that loads them natively is definitely needed.
- llama.cpp implements two sets of kernels, WxA8 (see #2067 and #2160) and WxA16. I haven't read the WxA8 CUDA code yet and have only ported the WxA16 part, so latency may be inferior.
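The load-time workaround for the embedding table can be sketched as follows. The layout here (int8 values with one scale per row) is simplified for illustration rather than llama.cpp's actual block formats, and the names are hypothetical, not vLLM's loader API:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 64

# Pretend this is the quantized embedding table from the checkpoint:
# int8 values with one float scale per vocabulary row.
q_table = rng.integers(-127, 128, size=(vocab, dim), dtype=np.int8)
row_scales = rng.random((vocab, 1)).astype(np.float32)

# One-off dequantization at model load time ("de-quantized before loading").
# The whole fp16 table now lives in memory, trading VRAM for simplicity.
fp16_table = (q_table * row_scales).astype(np.float16)

# Inference-time lookup is then an ordinary fp16 embedding gather.
token_ids = np.array([1, 42, 7])
hidden = fp16_table[token_ids]
print(hidden.shape)  # (3, 64)
```

Loading the table natively would instead keep the packed data on the GPU and dequantize only the gathered rows per lookup.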
Great job, I will definitely try it as time allows.
My primary use case is running DeepSeek Coder 33B (alternatively CodeLlama 34B) with --tensor-parallel=2.
WxA16 likely has better quality (correctness) anyway, so that's a good first choice.
If this approach works well with GGUF, then supporting the EXL2 format may work as well.
+1
+1
I made a few updates and moved it to the default branch. Quantized embedding and output layers are now supported, as well as the WxA8 kernels; however, the performance improvement over WxA16 seems marginal. I also made the gguf-to-torch conversion implicit, so it's easier to use now:
python -m vllm.entrypoints.api_server --model miqu-1-70b.q2_K.gguf
The single-request latency is slightly lower than llama.cpp's. The packing of GGUF is very unfriendly to GPU memory access, which makes it slower than other quant methods. I haven't found a way to measure throughput with the llama.cpp server, so there is no throughput comparison yet. I'll try to turn this into a formal PR later.
Can you please check whether vLLM can infer miqu-1-70b-sf-gptq correctly? It reports OOM on my machine. (Other 70B GPTQ models like Qwen are fine.)
I have been trying the command below with vLLM version 0.3.0 on a Linux Ubuntu CPU machine.
python3 -m vllm.entrypoints.api_server --root-path models/ --model llama-2-7b.Q5_K_M.gguf --host 0.0.0.0 --port 8080
I'm facing the error below; any help would be appreciated.
File "/home/ubuntu/ragas/lib/python3.10/site-packages/transformers/utils/hub.py", line 406, in cached_file
    raise EnvironmentError(
OSError: llama-2-7b.Q5_K_M.gguf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with huggingface-cli login or by passing token=<your_token>
Hi, is mistral GGUF supported? I just tried but got this error:
OSError: It looks like the config file at '../models/FL1-base-7B-Q8_0.gguf' is not a valid JSON file.
Yes, I tested a mistral GGUF with no problem. Please be aware that you have to install the custom branch from source instead of the official build. As an alternative, you can use aphrodite-engine, which also integrates GGUF support and is easier to install.
+1 gguf support please!
For those of us who have downloaded a large archive of GGUF models, it would be a great benefit to use vLLM with the artifacts we already have available, rather than downloading fp16 or AWQ versions and consuming more disk space.
+1 gguf support please