GGUF support
Motivation
AWQ is nice, but if you want more control over the bit depth (and thus VRAM usage), GGUF may be a better option. A wide range of models is available from TheBloke at various bit depths, so everyone can use the largest one that fits into their GPU.
I cannot find a high-throughput batch inference engine that can load GGUF; maybe there is none. (vLLM cannot load it either.)
Related resources
https://github.com/ggerganov/llama.cpp
https://huggingface.co/TheBloke
FYI, high throughput is hard with quantized models in general, regardless of framework. But if you can manage to run with a batch size (data parallelism) of less than 8, or ideally less than 4, it will increase throughput. At 16, there is no gain with quantized models because of how quantized inference works.
Does the AWQ implementation support higher than 4 bits per weight, for example 8 bits?
Not yet. It’s 4-bit only at the moment.
Thank you.
Could you please point me to some technical details on what makes it hard to implement high throughput (batching, caching) and quantization (unpacking quantized data on demand) at the same time? These seem pretty orthogonal to me.
I'm digging into LLM (transformer) implementations and have past coding experience, so I'm really interested in the details.
Thank you.
Yes. The way quantization works is that weights are quantized to 4 bits. At inference time, you then run dequantization back to FP16 in order to perform the matrix multiplication. This dequantization is the essence of any quantized model, and it is why quantized models generally struggle at large batch sizes: you become compute-bound doing dequantization rather than the actual matrix multiplication. At batch size 1, you are memory-bound, which speeds up inference a great deal.
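As a toy illustration of that extra step, here is a minimal sketch of unpacking 4-bit weights and dequantizing them to fp16 before the matmul. The packing layout (two nibbles per byte, one scale per group of 8 weights) is made up for brevity; real formats like GGUF or AWQ use more elaborate block layouts.

```python
import numpy as np

rng = np.random.default_rng(0)
out_features, in_features, group = 16, 32, 8

# Pretend these came from a quantized checkpoint: packed 4-bit weights
# (two per byte) plus one fp16 scale per group of 8 weights.
packed = rng.integers(0, 256, size=(out_features, in_features // 2), dtype=np.uint8)
scales = rng.random((out_features, in_features // group), dtype=np.float32).astype(np.float16)

def dequantize(packed, scales):
    lo = (packed & 0x0F).astype(np.int8) - 8      # low nibble  -> [-8, 7]
    hi = (packed >> 4).astype(np.int8) - 8        # high nibble -> [-8, 7]
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    q[:, 0::2], q[:, 1::2] = lo, hi
    # Broadcast one scale over each group of 8 weights.
    s = np.repeat(scales.astype(np.float32), group, axis=1)
    return (q.astype(np.float32) * s).astype(np.float16)

x = rng.random((4, in_features)).astype(np.float16)  # batch of 4 activations
w = dequantize(packed, scales)                       # int4 -> fp16 (the extra step)
y = x @ w.T                                          # matmul still runs in fp16
print(y.shape)  # (4, 16)
```

The dequantize call is the work that fp16 inference never has to do; at batch size 1 it is hidden behind memory traffic, but it does not disappear as the batch grows.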
On your documentation page there is an excellent high-level summary of how to add support for a new model.
Could you please write down a few bullet points on where to look in the code (at a high level) if I want to add a new input format (GGUF)?
I would start by searching for all the code involved in reading model configuration and parameters.
I'm also aware that, at least for 8-bit operation, the dequantization code will need to be extended as well.
Also, what do you think about AWQ? Should we expect 8-bit support to be added to AWQ in the near future? That would make any work on GGUF pointless.
If I were to work on GGUF support, it would certainly be based on the AWQ branch. Which is the correct branch to look at? (There seem to be multiple, and I'm a bit confused.)
Thank you!
This would be a good starting point: https://github.com/oobabooga/text-generation-webui/blob/9331ab4798f392ee0634a4194a6bae370afd435f/modules/metadata_gguf.py
https://github.com/oobabooga/text-generation-webui/commit/9331ab4798f392ee0634a4194a6bae370afd435f#diff-4f7989a17db10e843396d051093357685cd8b4f69e186734e5be93e293689ae8
8-bit GGUF could be a perfect input format for the upcoming W8A8 inference mode, see #1112
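For orientation, the fixed-size header that a GGUF reader like the linked metadata_gguf.py has to parse first is small. Here is a hedged minimal sketch (field layout as used by recent llama.cpp versions: 4-byte magic, uint32 version, then uint64 tensor and metadata counts; the metadata key/value parsing that follows is the bulk of the real work):

```python
import struct
from io import BytesIO

def read_gguf_header(f):
    """Read the fixed GGUF header fields from a binary stream."""
    magic = f.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    version, = struct.unpack("<I", f.read(4))            # little-endian uint32
    tensor_count, = struct.unpack("<Q", f.read(8))       # little-endian uint64
    metadata_kv_count, = struct.unpack("<Q", f.read(8))
    return version, tensor_count, metadata_kv_count

# Fabricated header bytes for demonstration only
# (version 3, 291 tensors, 24 metadata key/value pairs).
fake = BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 291, 24))
print(read_gguf_header(fake))  # (3, 291, 24)
```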
> Yes. The way quantization works is that weights are quantized to 4 bits. Then at inference time, you run dequantization to FP16 to be able to perform matrix multiplication. This dequantization is the essence of any quantized model and is why quantized models generally struggle with large batch sizes because it becomes compute-bound doing dequantization rather than the actual matrix multiplication. At batch size 1, you are memory bound, which speeds up your inference by a great deal.
Could you please elaborate on this or point me to some code? At first glance, the amount of work is proportional to batch size in both cases (dequantization and matrix multiplication). Or is matrix multiplication implemented more efficiently at larger batches (with some CUDA tricks) while dequantization stays the same?
The amount of work scales linearly. The problem is when you increase the batch size too much, because then your GPU will be 100% utilized just doing matrix multiplication. Once that happens, the dequantization overhead will start showing up in how fast you can run inference compared to FP16.
This is also why finding algorithms that can quantize to W4A4 (very hard) or W8A8 (hard) is essential for higher throughput: you remove the need for dequantization, since you can run natively on tensor cores.
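A minimal sketch of the W8A8 idea, assuming symmetric per-tensor quantization for simplicity: both operands stay int8, the accumulation happens in int32 (which is what INT8 tensor cores do natively), and a single rescale happens at the end, so no per-weight dequantization step is needed inside the matmul.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_sym(x):
    """Symmetric per-tensor int8 quantization; returns (int8 values, scale)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

w = rng.standard_normal((16, 32)).astype(np.float32)   # weights
x = rng.standard_normal((4, 32)).astype(np.float32)    # activations

qw, sw = quantize_sym(w)
qx, sx = quantize_sym(x)

# int8 x int8 -> int32 accumulation; no dequantization inside the loop.
acc = qx.astype(np.int32) @ qw.T.astype(np.int32)
y = acc.astype(np.float32) * (sx * sw)                 # one rescale at the end

ref = x @ w.T
print(np.abs(y - ref).max())                           # small quantization error
```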
I understand the motivation for W4A4 and W8A8, as everything can be done solely in INT4/INT8.
But what I don't fully understand is the following:
- load weights in fp16 -> matrix multiplication in fp16
- load weights in int4 or int8 -> dequantization to fp16 -> matrix multiplication in fp16
If all the above steps scale linearly (performance-wise) with batch size, the second scenario must always be faster for any batch size if it's faster for batch_size=1. So I expect some non-linearity in one of the steps.
This is the difference between memory-bound and compute-bound. At small batch sizes, you are memory-bound, meaning you are limited by how fast you can move the model's weights around. This makes quantized models faster.
However, at large batch sizes we move away from being memory-bound. It is no longer a question of how fast we can transport the weights, but of how much time we spend doing computation. You have to think of it as being stuck computing matmuls rather than waiting for weights to be transported through memory.
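The non-linearity can be seen with a back-of-envelope roofline calculation for one linear layer. The bandwidth and FLOP figures below are assumed for illustration, not measured: the weight-streaming cost is paid once per forward pass regardless of batch size, while the compute cost grows with the batch, so at some batch size compute overtakes memory.

```python
# Per-weight costs for a linear layer (illustrative numbers, not measured).
bytes_per_weight_int4 = 0.5     # 4-bit weights: half a byte each
flops_per_weight = 2            # one multiply + one add per weight per token

mem_bw = 2e12                   # assumed HBM bandwidth: 2 TB/s
compute = 150e12                # assumed fp16 throughput: 150 TFLOP/s

for batch in [1, 4, 16, 64, 256]:
    # Time to stream one weight (shared across the whole batch) vs time to
    # do that weight's matmul work for `batch` tokens.
    t_mem = bytes_per_weight_int4 / mem_bw
    t_compute = batch * flops_per_weight / compute
    bound = "memory" if t_mem > t_compute else "compute"
    print(f"batch={batch:4d}: {bound}-bound")
```

With these assumed numbers the crossover lands just above batch 16, which matches the earlier observation that the quantization speedup is gone by that point.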
@casper-hansen thanks for the clarification. I'm still trying to connect the dots.
Is loading -> dequantization -> multiplication fused into a single kernel? Could you point me to some source code?
Weight loading happens at startup, and the weights are then transported through registers. The process is not really transparent, but it all happens in the quantization kernel, which you can find in csrc.
no GGUF support?
+1 GGUF
+1 for gguf please
Slowly we should go for EXL2 instead :)
Over the past two weeks, while learning the llama.cpp code and simultaneously writing a small repo that runs GGUF inference in PyTorch, I thought it would be good to make vLLM work too. Through a process of trial and error, I've managed to develop a preliminary draft of GGUF support, which you can find in the gguf branch. As of now, it only works for llama and mixtral.
First, convert the gguf to a torch state dict and tokenizer file using the code in the examples folder:
PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python convert_gguf_to_torch.py --input mixtral-8x7b-instruct-v0.1.Q2_K.gguf --output mixtral-q2k
Then start up the vllm server as usual
python -m vllm.entrypoints.api_server --model mixtral-q2k --quantization gguf
Note:
- llama.cpp quantizes both the embedding layer and the output layer. For simplicity, I currently de-quantize them before loading them into the model, but a more decent solution that loads them natively is definitely needed.
- llama.cpp implements two sets of kernels, WxA8 (see #2067 and #2160) and WxA16. I haven't read the WxA8 CUDA code yet and have only ported the WxA16 part, so latency may be inferior.
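The load-time workaround for the embedding table can be sketched as follows. The layout here (int8 values with one scale per row) is simplified for illustration rather than llama.cpp's actual block formats, and the names are hypothetical, not vLLM's loader API:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 64

# Pretend this is the quantized embedding table from the checkpoint:
# int8 values with one float scale per vocabulary row.
q_table = rng.integers(-127, 128, size=(vocab, dim), dtype=np.int8)
row_scales = rng.random((vocab, 1)).astype(np.float32)

# One-off dequantization at model load time ("de-quantized before loading").
# The whole fp16 table now lives in memory, trading VRAM for simplicity.
fp16_table = (q_table * row_scales).astype(np.float16)

# Inference-time lookup is then an ordinary fp16 embedding gather.
token_ids = np.array([1, 42, 7])
hidden = fp16_table[token_ids]
print(hidden.shape)  # (3, 64)
```

Loading the table natively would instead keep the packed data on the GPU and dequantize only the gathered rows per lookup.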
Great job, I will definitely try it as time allows.
My primary use case is running DeepSeek Coder 33B (alternatively CodeLlama 34B) with --tensor-parallel=2.
WxA16 likely has better quality (correctness) anyway, so that's a good first choice.
If this approach works well with GGUF, then supporting the EXL2 format may work as well.
+1
+1
I made a few updates and moved it to the default branch. Quantized embedding and output layers are now supported, as well as the WxA8 kernels; however, the performance improvement over WxA16 seems marginal. I also made the gguf-to-torch conversion implicit, so it's easier to use now:
python -m vllm.entrypoints.api_server --model miqu-1-70b.q2_K.gguf
The single-request latency is slightly lower than llama.cpp's. The packing of GGUF is very unfriendly to GPU memory access, which makes it slower than other quant methods. I haven't found a way to measure throughput with the llama.cpp server, so there is no throughput comparison yet. I'll try to turn this into a formal PR later.
Can you please check whether vLLM can infer miqu-1-70b-sf-gptq correctly? It reports OOM on my machine. (Other 70B GPTQ models like Qwen are fine.)
I have been trying the command below with vLLM version 0.3.0 on a Linux Ubuntu CPU machine.
python3 -m vllm.entrypoints.api_server --root-path models/ --model llama-2-7b.Q5_K_M.gguf --host 0.0.0.0 --port 8080
I'm facing the error below; any help would be appreciated.
File "/home/ubuntu/ragas/lib/python3.10/site-packages/transformers/utils/hub.py", line 406, in cached_file
    raise EnvironmentError(
OSError: llama-2-7b.Q5_K_M.gguf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with huggingface-cli login or by passing token=<your_token>
Hi, is mistral GGUF supported? I just tried but got this error:
OSError: It looks like the config file at '../models/FL1-base-7B-Q8_0.gguf' is not a valid JSON file.
Yes, I tested a mistral GGUF with no problem. Please be aware that you have to install the custom branch from source instead of the official build. As an alternative, you can use aphrodite-engine, which also integrates GGUF support and is easier to install.
+1 gguf support please!
For those of us who have downloaded a large archive of GGUF models, it would be a great benefit to use vLLM with the artifacts we already have available, rather than downloading fp16 or AWQ versions and consuming more disk space.
+1 gguf support please