GPTQ-for-LLaMa
[Request] Mixed Precision Quantization
I believe we can achieve further optimisation beyond even 4-bit quantization by selectively quantizing specifically chosen layers down to 2 bits.
See: https://arxiv.org/abs/2203.08368
By selectively quantizing 50% of the layers down to 2 bits, it may even be possible to run 65B LLaMA on a 24 GB GPU.
I don't know precisely which layers would work best (it may be an arduous process of trial and error). Perhaps the best thing to do would be to let the user specify the level of quantization they desire for each layer.
4-bit is not the end of the road.
That scheme requires detecting which layers to train during the training process. We could try to fine-tune and test certain layer combinations for LLaMA-65B. I have a 32 GB Apple GPU and a good algorithm for evaluating the model without CUDA. My fast SSD could also pad the slight >0.5 GB memory overrun. Or I could start with 3 bits (which provides acceptable performance at 65B params) and consume 24.375 GB of memory.
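As a back-of-the-envelope check on those numbers (weights only, assuming exactly 65B parameters and ignoring quantization metadata like group-wise scales and zero points, activations, and the KV cache):

```python
# Rough weight-only memory footprint for LLaMA-65B at different bit widths.
# Ignores quantization metadata (scales/zeros), activations and the KV cache.
PARAMS = 65e9

def weight_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

print(f"4-bit uniform:         {weight_gb(4):.3f} GB")              # 32.500 GB
print(f"3-bit uniform:         {weight_gb(3):.3f} GB")              # 24.375 GB
print(f"50% 4-bit + 50% 2-bit: {weight_gb(0.5*4 + 0.5*2):.3f} GB")  # 24.375 GB
```

Note that quantizing half the layers from 4 bits down to 2 bits averages out to 3 bits/weight, i.e. the same 24.375 GB footprint as uniform 3-bit.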
We could also vary the quantization level within each layer. That may require a prefix sum within the shader, but it's theoretically possible to cache intermediate sums while achieving a net positive compression ratio. Finally, we could try lossless compression, decoding with a complex algorithm in the GPU shader.
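To illustrate the prefix-sum point (just a sketch, not this repo's kernel): once groups within a layer can have different bit widths, the packed buffer no longer has a fixed stride, so each group's starting bit offset is the cumulative sum of the sizes of all preceding groups - exactly the kind of table you would precompute or scan for in the shader.

```python
import numpy as np

# Hypothetical per-group bit widths within a single layer (group size 128 here
# is illustrative, mirroring GPTQ-style group-wise quantization).
group_bits = np.array([4, 2, 4, 3, 2, 4], dtype=np.int64)
group_size = 128

# Exclusive prefix sum of the packed sizes gives each group's start offset in
# bits; a GPU kernel would either read this small cached table or compute it
# with a parallel scan.
bits_per_group = group_bits * group_size
offsets = np.concatenate(([0], np.cumsum(bits_per_group)[:-1]))

print(offsets)                   # bit offset of each group's packed weights
print(bits_per_group.sum() / 8)  # packed bytes for this slice of the layer
```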
Wouldn't it be hard to do anything useful with the remaining VRAM, though? Fitting the model weights on the GPU is one thing, but to run inference you need quite a bit of VRAM on top for the key/value cache. You could theoretically dump the cache after each layer is processed, but then you're redoing the whole computation from scratch with every new token, and at that point you might as well run the whole inference pass on the CPU.
E.g. the 30B 4-bit model does fit on a 24GB GPU, using some 16 GB or so of VRAM. With that you have room for a sequence length of maybe 600 tokens before memory starts running out. Quantizing half the parameters of the 60B model further down to two bits would leave you with basically no room for a prompt, let alone a response.
If your GPU has a direct path to a fast SSD, the SSD's bandwidth could pad the memory overflow. I calculated that on my M1 Max MBP, it could maintain the theoretical minimum latency with a few hundred MB of overfill.
If you're training, you can afford slightly more overfill by using larger batch sizes. That approach would get more information learned in the same time, given fixed memory pressure.
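As a rough sanity check on that overfill budget (both figures below are assumptions for illustration, not measurements): whatever doesn't fit in memory has to be re-read from the SSD once per generated token, so the sustainable overfill is roughly SSD read bandwidth divided by token rate.

```python
# How much weight data can live on the SSD without adding latency, assuming it
# must be streamed in once per generated token. Both numbers are assumptions.
ssd_read_bandwidth = 5e9   # bytes/s - plausible for a fast NVMe / Apple SSD
tokens_per_second = 8      # assumed generation speed

max_overfill = ssd_read_bandwidth / tokens_per_second
print(f"~{max_overfill / 1e6:.0f} MB of overfill per token")   # ~625 MB
```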
> but to run inference you need quite a bit of VRAM on top for the key/value cache
Can you quantify how large the cache is (in bytes)? I just need concrete numbers, accurate within a factor of ~0.5-2.0.
In elements:
2 * num_layers * batch_size * num_attn_heads * key_value_dim * seq_len = 2 * num_layers * batch_size * hidden_dim * seq_len
So for half precision, multiply the whole thing by two again. The 60B model has 80 layers and a hidden dimension of 8192, for reference, so it should work out to 2560 kB per token if my math is right. (EDIT: Looked at the 30B numbers before. This should be correct now.)
That's the theoretical minimum amount of data (not counting any overhead from stride etc.) that you would have to pass between iterations if you want anything resembling speed.
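Or as a small sketch of the same arithmetic (fp16 cache, batch size 1, no allocator overhead):

```python
def kv_cache_bytes(num_layers, hidden_dim, seq_len, batch_size=1, bytes_per_elem=2):
    # 2x for keys and values; num_attn_heads * key_value_dim == hidden_dim.
    return 2 * num_layers * batch_size * hidden_dim * seq_len * bytes_per_elem

# "60B" (LLaMA-65B) config: 80 layers, hidden dim 8192.
print(kv_cache_bytes(80, 8192, 1) / 1024)          # 2560.0 KiB per token
print(kv_cache_bytes(80, 8192, 2048) / 1024**3)    # 5.0 GiB at the full 2048 context
```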
As for the computation itself, I guess it depends. I'm looking into it in depth right now because I want to try to achieve a higher sequence length with the 30B model on a 24 GB GPU, and I'm not convinced the Transformers implementation is all that efficient with VRAM usage.
Regardless, there's for sure some intermediate processing you also need to take into account: the query and key matrices are multiplied to produce the attention score matrix, which scales quadratically with the sequence length. That's quite a few big matrices that have to exist in VRAM at the same time.
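For a sense of scale on the quadratic term, here's a sketch assuming the scores for one layer are materialized in fp16 with no fused/chunked attention kernel (the 52-head figure is the LLaMA-30B value and is an assumption on my part):

```python
# Attention score matrix for one layer, batch size 1, materialized in fp16.
num_heads = 52        # LLaMA-30B (assumed); 64 for the 65B model
bytes_per_elem = 2

def score_matrix_bytes(seq_len):
    return num_heads * seq_len * seq_len * bytes_per_elem

print(score_matrix_bytes(600) / 1e6)    # ~37 MB per layer
print(score_matrix_bytes(2048) / 1e6)   # ~436 MB per layer
```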
In any case streaming to an SSD sounds both much too slow and like a good way to burn out an SSD. (?)
> In any case streaming to an SSD sounds both much too slow and like a good way to burn out an SSD. (?)
The SSD only burns out if you perform enough write operations - NAND cells are rated for roughly 1,000-3,000 program/erase cycles over their lifetime. That says nothing about durability against read operations.
The workflow would use something like Metal fast resource loading, DirectStorage, or GPUDirect. You carefully set up a streaming workflow where two of the 60 layers are not held in memory. For each token in the sequence, you load the 29th layer (L29) while L0-L28 are executing. Then you execute L29, discard its weights, and start loading L59, which should also arrive just in time, after L30-L58 finish executing.
This is an incremental gain, but it could be the difference between "fits" and "doesn't fit" for a particular model. You can also page (L28, L29) <-> (L58, L59) or (L27, L28, L29) <-> (L57, L58, L59) to save more memory. Divide SSD bandwidth by memory bandwidth and you get the ideal proportion of paged layers; the examples in this paragraph are (2 + 2) / 60 and (3 + 3) / 60 respectively.
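A sketch of that ratio (bandwidth numbers below are assumptions for illustration): if token generation is memory-bandwidth-bound, each resident layer's execution time is roughly its weight size divided by memory bandwidth, and the SSD has that long to deliver the paged layers.

```python
# Ideal fraction of layers to page from the SSD, assuming memory-bandwidth-bound
# generation. Both bandwidth figures are assumptions for illustration.
ssd_bandwidth = 6e9     # bytes/s - fast NVMe / Apple SSD over a direct path
mem_bandwidth = 400e9   # bytes/s - GPU or unified memory bandwidth
num_layers = 60

paged_fraction = ssd_bandwidth / mem_bandwidth   # good approximation when small
print(f"{paged_fraction:.1%} of the weights -> ~{paged_fraction * num_layers:.1f} of {num_layers} layers")
```

With these particular numbers only about one layer fits the budget; a faster direct-storage path (or a slower GPU) moves the ratio toward the (2 + 2) / 60 example.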
Silly me, I was thinking about swapping state in and out of VRAM. Of course you meant streaming just weights, which would be read only. I can't see why that wouldn't work.
As for the practical memory requirements, I did a few tests on 30B, measuring max memory allocation for a single inference step on different context lengths:
(before inf.): 17562.44 MB
256 tokens: 17677.23 MB
512 tokens: 18584.57 MB
768 tokens: 18698.00 MB
1024 tokens: 18810.15 MB
1280 tokens: 18922.09 MB
1536 tokens: 19034.63 MB
1792 tokens: 19149.66 MB
2048 tokens: 19260.77 MB
There's an odd bump at around 500 tokens, which I can only think has to do with PyTorch switching to a different memory allocation strategy for tensors over a certain size. I need to investigate that. In the meantime, I did it again in finer steps just to confirm, and it came out looking like this:
https://i.imgur.com/6nWWIYA.png
It's very odd. But in any case, it seems that with the model as is, and with a simple cached forward pass, you would need an extra 1.7 GB of available VRAM on top of the weights to make use of the max sequence length of LLaMA-30B. I can't run the 60B model, but I would expect it to take up 64% more space (going by the layer count and hidden dim).
Beam search would of course increase it a lot.
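For anyone who wants to reproduce the numbers above, here's a minimal sketch of the measurement (it assumes `model` is an already-loaded quantized LLaMA-30B checkpoint on the GPU; it's not the exact script I used):

```python
import torch

# Peak VRAM for a single cached forward pass at increasing context lengths.
# `model` is assumed to be an already-loaded (quantized) LLaMA checkpoint on cuda:0.
for seq_len in range(256, 2049, 256):
    input_ids = torch.randint(0, 32000, (1, seq_len), device="cuda")
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(input_ids, use_cache=True)
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    print(f"{seq_len} tokens: {peak_mb:.2f} MB")
```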
There are currently no plans to support any quantization other than GPTQ. Also, in my experience so far, 4-bit quantization has been the most efficient.