GPTQ-for-LLaMa
Request: Optional non-CUDA version
Amazing work! Thank you so much for sharing this.
Despite my attempts, I wasn't able to replicate the quantization functions without CUDA. It would be hugely helpful if users could use AMD or Apple Silicon GPUs too (both already have PyTorch support, e.g. the 'mps' backend for Apple Silicon).
Apple Silicon may seem like an odd option, but unified memory makes these machines some of the only options for high-memory inference on consumer hardware. For example, it is possible to get up to 64 GB of GPU-accessible memory on a Mac Studio.
Any code changes or advice to achieve this would be sincerely appreciated!
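For reference, detecting the available PyTorch backend is the easy part (a minimal sketch; the hard part is the 4-bit matmul itself, which is currently a CUDA extension, so 'mps' would still need a fallback path):

```python
import torch

# Minimal sketch: pick the best available backend, falling back to CPU.
# 'pick_device' is a hypothetical helper name, not part of this repo.
def pick_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple Silicon (PyTorch >= 1.12)
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
x = torch.randn(4, 4, device=device)
print(device, (x @ x).sum().item())
```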
This relies on custom extensions like the CUDA kernel to get the speed and memory benefits. Are similar features supported on AMD or Apple Silicon GPUs, by any chance?
Apple recently added good quantization functionality to MPSGraph. I might be able to string something together with PythonKit, using the Swift MPSGraph API alongside whatever Python/PyTorch utilities you have created. MPSGraph is an MLIR-based compiler that lowers API calls down to optimized kernels or even generated shader code; the latter may enable fast O(n) chained operators without hand-writing shaders. It runs fastest when you avoid eager execution (as in PyTorch): build the entire model graph beforehand, then execute it.
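For intuition, the same "build the graph first, run it later" pattern exists on the PyTorch side via tracing (this is only a Python analogy, not MPSGraph itself, and the tiny module below is a hypothetical stand-in for one layer):

```python
import torch

# Hypothetical stand-in for one small layer of the model.
class TinyLayer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyLayer().eval()
example = torch.randn(1, 8)

# Capture the whole computation as a graph once...
traced = torch.jit.trace(model, example)

# ...then execute the pre-built graph instead of eager, op-by-op dispatch.
with torch.no_grad():
    print(traced(torch.randn(1, 8)).shape)
```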
Apple Silicon has a large unified memory coupled with a very high-bandwidth SSD (up to 7.4 GB/s). The Metal API can page unused parts of RAM to disk, and PyTorch uses that mechanism internally to expand the usable memory to ~1.5x RAM. I have a 32 GB M1 Max with 8.46 TFLOPS effective.
@qwopqwop200 I'd like to run this model for fun, but am not that knowledgeable about the model itself. I've gotten quite skilled at using ChatGPT for research and have prior experience with ML. Can we get this running and make it accessible to Apple silicon users?
This seems related: https://github.com/ggerganov/llama.cpp
Yes, but that only uses the AMX. The GPU has more bandwidth and compute power, and it accelerates the actual dequantization. Un-batched inference is dominated by O(n^2) matrix-vector work, while the AMX only helps with O(n^3) matrix-matrix problems. For models that don't exceed RAM, the GPU might provide a massive speedup; for models that do, Metal provides very optimized utilities for streaming parameters from the SSD.
I made a prototype: https://gist.github.com/philipturner/4ad866cf537daaedc033acf18e29d65d
Would someone mind checking my results against this repository's implementation?
Results
Expected decompressed weights (UInt4 -> UInt8):
Row 0: [6, 0, 7, 12, 4, 8, 8, 15]
Row 1: [0, 2, 7, 12, 12, 8, 14, 8]
Row 2: [12, 6, 2, 6, 14, 12, 6, 0]
Row 3: [0, 3, 13, 12, 2, 2, 1, 5]
Row 4: [10, 2, 6, 13, 7, 8, 8, 6]
Row 5: [6, 1, 0, 12, 2, 0, 4, 3]
Row 6: [9, 1, 11, 3, 9, 15, 9, 9]
Row 7: [8, 2, 3, 5, 15, 14, 0, 6]
Actual decompressed weights (UInt4 -> UInt8):
Row 0: [6, 0, 7, 12, 4, 8, 8, 15]
Row 1: [0, 2, 7, 12, 12, 8, 14, 8]
Row 2: [12, 6, 2, 6, 14, 12, 6, 0]
Row 3: [0, 3, 13, 12, 2, 2, 1, 5]
Row 4: [10, 2, 6, 13, 7, 8, 8, 6]
Row 5: [6, 1, 0, 12, 2, 0, 4, 3]
Row 6: [9, 1, 11, 3, 9, 15, 9, 9]
Row 7: [8, 2, 3, 5, 15, 14, 0, 6]
Scales for dequantization (UInt8 -> Float32):
Row 0: [0.74349636, 1.465982]
Row 1: [-0.41625232, -0.77464366]
Row 2: [-0.113390885, 1.3622935]
Row 3: [-0.07478706, -0.15279876]
Row 4: [-0.76195055, 1.0117782]
Row 5: [0.4978803, 0.75568324]
Row 6: [0.32307047, 0.64390826]
Row 7: [0.07462493, -0.042125724]
Zeroes for dequantization (UInt8 -> Float32):
Row 0: [1.1470629, 0.7553195]
Row 1: [0.47549996, 0.9139278]
Row 2: [1.0248985, 0.7267543]
Row 3: [-0.8569653, 1.1718377]
Row 4: [-0.058334544, -0.60754406]
Row 5: [-1.3902608, 0.9882344]
Row 6: [0.16975856, -0.41811433]
Row 7: [0.32190362, -0.16722564]
Input (Float32):
[0.7056554, -0.40496466, 1.4785147, -0.26150256, 1.1333572, -1.0250733, 1.3391358, -1.0180573]
Output (Float32):
[2.6232612, -1.5054487, 5.4963512, -0.972131, 4.213235, -3.8106914, 4.978213, -3.7846098]
There are currently no plans to support anything other than Nvidia GPUs. Please use llama.cpp for CPU.
For those who want to run this on AMD GPUs, you can use the relevant parts of this guide. This works on my RX 6800 using oobabooga's web UI; I don't know if quantization is fully supported.
It uses ROCm, which means that it is Linux only, I think.
Almost anything written in CUDA can also run on AMD server-class GPUs through their CUDA -> HIP transpiler. But other vendors, such as Intel, Apple, and Qualcomm, don't have a direct transpiler.
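On the PyTorch side, you can see this in practice (a sketch, assuming a ROCm build of PyTorch is installed): AMD GPUs keep the familiar 'cuda' device name after HIP translation, so most high-level code runs unchanged, and it is mainly the custom kernels that need porting.

```python
import torch

# On a ROCm build of PyTorch, the AMD GPU is still exposed through the
# 'cuda' device namespace; torch.version.hip is set instead of being None.
if torch.cuda.is_available():
    backend = "HIP/ROCm" if torch.version.hip else "CUDA"
    print(backend, torch.cuda.get_device_name(0))
    x = torch.randn(1024, 1024, device="cuda")
    print((x @ x).norm().item())
else:
    print("No CUDA/ROCm device visible to PyTorch.")
```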