GGUF support
Proposed changes
This adds GGUF support using the excellent gguflib from @antirez.
Would there be interest in this? GGUF is currently very popular for local inference, and there are tons of models available in this format.
This can already load tensors from GGUF files, though quantized tensors are automatically cast to float32. It also supports serialization of non-quantized weights.
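Here's a rough sketch of how the Python side is meant to be used (a sketch only: the placeholder path and the `save_gguf` name follow the existing npz/safetensors helpers and may still change during review):

```python
import mlx.core as mx

# Load all tensors from a GGUF file (placeholder path); quantized tensors
# are currently dequantized to float32 on load.
weights = mx.load("model.gguf")

# Save a dict of non-quantized arrays back to GGUF.
mx.save_gguf("exported.gguf", {"w": mx.zeros((16, 16), dtype=mx.float16)})
```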
A few things are missing:
- [x] Support to load and save other data types
- [ ] Support to load (and save?) key/value metadata
- [x] Python API
- [ ] Read from and save to a file stream
Checklist
Put an x in the boxes that apply.
- [x] I have read the CONTRIBUTING document
- [x] I have run `pre-commit run --all-files` to format my code / installed pre-commit prior to committing changes
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] I have updated the necessary documentation (if needed)
That's great! Thank you so much.
Wow awesome!!
Would there be interest in this?
Definitely.
Is it possible to fetch and build a version of the library with CMake rather than include the whole thing in the source code directly?
Is it possible to fetch and build a version of the library with CMake rather than include the whole thing in the source code directly?
I'm new to CMake, but it should be possible. I managed to make it download the library and compile gguf.cpp, but I was having some issues with linking. I'll give it another shot.
Definitely.
Awesome. Then it's time to implement the missing quantization formats in gguflib :)
@awni, CMake now fetches gguflib from GitHub, so we don't need a copy here.
What's the best path to land this? Do you prefer one big PR with as many features as possible or a small one?
What's the best path to land this? Do you prefer one big PR with as many features as possible or a small one?
Your check list looks reasonable to me. If we make it through that I think we could land this. Are you referring to other features beyond that check list? If there is a good reason to put any of them in a follow on PR that is fine too.
But, for example, we should not cast to float32 by default; that's inefficient. I would include types in this PR.
It's working now! This PR shows it can run tinyllama.
Unfortunately, I'll not have much time in the next couple of weeks to continue this. Here's what is missing:
- All quantized weights are cast to float32. Non-quantized weights are copied directly (memcpy).
- Is there a plan to create new dtypes for quantization? Should we simply cast to bfloat16 or float16 instead for now?
- No support for key/value metadata, so we still need separate files for the config and tokenizer.
- What's the preferred API here? We could create a separate function that returns all metadata as a dictionary.
- No support for reading and writing to a stream, just to paths as strings. This may require significant changes to gguflib. At the moment, the whole file is mapped.
No support for key/value metadata, so we still need separate files for the config and tokenizer.
- What's the preferred API here? We could create a separate function that returns all metadata as a dictionary.
I have updated the example to use the official gguf python package (source is in llama.cpp) to read the config and the tokenizer directly from the gguf file.
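For reference, the metadata part boils down to something like this (a rough sketch, not the exact code from the example; the path is a placeholder and GGUFReader's attributes may differ slightly between package versions):

```python
from gguf import GGUFReader

reader = GGUFReader("model.gguf")  # placeholder path

# The key/value metadata carries what config.json and tokenizer.model would
# otherwise provide, e.g. "llama.embedding_length", "tokenizer.ggml.tokens".
for name in reader.fields:
    print(name)
```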
I think this may already be useful for some people, so perhaps we can merge it as is and do any improvements later. What do you think?
I think this may already be useful for some people, so perhaps we can merge it as is and do any improvements later. What do you think?
I will take a look!
Regarding this PR, can you give a quick status update since your last one? I can jump in possibly later today and pick up from whatever needs work.
I will take a look!
Thanks!
Regarding this PR, can you give a quick status update since your last one?
Certainly. Everything I said in https://github.com/ml-explore/mlx/pull/350#issuecomment-1877692745 is still valid. The only difference is that I no longer think reading and writing the metadata should be a blocker. As mentioned in https://github.com/ml-explore/mlx/pull/350#issuecomment-1881435972, we can already run the model from the GGUF file alone; there's no need for a separate config.json or tokenizer.model.
I can jump in possibly later today and pick up from whatever needs work.
I just quit my job and don't have access to a computer with Apple silicon at the moment, so feel free to pick this up :)
Pardon me if I add that I closely follow the local LLM scene on Reddit and elsewhere: people look at MLX with huge interest, and for many the main blocker is GGUF support. This will be a huge boost and will create an alternative ecosystem to the (very good) llama.cpp that benefits everybody. Thanks.
Sent a few diffs here. It looks in great shape to me.
One thing I find inconsistent with the rest of our save routines is that the open mode is `wx` as opposed to `w`. I wonder if we should try to find a workaround for that...
Sent a few diffs here. It looks in great shape to me.
@awni, thanks for having a look and for making the improvements!
One thing I find inconsistent with the rest of our save routines is that the open mode is `wx` as opposed to `w`. I wonder if we should try to find a workaround for that...
I've proposed a new create-with-overwrite function in https://github.com/antirez/gguf-tools/pull/7 and made the change here to use it.
Let's see what @antirez thinks of the api.
@awni I've pushed the changes to use the new gguflib api. Please take a look 🙏 . Thanks @antirez!
I have a couple more questions about the functionality:
Right now we don't support loading any quantized formats natively; we just default to loading fp32. I'm wondering if that's the right call 🤔. If a model has already been quantized, most people would at the very least want fp16 (and of course native quantization support).
I wonder if we should expose that behavior as a config option? Or maybe default to fp16 for quantized formats? Or something else?
Also I'm wondering what it would take to support 4-bit/8-bit quantization natively. That would make this a much more useful addition. I am not familiar with the GGUF quantizers, but maybe at least one of the formats matches our linear quantization scheme, in which case it should be relatively straightforward. I don't think we need to block the merge on this, but it would be good to know what the options are at least.
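For reference, this is the scheme I mean, using the existing quantize/dequantize ops; a GGUF format would be a direct match if its blocks reduce to packed low-bit integers plus a per-group scale and bias:

```python
import mlx.core as mx

w = mx.random.normal((512, 512))

# Our affine scheme: per group of `group_size` weights along the last axis,
# w ≈ scales * w_q + biases, with w_q packed into `bits`-bit integers.
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=4)
print(mx.abs(w - w_hat).max())  # small quantization error
```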
@antirez do you happen to know where to get more details on the meaning of the quantization formats? It's quite difficult to interpret these acronyms in the GGUF docs:
MOSTLY_Q4_0 = 2
MOSTLY_Q4_1 = 3
MOSTLY_Q4_1_SOME_F16 = 4
MOSTLY_Q4_2 = 5 (support removed)
MOSTLY_Q4_3 = 6 (support removed)
MOSTLY_Q8_0 = 7
MOSTLY_Q5_0 = 8
MOSTLY_Q5_1 = 9
MOSTLY_Q2_K = 10
MOSTLY_Q3_K_S = 11
MOSTLY_Q3_K_M = 12
MOSTLY_Q3_K_L = 13
MOSTLY_Q4_K_S = 14
MOSTLY_Q4_K_M = 15
MOSTLY_Q5_K_S = 16
MOSTLY_Q5_K_M = 17
MOSTLY_Q6_K = 18
@awni one of the main contributions I tried to provide with gguflib was documenting the format. AFAIK the only description is in the comments of gguflib here: https://github.com/antirez/gguf-tools/blob/main/gguflib.c#L529
There you will find very detailed explanations of the exact data layout. I obtained those descriptions by reading the gguf code in llama.cpp, then re-implemented the functions in simpler (probably slower) forms.
There you will find very detailed explanations of the exact data layout.
This is exactly what I am looking for, thank you 🙏
P.S. If the goal is to show the user the actual quantization of the GGUF model being loaded, I would not put much trust in the "MOSTLY_" defines stated in the GGUF file headers. It's probably better to scan the whole set of weights with gguflib and report the quantization as percentages, like 82% Q4_K, 5% F16, ..., since there is often some mismatch, and the "mostly" label does not give exact information when it matters to know what was quantized and what was not.
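Just to illustrate, a quick sketch of that scan in Python using the `gguf` package from llama.cpp rather than gguflib (the path is a placeholder and the attribute names may vary a bit between package versions):

```python
from collections import Counter

import numpy as np
from gguf import GGUFReader

reader = GGUFReader("model.gguf")  # placeholder path

# Count elements per quantization type across all tensors...
counts = Counter()
for t in reader.tensors:
    counts[t.tensor_type.name] += int(np.prod(t.shape))

# ...and report the real breakdown, e.g. "82.0% Q4_K, 5.0% F16, ...".
total = sum(counts.values())
for qtype, n in counts.most_common():
    print(f"{100 * n / total:.1f}% {qtype}")
```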
I wonder if we should expose that behavior as a config option? Or maybe default to fp16 for quantized formats? Or something else?
Defaulting to fp16 makes sense to me. I considered implementing it, but gguflib only supports casting to fp32, and casting twice is not ideal.
Also I'm wondering what it would take to support 4-bit/8-bit quantization natively. That would make this a much more useful addition. I am not familiar with the GGUF quantizers, but maybe at least one of the formats matches our linear quantization scheme, in which case it should be relatively straightforward.
One issue here is that in MLX the layers are quantization-aware, whereas in GGUF the tensors themselves are, which seems like a more general approach to me.
Would it make sense to introduce new dtypes for different quantization algorithms?
If you want a to_f16 method, I can do it right now.
If you want a to_f16 method, I can do it right now.
I think that would be great!
Would it make sense to introduce new dtypes for different quantization algorithms?
We thought about this when implementing our quantization and decided against it. That level of type proliferation does not fit well with MLX...
One issue here is that in MLX the layers are quantization-aware, whereas in GGUF the tensors themselves are, which seems like a more general approach to me.
I think we could solve this by breaking the GGUF tensor into the appropriate MLX tensors. For example, our quantized matmul uses three tensors (matrix + scales + biases); the equivalent version would presumably be packed into a single GGUF tensor. There would of course be some work required at the top level to make sure you are using the right quantization layers, etc., but that should be simple enough using a config.
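For concreteness, here is what consuming that triple looks like with today's ops; the quantize call just stands in for unpacking a compatible GGUF tensor into the same three pieces:

```python
import mlx.core as mx

group_size, bits = 64, 4
x = mx.random.normal((1, 512))
w = mx.random.normal((1024, 512))

# The three tensors the quantized matmul consumes; a compatible GGUF tensor
# would be unpacked into exactly these pieces instead of calling quantize.
w_q, scales, biases = mx.quantize(w, group_size=group_size, bits=bits)
y = mx.quantized_matmul(
    x, w_q, scales, biases, transpose=True, group_size=group_size, bits=bits
)
print(y.shape)  # (1, 1024)
```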
We thought about this when implementing our quantization and decided against it.
Got it, thanks.
I think we could solve this by breaking the GGUF tensor into the appropriate MLX tensors. For example, our quantized matmul uses three tensors (matrix + scales + biases); the equivalent version would presumably be packed into a single GGUF tensor.
Sounds like a viable path!
From antirez's implementation, I think Q8_0 is compatible with this (raw weights, scales, biases) scheme. The other types have superblocks and sub-blocks, with scales that are also quantized.
We could make gguf.c automatically split compatible quantization strategies into multiple tensors, but we may need to store metadata such as the block size somewhere (perhaps another tensor of size [1]?). From there, it shouldn't be hard to initialize the equivalent QuantizedLinear layers.
PS: If I'm reading this correctly, Q4_0 and Q4_1 are also compatible (Q4_0 has scales and no biases, Q4_1 has both).
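To make the mapping concrete, here is a rough NumPy sketch of splitting raw Q8_0 data into per-block quants and scales, following the layout documented in gguflib (one fp16 scale `d` followed by 32 int8 quants per block, with x = d * q); how the signed quants and the block size of 32 line up with MLX's unsigned per-group scheme is exactly the part to verify:

```python
import numpy as np

Q8_0_BLOCK = 32  # values per Q8_0 block, per the gguflib docs

def split_q8_0(raw: bytes, n_elements: int):
    """Split raw Q8_0 data into per-block (quants, scales).

    Each block is one float16 scale `d` followed by 32 int8 quants,
    and dequantizes as x = d * q.
    """
    n_blocks = n_elements // Q8_0_BLOCK
    block = np.dtype([("d", "<f2"), ("q", "i1", (Q8_0_BLOCK,))])
    blocks = np.frombuffer(raw, dtype=block, count=n_blocks)
    q = blocks["q"]                     # (n_blocks, 32), signed int8
    d = blocks["d"].astype(np.float32)  # (n_blocks,) per-block scales
    # To match an unsigned (weights, scales, biases) layout, shift the quants
    # by 128 and fold the shift into the bias: x = d * (q + 128) - 128 * d.
    return q, d

# Q4_0 (x = d * (q - 8)) and Q4_1 (x = d * q + m) follow the same pattern,
# with -8 * d and m playing the role of the bias.
```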
Could not test (I have yet to generate meaningful GGUF files for testing), but here it is -> https://github.com/antirez/gguf-tools/commit/eec3dc9f544e818c39ba3f3500c7a7e8eb6e29c9
I tried to change the code so that, in theory, we could later add methods to dequantize directly to formats supported by MLX or other frameworks. There is basically an optional callback, so the functions can store directly to the target type in one pass. However, as it currently stands, since those formats are normally organized in blocks, the callback has to do some math with the "idx" to understand where it is in the block, but that shouldn't be too hard.
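In pseudocode, that index bookkeeping for a block-based format like Q8_0 amounts to this (the actual callback in gguflib is a C function; see the commit above for the real signature):

```python
BLOCK_SIZE = 32  # e.g. Q4_0/Q8_0 blocks hold 32 values

def locate(idx: int):
    # From the running element index, recover which block we are in and the
    # position inside it, so the callback can apply per-block information.
    block, offset = divmod(idx, BLOCK_SIZE)
    return block, offset
```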
PS: If I'm reading this correctly, Q4_0 and Q4_1 are also compatible (Q4_0 has scales and no biases, Q4_1 has both).
Looks like it. I haven't implemented those yet, but if there aren't problems with the way the quants are ordered, it's probably compatible. However, there is always the option to dequantize them directly to an MLX-supported quantization format.
This is great, @antirez ! Thank you. With the callback, we can easily cast to bfloat16, which may be a better default for MLX.
This is great, @antirez ! Thank you. With the callback, we can easily cast to bfloat16, which may be a better default for MLX.
Sure! I'll try to add bfloat16 as well.