GGUF support

jbochi opened this pull request 1 year ago • 11 comments

Proposed changes

This adds GGUF support using the excellent gguflib from @antirez.

Would there be interest in this? GGUF is currently very popular for local inference, and there are tons of models available in this format.

This can already load tensors from gguf files, but it automatically casts quantized tensors to float32. It also supports serialization of non-quantized weights. (A rough usage sketch follows the list below.)

A few things are missing:

  • [x] Support to load and save other data types
  • [ ] Support to load (and save?) key/value metadata
  • [x] Python API
  • [ ] Read from and save to a file stream
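
For concreteness, here is a rough usage sketch of what this enables; the names (mx.load dispatching on the .gguf extension, mx.save_gguf) are assumptions about the eventual Python API rather than the final interface.

```python
import mlx.core as mx

# Hypothetical usage: load a GGUF file into a dict of name -> mx.array.
# Quantized tensors currently come back as float32.
weights = mx.load("model.gguf")
print(len(weights), "tensors loaded")

# Serialization of non-quantized weights back to GGUF.
mx.save_gguf("copy.gguf", weights)
```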

Checklist

Put an x in the boxes that apply.

  • [x] I have read the CONTRIBUTING document
  • [x] I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [x] I have updated the necessary documentation (if needed)

jbochi avatar Jan 03 '24 13:01 jbochi

That's great! Thank you so much.

antirez avatar Jan 03 '24 13:01 antirez

Wow awesome!!

Would there be interest in this?

Definitely.

Is it possible to fetch and build a version of the library with CMake rather than include the whole thing in the source code directly?

awni avatar Jan 03 '24 14:01 awni

Is it possible to fetch and build a version of the library with CMake rather than include the whole thing in the source code directly?

I'm new to CMake, but it should be possible. I managed to make it download the library and compile gguf.cpp, but I was having some issues with linking. I'll give it another shot.

jbochi avatar Jan 03 '24 14:01 jbochi

Definitely.

Awesome. Then it's time to implement the missing quantization formats in gguflib :)

antirez avatar Jan 03 '24 14:01 antirez

@awni, CMake now fetches gguflib from GitHub, so we don't need a copy here.

What's the best path to land this? Do you prefer one big PR with as many features as possible or a small one?

jbochi avatar Jan 03 '24 16:01 jbochi

What's the best path to land this? Do you prefer one big PR with as many features as possible or a small one?

Your checklist looks reasonable to me. If we make it through that, I think we could land this. Are you referring to other features beyond that checklist? If there is a good reason to put any of them in a follow-on PR, that is fine too.

But, for example, we should not cast to float32 by default; that's inefficient. I would include types in this PR.

awni avatar Jan 03 '24 16:01 awni

It's working now! This PR shows it can run tinyllama.

Unfortunately, I'll not have much time in the next couple of weeks to continue this. Here's what is missing:

  • All quantized weights are cast to float32; non-quantized weights are copied with memcpy.
    • Is there a plan to create new dtypes for quantization? Should we simply cast to bfloat16 or float16 instead for now?
  • No support for key/value metadata, so we still need separate files for the config and tokenizer.
    • What's the preferred API here? We could create a separate function that returns all metadata as a dictionary.
  • No support for reading and writing to a stream, just to paths as strings. This may require significant changes to gguflib. At the moment, the whole file is mapped.

jbochi avatar Jan 04 '24 20:01 jbochi

No support for key/value metadata, so we still need separate files for the config and tokenizer.

  • What's the preferred API here? We could create a separate function that returns all metadata as a dictionary.

I have updated the example to use the official gguf python package (source is in llama.cpp) to read the config and the tokenizer directly from the gguf file.
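
For reference, a minimal sketch of what reading metadata with the gguf package looks like; treat it as illustrative, since field-access details differ between package versions.

```python
from gguf import GGUFReader

reader = GGUFReader("model.gguf")

# Key/value metadata (architecture, tokenizer vocab, etc.) is exposed as
# named fields; tensor descriptors are listed separately.
for name in list(reader.fields)[:10]:
    print(name)
print(len(reader.tensors), "tensors")
```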

I think this may already be useful for some people, so perhaps we can merge it as is and do any improvements later. What do you think?

jbochi avatar Jan 08 '24 16:01 jbochi

I think this may already be useful for some people, so perhaps we can merge it as is and do any improvements later. What do you think?

I will take a look!

Regarding this PR, can you give a quick status update since your last one? I can jump in possibly later today and pick up from whatever needs work.

awni avatar Jan 08 '24 16:01 awni

I will take a look!

Thanks!

Regarding this PR, can you give a quick status update since your last one?

Certainly. Everything I said in https://github.com/ml-explore/mlx/pull/350#issuecomment-1877692745 is still valid. The only difference is that I no longer think reading and writing the metadata should be a blocker. As mentioned in https://github.com/ml-explore/mlx/pull/350#issuecomment-1881435972, we can already run the model from the gguf file alone. There's no need for a separate config.json or tokenizer.model.

I can jump in possibly later today and pick up from whatever needs work.

I just quit my job and don't have access to a computer with apple silicon at the moment, so feel free to pick this up :)

jbochi avatar Jan 08 '24 16:01 jbochi

Pardon me if I add that I follow the local LLM scene closely on Reddit and other places: people look at MLX with huge interest, and for many the main blocker is GGUF support. This will be a huge boost and will create an alternative ecosystem to the (very good) llama.cpp that will benefit everybody. Thanks.

antirez avatar Jan 08 '24 18:01 antirez

Sent a few diffs, here. It looks in great shape to me.

One thing I find inconsistent with the rest of our save routines is that the open mode is wx as opposed to w. I wonder if we should try to find a workaround for that..
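
For context, a tiny Python illustration of the distinction (not the gguflib C code): exclusive-create refuses to touch an existing file, while plain write mode truncates it, which is what the other save routines do.

```python
path = "weights.gguf"

open(path, "w").close()      # "w": creates the file or truncates it

try:
    open(path, "x").close()  # "wx"-style exclusive create: fails if it exists
except FileExistsError:
    print("refusing to overwrite", path)
```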

awni avatar Jan 09 '24 05:01 awni

Sent a few diffs, here. It looks in great shape to me.

@awni , Thanks for having a look and for making the improvements!

One thing I find inconsistent with the rest of our save routines is that the open mode is wx as opposed to w. I wonder if we should try to find a workaround for that..

I've proposed a new function that creates the file with overwrite semantics in https://github.com/antirez/gguf-tools/pull/7 and made the change here to use it.

Let's see what @antirez thinks of the API.

jbochi avatar Jan 09 '24 13:01 jbochi

@awni I've pushed the changes to use the new gguflib api. Please take a look 🙏 . Thanks @antirez!

jbochi avatar Jan 09 '24 15:01 jbochi

I have a couple more questions about the functionality:

Right now we don't support loading any quantized formats natively; we default to loading fp32. I'm wondering if that's the right call 🤔 . If a model has already been quantized, most people would at the very least want fp16 (and of course native quantization support).

I wonder if we should expose that behavior as a config option? Or maybe default to fp16 for quantized formats? Or something else?

Also I'm wondering what it would take to support 4-bit/8-bit quantization natively. That would make this a much more useful addition. I am not familiar with the GGUF quantizers, but if at least one of the formats matches our linear quantization scheme, it should be relatively straightforward. I don't think we need to block the merge on this, but it would be good to know what the options are at least.

awni avatar Jan 09 '24 16:01 awni

@antirez do you happen to know where to get more details on the meaning of the quantization formats? It's quite difficult to interpret these acronyms in the GGUF docs:

MOSTLY_Q4_0 = 2
MOSTLY_Q4_1 = 3
MOSTLY_Q4_1_SOME_F16 = 4
MOSTLY_Q4_2 = 5 (support removed)
MOSTLY_Q4_3 = 6 (support removed)
MOSTLY_Q8_0 = 7
MOSTLY_Q5_0 = 8
MOSTLY_Q5_1 = 9
MOSTLY_Q2_K = 10
MOSTLY_Q3_K_S = 11
MOSTLY_Q3_K_M = 12
MOSTLY_Q3_K_L = 13
MOSTLY_Q4_K_S = 14
MOSTLY_Q4_K_M = 15
MOSTLY_Q5_K_S = 16
MOSTLY_Q5_K_M = 17
MOSTLY_Q6_K = 18

awni avatar Jan 09 '24 16:01 awni

@awni one of the main contributions I tried to provide with gguflib was documenting the format. AFAIK the only description is in the comments of gguflib here: https://github.com/antirez/gguf-tools/blob/main/gguflib.c#L529

There you will find very detailed explanations of the exact data layout. I obtained those descriptions by reading the gguf code in llama.cpp, then re-implemented the functions in simpler (probably slower) forms.

antirez avatar Jan 09 '24 16:01 antirez

There you will find very detailed explanations of the exact data layout.

This is exactly what I am looking for thank you 🙏

awni avatar Jan 09 '24 16:01 awni

P.S. If the goal is to show the final user the actual quantization of the GGUF being loaded, I would not put much trust in the "MOSTLY_" defines stated in the GGUF file headers. Perhaps it's better to scan the whole set of weights with gguflib and report the quantization as percentages, like 82% Q4_K, 5% F16, ..., since there is often some mismatch, and the "mostly" label does not give exact information when it is important to know what was quantized and what was not.
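
A rough sketch of that scan using the gguf Python package; attribute names such as tensor_type and n_elements are taken from recent package versions and may differ.

```python
from collections import Counter
from gguf import GGUFReader

def quant_breakdown(path):
    reader = GGUFReader(path)
    counts, total = Counter(), 0
    for t in reader.tensors:
        n = int(t.n_elements)            # parameters in this tensor
        counts[t.tensor_type.name] += n  # e.g. "Q4_K", "F16"
        total += n
    # Share of parameters stored in each quantization type, in percent.
    return {name: 100.0 * n / total for name, n in counts.items()}
```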

antirez avatar Jan 09 '24 16:01 antirez

I wonder if we should expose that behavior as a config option? Or maybe default to fp16 for quantized formats? Or something else?

Defaulting to fp16 makes sense to me. I considered implementing it, but gguflib only supports casting to fp32, and casting twice is not ideal.

Also I'm wondering what it would take to support 4-bit/8-bit quantization natively. That would make this a much more useful addition. I am not familiar with the GGUF quantizers, but if at least one of the formats matches our linear quantization scheme, it should be relatively straightforward.

One issue here is that MLX layers are quantization-aware, whereas in GGUF the tensors themselves are quantization-aware, which seems like a more general approach to me.

Would it make sense to introduce new dtypes for different quantization algorithms?

jbochi avatar Jan 09 '24 16:01 jbochi

If you want a to_f16 method, I can do it right now.

antirez avatar Jan 09 '24 16:01 antirez

If you want a to_f16 method, I can do it right now.

I think that would be great!

jbochi avatar Jan 09 '24 17:01 jbochi

Would it make sense to introduce new dtypes for different quantization algorithms?

We thought about this when implementing our quantization and decided against it. That level of type proliferation does not work well with MLX...

One issue here is that MLX layers are quantization-aware, whereas in GGUF the tensors themselves are quantization-aware, which seems like a more general approach to me.

I think we could solve this by breaking the GGUF tensor into the appropriate MLX tensors. For example, our quantized matmul uses three tensors (matrix + scales + biases); the equivalent version would presumably be packed into a single GGUF tensor. There would of course be some work required at the top level to make sure you are using the right quantization layers etc., but that should be simple enough using a config.
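
As a hedged illustration of the "three tensors" side in MLX (the GGUF unpacking itself is not shown here):

```python
import mlx.core as mx

w = mx.random.normal((512, 512))

# MLX keeps quantization state as three separate arrays: packed weights
# plus per-group scales and biases.
wq, scales, biases = mx.quantize(w, group_size=64, bits=4)

x = mx.random.normal((1, 512))
y = mx.quantized_matmul(x, wq, scales, biases, transpose=True,
                        group_size=64, bits=4)
print(y.shape)  # (1, 512)
```

A GGUF loader would have to split the packed per-block layout into these three arrays.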

awni avatar Jan 09 '24 17:01 awni

We thought about this when implementing our quantization and decided against it.

Got it, thanks.

I think we could solve this by breaking the GGUF tensor into the appropriate MLX tensors. For example, our quantized matmul uses three tensors (matrix + scales + biases); the equivalent version would presumably be packed into a single GGUF tensor.

Sounds like a viable path!

From antirez's implementation, I think Q8_0 is compatible with this (raw weights, scales, biases) scheme. The other types have super-blocks and sub-blocks, with scales that are also quantized.

We could make gguf.c automatically split compatible quantization strategies into multiple tensors, but we may need to store metadata such as the block size somewhere (perhaps another tensor of size [1]?). From that, it shouldn't be hard to initialize the equivalent QuantizedLinear layers.
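
For illustration, a small numpy sketch of the Q8_0 block layout as documented in gguflib's comments (one float16 scale followed by 32 int8 quants per block); this is a reading of that documentation, not code from this PR.

```python
import numpy as np

Q8_0_BLOCK = 32  # values per block

def dequant_q8_0(raw: bytes) -> np.ndarray:
    # Each block stores a float16 scale `d` and 32 int8 quants `q`;
    # the dequantized values are d * q, so the block maps cleanly onto
    # a (weights, scales) pair with no biases.
    blk = np.dtype([("d", "<f2"), ("q", "i1", (Q8_0_BLOCK,))])
    blocks = np.frombuffer(raw, dtype=blk)
    return (blocks["d"].astype(np.float32)[:, None] * blocks["q"]).reshape(-1)
```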

jbochi avatar Jan 09 '24 17:01 jbochi

PS: If I'm reading this correctly, Q4_0 and Q4_1 are also compatible (Q4_0 has scales and no biases, Q4_1 has both).

jbochi avatar Jan 09 '24 17:01 jbochi

Could not test (I have yet to generate meaningful GGUF files for testing), but here it is -> https://github.com/antirez/gguf-tools/commit/eec3dc9f544e818c39ba3f3500c7a7e8eb6e29c9

antirez avatar Jan 09 '24 17:01 antirez

I tried to change the code so that, in theory, we could later add methods to dequantize directly to formats supported by MLX or other frameworks. There is basically an optional callback, and the functions can store directly to the target type in one pass. However, as it currently stands, since those formats are normally organized in blocks, the callback would be required to do some math with the "idx" to understand where it is in the block, but that shouldn't be too hard.

antirez avatar Jan 09 '24 17:01 antirez

PS: If I'm reading this correctly, Q4_0 and Q4_1 are also compatible (Q4_0 has scales and no biases, Q4_1 has both).

It looks like it. I haven't implemented those yet, but if there aren't problems with the way the quants are ordered, it's probably compatible. However, there is always the option to dequantize them directly to an MLX-supported quantization format.

antirez avatar Jan 09 '24 18:01 antirez

This is great, @antirez ! Thank you. With the callback, we can easily cast to bfloat16, which may be a better default for MLX.

jbochi avatar Jan 09 '24 18:01 jbochi

This is great, @antirez ! Thank you. With the callback, we can easily cast to bfloat16, which may be a better default for MLX.

Sure! I'll try to add bfloat16 as well.

antirez avatar Jan 09 '24 19:01 antirez