Juarez Bochi
It's working now! This [PR](https://github.com/ml-explore/mlx-examples/pull/222) shows it can run tinyllama. Unfortunately, I won't have much time in the next couple of weeks to continue this. Here's what is missing: -...
> No support for key/value metadata, so we still need separate files for the config and tokenizer.
>
> * What's the preferred API here? We could create a separate function...
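To make the question concrete, here is one possible shape for that API. This is only a sketch: the `return_metadata` keyword and the exact key and tensor names below are assumptions, not anything this PR implements.

```python
import mlx.core as mx

# Hypothetical interface: one call returns both the tensors and the GGUF
# key/value metadata, so the config and tokenizer would no longer need
# separate files alongside the weights.
weights, metadata = mx.load("tinyllama.gguf", return_metadata=True)

# `weights` would map tensor names to mx.array; `metadata` would hold the
# GGUF key/value pairs (architecture, tokenizer vocabulary, etc.).
print(metadata.get("general.architecture"))
print(weights["token_embd.weight"].dtype)
```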
> I will take a look!

Thanks!

> Regarding this PR, can you give a quick status update since your last one?

Certainly. Everything I said in https://github.com/ml-explore/mlx/pull/350#issuecomment-1877692745 is...
> Sent a few diffs, here. It looks in great shape to me.

@awni, thanks for having a look and for making the improvements!

> One thing I find...
@awni I've pushed the changes to use the new gguflib API. Please take a look 🙏. Thanks @antirez!
> I wonder if we should expose that behavior as a config option? Or maybe default to fp16 for quantized formats? Or something else?

Defaulting to fp16 makes sense to...
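As a sketch of what that default could look like from the caller's side: the wrapper and its `dequant_dtype` argument are hypothetical, and this toy version simply casts every float32 tensor rather than only the ones that were stored quantized.

```python
import mlx.core as mx

def load_gguf(path, dequant_dtype=mx.float16):
    """Hypothetical wrapper: load GGUF weights and cast dequantized tensors."""
    weights = mx.load(path)  # assumes mx.load understands .gguf, as in this PR
    # Toy behavior: cast every float32 result to the requested dtype; a real
    # implementation would only touch tensors that were stored quantized.
    return {name: w.astype(dequant_dtype) if w.dtype == mx.float32 else w
            for name, w in weights.items()}
```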
> If you want a to_f16 method, I can do it right now.

I think that would be great!
> We thought about this when implementing our quantization and decided against it.

Got it, thanks.

> I think we could solve this by breaking the GGUF tensor into the...
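If the idea is to split a GGUF tensor into the pieces MLX's quantization already uses (my reading of the thread, not something settled here), those pieces are packed weights plus per-group scales and biases. A quick sketch with `mx.quantize` / `mx.dequantize`; the group size and bit width are just example values:

```python
import mlx.core as mx

w = mx.random.normal((4096, 4096))

# MLX keeps a quantized matrix as three arrays: packed 4-bit weights plus
# per-group scales and biases.
wq, scales, biases = mx.quantize(w, group_size=32, bits=4)

# Dequantizing recombines them; a GGUF quantized tensor could in principle be
# remapped onto the same three pieces instead of being fully dequantized on load.
w_hat = mx.dequantize(wq, scales, biases, group_size=32, bits=4)
```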
PS: If I'm reading [this](https://github.com/ggerganov/llama.cpp/blob/468ea24fb4633a0d681f7ac84089566c1c6190cb/ggml.c#L1525-L1565) correctly, Q4_0 and Q4_1 are also compatible (Q4_0 has scales and no biases, Q4_1 has both).
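For reference, the layouts behind that claim, per the linked ggml.c: a Q4_0 block is a float16 scale `d` followed by 16 bytes holding 32 packed 4-bit values (18 bytes total), and Q4_1 adds a float16 bias `m` after the scale (20 bytes total). A rough pure-Python decoder for a single block, just to show the two formulas:

```python
import numpy as np

QK = 32  # elements per Q4_0 / Q4_1 block

def dequantize_q4_block(block: bytes, has_bias: bool) -> np.ndarray:
    """Decode one Q4_0 (has_bias=False) or Q4_1 (has_bias=True) block.

    Layout per the linked ggml.c: float16 scale d, optional float16 bias m
    (Q4_1 only), then 16 bytes with 32 packed 4-bit values. Low nibbles are
    the first 16 elements of the block, high nibbles the last 16.
    """
    offset = 4 if has_bias else 2
    d = float(np.frombuffer(block[:2], dtype=np.float16)[0])
    qs = np.frombuffer(block[offset:offset + QK // 2], dtype=np.uint8)
    q = np.concatenate([(qs & 0x0F), (qs >> 4)]).astype(np.float32)
    if has_bias:
        m = float(np.frombuffer(block[2:4], dtype=np.float16)[0])
        return d * q + m      # Q4_1: scale and bias
    return d * (q - 8.0)      # Q4_0: scale only, values centered around 8
```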
This is great, @antirez! Thank you. With the callback, we can easily cast to bfloat16, which may be a better default for MLX.
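To make the bfloat16 point concrete: whatever signature the gguflib callback ends up with, on the MLX side the per-tensor work is just a cast. The hook below is a made-up example, not the actual callback API:

```python
import mlx.core as mx

def on_tensor(name: str, array: mx.array) -> mx.array:
    # bfloat16 keeps float32's exponent range, which is why it may be a safer
    # default than float16 for weights coming out of GGUF.
    return array.astype(mx.bfloat16)
```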