Juarez Bochi
It's working now! This [PR](https://github.com/ml-explore/mlx-examples/pull/222) shows it can run tinyllama. Unfortunately, I won't have much time in the next couple of weeks to continue this. Here's what is missing: -...
> No support for key/value metadata, so we still need separate files for the config and tokenizer.
>
> * What's the preferred API here? We could create a separate function...
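To make the question concrete, here is one possible shape for that API. This is only a sketch: the `return_metadata` keyword and the exact key and tensor names below are assumptions, not anything this PR implements.

```python
import mlx.core as mx

# Hypothetical interface: one call returns both the tensors and the GGUF
# key/value metadata, so the config and tokenizer would no longer need
# separate files alongside the weights.
weights, metadata = mx.load("tinyllama.gguf", return_metadata=True)

# `weights` would map tensor names to mx.array; `metadata` would hold the
# GGUF key/value pairs (architecture, tokenizer vocabulary, etc.).
print(metadata.get("general.architecture"))
print(weights["token_embd.weight"].dtype)
```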
> I will take a look!

Thanks!

> Regarding this PR, can you give a quick status update since your last one?

Certainly. Everything I said in https://github.com/ml-explore/mlx/pull/350#issuecomment-1877692745 is...
> Sent a few diffs, here. It looks in great shape to me.

@awni, thanks for having a look and for making the improvements!

> One thing I find...
@awni I've pushed the changes to use the new gguflib API. Please take a look 🙏. Thanks @antirez!
> I wonder if we should expose that behavior as a config option? Or maybe default to fp16 for quantized formats? Or something else?

Defaulting to fp16 makes sense to...
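As a sketch of what that default could look like from the caller's side: the wrapper and its `dequant_dtype` argument are hypothetical, and this toy version simply casts every float32 tensor rather than only the ones that were stored quantized.

```python
import mlx.core as mx

def load_gguf(path, dequant_dtype=mx.float16):
    """Hypothetical wrapper: load GGUF weights and cast dequantized tensors."""
    weights = mx.load(path)  # assumes mx.load understands .gguf, as in this PR
    # Toy behavior: cast every float32 result to the requested dtype; a real
    # implementation would only touch tensors that were stored quantized.
    return {name: w.astype(dequant_dtype) if w.dtype == mx.float32 else w
            for name, w in weights.items()}
```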
> If you want a to_f16 method, I can do it right now.

I think that would be great!
> We thought about this when implementing our quantization and decided against it.

Got it, thanks.

> I think we could solve this by breaking the GGUF tensor into the...
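If the idea is to split a GGUF tensor into the pieces MLX's quantization already uses (my reading of the thread, not something settled here), those pieces are packed weights plus per-group scales and biases. A quick sketch with `mx.quantize` / `mx.dequantize`; the group size and bit width are just example values:

```python
import mlx.core as mx

w = mx.random.normal((4096, 4096))

# MLX keeps a quantized matrix as three arrays: packed 4-bit weights plus
# per-group scales and biases.
wq, scales, biases = mx.quantize(w, group_size=32, bits=4)

# Dequantizing recombines them; a GGUF quantized tensor could in principle be
# remapped onto the same three pieces instead of being fully dequantized on load.
w_hat = mx.dequantize(wq, scales, biases, group_size=32, bits=4)
```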
PS: If I'm reading [this](https://github.com/ggerganov/llama.cpp/blob/468ea24fb4633a0d681f7ac84089566c1c6190cb/ggml.c#L1525-L1565) correctly, Q4_0 and Q4_1 are also compatible (Q4_0 has scales and no biases, Q4_1 has both).
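For reference, the layouts behind that claim, per the linked ggml.c: a Q4_0 block is a float16 scale `d` followed by 16 bytes holding 32 packed 4-bit values (18 bytes total), and Q4_1 adds a float16 bias `m` after the scale (20 bytes total). A rough pure-Python decoder for a single block, just to show the two formulas:

```python
import numpy as np

QK = 32  # elements per Q4_0 / Q4_1 block

def dequantize_q4_block(block: bytes, has_bias: bool) -> np.ndarray:
    """Decode one Q4_0 (has_bias=False) or Q4_1 (has_bias=True) block.

    Layout per the linked ggml.c: float16 scale d, optional float16 bias m
    (Q4_1 only), then 16 bytes with 32 packed 4-bit values. Low nibbles are
    the first 16 elements of the block, high nibbles the last 16.
    """
    offset = 4 if has_bias else 2
    d = float(np.frombuffer(block[:2], dtype=np.float16)[0])
    qs = np.frombuffer(block[offset:offset + QK // 2], dtype=np.uint8)
    q = np.concatenate([(qs & 0x0F), (qs >> 4)]).astype(np.float32)
    if has_bias:
        m = float(np.frombuffer(block[2:4], dtype=np.float16)[0])
        return d * q + m      # Q4_1: scale and bias
    return d * (q - 8.0)      # Q4_0: scale only, values centered around 8
```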
This is great, @antirez! Thank you. With the callback, we can easily cast to bfloat16, which may be a better default for MLX.
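To make the bfloat16 point concrete: whatever signature the gguflib callback ends up with, on the MLX side the per-tensor work is just a cast. The hook below is a made-up example, not the actual callback API:

```python
import mlx.core as mx

def on_tensor(name: str, array: mx.array) -> mx.array:
    # bfloat16 keeps float32's exponent range, which is why it may be a safer
    # default than float16 for weights coming out of GGUF.
    return array.astype(mx.bfloat16)
```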