Export model weights in GGUF (fp16) format from fuse.py
Nice to Have
- [ ] Support BPE tokenizer
- [ ] Support SentencePiece tokenizer
- [ ] Export quantized MLX model
Completed Tasks
- [x] Export fp16 format model weights
- [x] Support HF tokenizer
@awni As discussed in https://github.com/ml-explore/mlx/issues/814#issuecomment-1987101265, maybe we should only support converting to GGUF fp16 for now. The current implementation uses the HF vocab, which should be good for a first cut since we are already using AutoTokenizer. If there is a need to support BPE or SentencePiece tokenizer vocabularies, we can add them later.
I have tested it locally for Llama, Mistral, and Mixtral, and it seems to work as expected.
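For context, here is a minimal sketch of the fp16 path I have in mind (not the exact code in this PR), assuming `mx.save_gguf` handles the file writing. `export_gguf_fp16` and `hf_path` are just illustrative names, and the HF-to-llama.cpp tensor-name remapping (e.g. `model.layers.0.self_attn.q_proj.weight` -> `blk.0.attn_q.weight`) plus most of the hyperparameter metadata are left out:

```python
# A minimal sketch (not the PR implementation) of the fp16 export path:
# cast the fused weights to float16 and write them, along with the HF
# tokenizer vocab, into a GGUF file.
import mlx.core as mx
from transformers import AutoTokenizer


def export_gguf_fp16(weights: dict, hf_path: str, out_file: str = "model-f16.gguf"):
    # Cast floating-point tensors to fp16; llama.cpp can quantize from there.
    float_dtypes = (mx.float32, mx.bfloat16, mx.float16)
    fp16_weights = {
        name: w.astype(mx.float16) if w.dtype in float_dtypes else w
        for name, w in weights.items()
    }

    # Pull the vocabulary from the HF tokenizer, ordered by token id.
    tokenizer = AutoTokenizer.from_pretrained(hf_path)
    tokens = [t for t, _ in sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1])]

    # GGUF metadata keys follow llama.cpp's tokenizer.ggml.* conventions.
    metadata = {
        "general.architecture": "llama",
        "tokenizer.ggml.model": "gpt2",  # HF BPE-style vocab
        "tokenizer.ggml.tokens": tokens,
    }
    mx.save_gguf(out_file, fp16_weights, metadata)
```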
@mzbac this is ready for review now right?
Yeah, it mostly works with Llama, Mixtral, and Mistral. However, I am not sure that providing only GGUF fp16 adds much value, since people still need llama.cpp to quantize it for more practical usage. Ideally, if we could also produce Q4 quantization for GGUF, it would be more beneficial, but I may need some guidance from you on how to convert MLX weights to GGUF Q4_0.
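For concreteness, here is my rough understanding of the Q4_0 block layout, sketched in NumPy (untested against llama.cpp, so please correct me if I am off): 32 weights per block, stored as an fp16 scale followed by 16 bytes packing two 4-bit values each.

```python
# Rough, untested sketch of Q4_0 block quantization as I understand it:
# d = max_abs_value / -8, q = round(x / d) + 8, packed two values per byte
# (low nibble = first half of the block, high nibble = second half).
import numpy as np


def quantize_q4_0(x: np.ndarray) -> bytes:
    assert x.size % 32 == 0, "Q4_0 expects the tensor size to be a multiple of 32"
    x = x.astype(np.float32).reshape(-1, 32)  # one row per 32-weight block

    # The element with the largest magnitude maps to -8, which fixes the scale.
    idx = np.argmax(np.abs(x), axis=1)
    maxv = x[np.arange(x.shape[0]), idx]
    d = maxv / -8.0
    inv_d = np.where(d == 0.0, 0.0, 1.0 / np.where(d == 0.0, 1.0, d))

    # Quantize to 0..15 (offset by +8, mirroring the reference quantizer).
    q = np.clip(np.floor(x * inv_d[:, None] + 8.5), 0, 15).astype(np.uint8)

    # Pack two 4-bit values per byte: first 16 values in the low nibbles,
    # last 16 values in the high nibbles.
    packed = (q[:, :16] | (q[:, 16:] << 4)).astype(np.uint8)

    out = bytearray()
    for scale, row in zip(d.astype(np.float16), packed):
        out += scale.tobytes() + row.tobytes()  # 2 + 16 = 18 bytes per block
    return bytes(out)
```

The part I am unsure about is how to get blocks like these from the MLX side into the GGUF file, i.e. whether the writer can accept pre-quantized Q4_0 data, which is where I would need your input.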