
Export the GGUF (fp16) format model weights from fuse.py

Open mzbac opened this issue 1 year ago • 3 comments

Nice to Have

  • [ ] Support BPE tokenizer
  • [ ] Support SentencePiece tokenizer
  • [ ] Export quantized MLX model

Completed Tasks

  • [x] Export fp16 format model weights
  • [x] Support HF tokenizer

mzbac avatar Mar 10 '24 03:03 mzbac

@awni As discussed in https://github.com/ml-explore/mlx/issues/814#issuecomment-1987101265, maybe we should only support converting to GGUF fp16 for now. The current implementation uses the HF vocab, which should be fine for a first cut since we are already using AutoTokenizer. If there is a need to support BPE or SentencePiece tokenizer vocabs, we can add that later.

I have tested it locally with Llama, Mistral, and Mixtral, and it seems to work as expected.
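For anyone following along, here is a minimal sketch of the approach described above. It assumes `mlx.core.save_gguf` accepts a dict of arrays plus string/list metadata; the metadata keys (`general.architecture`, `tokenizer.ggml.*`) and the function name `export_gguf_fp16` are illustrative, not necessarily what fuse.py actually writes.

```python
# Hedged sketch: export fused weights as GGUF fp16, reusing the HF tokenizer vocab.
import mlx.core as mx
from transformers import AutoTokenizer

def export_gguf_fp16(weights: dict, hf_repo: str, out_path: str = "model.gguf"):
    # Cast fused weights to fp16 so the GGUF tensors match llama.cpp's F16 type.
    arrays = {name: w.astype(mx.float16) for name, w in weights.items()}

    # Pull the vocab from the HF tokenizer instead of re-implementing BPE/SentencePiece.
    tokenizer = AutoTokenizer.from_pretrained(hf_repo)
    vocab = sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1])
    metadata = {
        "general.architecture": "llama",        # assumed key names
        "tokenizer.ggml.model": "gpt2",         # HF/BPE-style vocab
        "tokenizer.ggml.tokens": [tok for tok, _ in vocab],
    }

    mx.save_gguf(out_path, arrays, metadata)
```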

mzbac avatar Mar 10 '24 11:03 mzbac

@mzbac this is ready for review now right?

awni avatar Mar 13 '24 04:03 awni

@mzbac this is ready for review now right?

Yeah, it kind of works with Llama, Mixtral, and Mistral. However, I'm not sure that providing only GGUF fp16 adds much value, since people still need llama.cpp to quantize it for practical use. Ideally, if we could provide Q4_0 quantization for GGUF it would be more useful, but I may need some guidance from you on how to convert MLX weights to GGUF Q4_0 quantization.
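As a starting point for discussion, here is a rough numpy sketch of the Q4_0 block format as I understand it from ggml (32 weights per block, one fp16 scale plus 16 bytes of packed 4-bit values). The nibble ordering and rounding follow my reading of ggml's `quantize_row_q4_0` and have not been validated against llama.cpp, so treat this as an assumption rather than a reference implementation.

```python
# Hedged sketch of Q4_0 block quantization (assumed layout, untested against llama.cpp).
import numpy as np

QK4_0 = 32  # block size ggml uses for Q4_0

def quantize_q4_0(x: np.ndarray) -> bytes:
    x = x.astype(np.float32).reshape(-1, QK4_0)
    out = bytearray()
    for block in x:
        # The scale is chosen so the value with the largest magnitude maps to -8.
        amax_idx = np.argmax(np.abs(block))
        d = block[amax_idx] / -8.0
        inv_d = 1.0 / d if d != 0 else 0.0
        q = np.clip(np.round(block * inv_d + 8.0), 0, 15).astype(np.uint8)
        # Pack two 4-bit values per byte: element j in the low nibble, j+16 in the high nibble.
        packed = (q[:16] | (q[16:] << 4)).astype(np.uint8)
        # Each block is serialized as an fp16 scale followed by 16 packed bytes.
        out += np.float16(d).tobytes() + packed.tobytes()
    return bytes(out)
```

If this layout is right, the remaining work would mostly be wiring the per-block bytes into the GGUF tensor writer with the Q4_0 type tag.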

mzbac avatar Mar 13 '24 04:03 mzbac