Add Chatterbox
I've ported Chatterbox to MLX in Python, and it's working well. I haven't uploaded the weights to Hugging Face yet, in case any adjustments need to be made.
The 4-bit quantized model is about half the size of the fp16 model (~1.6GB vs. ~3GB) and produces good results.
I was able to reuse and extend the existing S3 tokenizer.
You'll need to provide a short (5- to 10-second) sample recording of a voice to generate speech.
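As a quick sanity check before generating, you can verify the clip length with the Python standard library. This is a minimal sketch; `reference_duration_ok` is an illustrative helper, not part of mlx-audio, and it assumes the reference is an uncompressed WAV file:

```python
import wave

def reference_duration_ok(path, min_s=5.0, max_s=10.0):
    """Check that a reference WAV falls in the suggested 5-10 second range."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / float(wf.getframerate())
    return min_s <= duration <= max_s
```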
Convert weights and save locally:

```shell
# Full precision (~3GB)
python mlx_audio/tts/models/chatterbox/scripts/convert_chatterbox.py -o ./Chatterbox-TTS-fp16

# 4-bit quantized (~1.6GB; quantizes the T3 backbone only)
python mlx_audio/tts/models/chatterbox/scripts/convert_chatterbox.py -o ./Chatterbox-TTS-4bit --quantize
```
Generate speech with reference audio (voice cloning):

```shell
python -m mlx_audio.tts.generate \
  --model ./Chatterbox-TTS-4bit \
  --text "Hello, this is my cloned voice." \
  --ref_audio sample.wav \
  --play
```
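If you'd rather drive generation from a script, one option is to shell out to the same CLI. This is a sketch, not part of mlx-audio's API: `build_generate_command` is a hypothetical helper, and the flag names are taken directly from the command above:

```python
import subprocess
import sys

def build_generate_command(model, text, ref_audio, play=False):
    """Assemble the mlx_audio.tts.generate invocation shown above."""
    cmd = [sys.executable, "-m", "mlx_audio.tts.generate",
           "--model", model, "--text", text, "--ref_audio", ref_audio]
    if play:
        cmd.append("--play")
    return cmd

# Example (requires this branch of mlx-audio to be installed):
# subprocess.run(build_generate_command("./Chatterbox-TTS-4bit",
#                                       "Hello, this is my cloned voice.",
#                                       "sample.wav", play=True), check=True)
```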
The model is now available in the MLX Community on Hugging Face:
https://huggingface.co/mlx-community/Chatterbox-TTS-fp16
https://huggingface.co/mlx-community/Chatterbox-TTS-8bit
https://huggingface.co/mlx-community/Chatterbox-TTS-4bit
With this branch of mlx-audio installed, you can try it like this:

```shell
mlx_audio.tts --model mlx-community/Chatterbox-TTS-4bit \
  --text "Hello, this is Chatterbox on MLX!" \
  --ref_audio reference.wav \
  --ref_text "."
```
I'm closing this in favor of further development on my own fork of this repo.