Add support for streaming in Orpheus
I see in the changelog for 0.0.3 that we should be able to "Play audio segments as they are generated #26", but I'm having trouble getting that to work.
I might be doing something silly! Here are my results. It still saves a WAV file and only then starts playing:
% python -m mlx_audio.tts.generate --model mlx-community/orpheus-3b-0.1-ft-4bit --text "Hello world" --play
Fetching 6 files: 100%|████████████████████████| 6/6 [00:00<00:00, 62601.55it/s]
Model: mlx-community/orpheus-3b-0.1-ft-4bit
Text: Hello world
Voice: None
Speed: 1.0x
Language: a
0%| | 0/1200 [00:00<?, ?it/s]mx.metal.set_wired_limt is deprecated and will be removed in a future version. Use mx.set_wired_limit instead.
mx.metal.get_peak_memory is deprecated and will be removed in a future version. Use mx.get_peak_memory instead.
mx.metal.clear_cache is deprecated and will be removed in a future version. Use mx.clear_cache instead.
10%|███▋ | 114/1200 [00:00<00:07, 143.25it/s]
==========
Duration: 00:00:01.365
Samples/sec: 0.7
Prompt: 1 tokens, 0.7 tokens-per-sec
Audio: 1 samples, 0.7 samples-per-sec
Real-time factor: 1.12x
Processing time: 1.22s
Peak memory usage: 1.92GB
✅ Audio successfully generated and saving as: audio_000.wav
Hey,
No, you are not. It's a missing feature actually.
At the moment, Orpheus generates all the tokens first and only then do we play the audio. Will be fixed :)
It can only stream text that you split (paragraph N).
Thanks for the reply Blaizzy, and thanks for all of your work!
It would be really exciting to have streaming! I think a lot of us are working on speech to speech pipelines, myself included, and a streaming output from the TTS is the last gap to close.
I have an M1 Ultra that can generate faster than real time with both Orpheus and CSM, and I love the results. If I could just play the first audio bytes out sooner, I'd be so happy!
There is another project that has implemented a streaming solution for CSM, but it is CUDA based: https://github.com/davidbrowne17/csm-streaming
I attempted to bring it over to MLX myself, but their implementation appears to use some operations that MLX does not support, and that is unfortunately over my head at this time.
I'm really looking forward to this great addition!
@Blaizzy, you mentioned:
It can only stream text that you split (paragraph N).
What are you referring to, exactly? Is streaming something already available somehow?
Quick update on this: as already pointed out in https://github.com/Blaizzy/mlx-audio/issues/87, Orpheus doesn't process multiline text.
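In case it helps anyone in the meantime, here is a minimal sketch of the "split the text yourself" idea mentioned above: break the input on newlines and hand each piece to the TTS one at a time, so the first piece can be played while later ones are still being generated. This is only an illustration, not how mlx-audio does it internally. `split_paragraphs`, `stream_segments`, and the `synthesize` callback are hypothetical names; `synthesize` stands in for whatever TTS call you actually use.

```python
from typing import Callable, Iterator


def split_paragraphs(text: str) -> list[str]:
    """Split text on newlines and drop empty or whitespace-only pieces."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def stream_segments(
    text: str, synthesize: Callable[[str], bytes]
) -> Iterator[bytes]:
    """Yield the audio for each segment as soon as it has been synthesized.

    `synthesize` is a hypothetical placeholder (an assumption, not an
    mlx-audio API): it takes one single-line string and returns audio bytes.
    """
    for segment in split_paragraphs(text):
        # Each segment is synthesized lazily, so the caller can start
        # playing the first result before the rest have been generated.
        yield synthesize(segment)
```

You would loop over `stream_segments(text, my_tts_call)` and push each chunk to your audio player as it arrives. Because every piece is a single line, this also sidesteps the multiline limitation, at the cost of a possible gap or change in prosody between pieces.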