Dia output audio is too fast
https://x.com/eaccelerate_42/status/1916232819082494155?s=46
@lucasnewman here is an interesting edge case for Dia.
This prompt generates audio with very fast speech:
python -m mlx_audio.tts.generate --model mlx-community/Dia-1.6B --text "[S1] Dr. Aris, the AI progress is just breathtaking, isn't it? Where do you see this heading in, say, 50 years? [S2] Dr. Lena, honestly? It’s both exhilarating and terrifying. Sometimes I think we're on the cusp of true AGI. Imagine fully conscious machines! [S1] Consciousness! (clears throat) That's the Pandora's Box. The ethical frameworks lag so far behind. We need more than just technical patches. [S2] Agreed. But the upsides! Solving scarcity, disease, maybe even mortality! (laughs) It warrants consideration! [S1] Consideration now, perhaps. But the existential risks... uncontrollable superintelligence, societal upheaval. [S2] Maybe control is the wrong paradigm? Perhaps symbiosis? A partnership? It's a lot to consider. [S1] A partnership with the unknown. Let's hope it's a harmonious future we're building." --file_prefix scientists_filtered_tags_output
I noticed that reducing the temperature makes the output slightly slower (+2 sec).
Breaking the text into chunks of at most 4 turns seems to address the speed issue:
python -m mlx_audio.tts.generate --model mlx-community/Dia-1.6B --text "[S1] Dr. Aris, the AI progress is just breathtaking, isn't it? Where do you see this heading in, say, 50 years? [S2] Dr. Lena, honestly? It’s both exhilarating and terrifying. Sometimes I think we're on the cusp of true AGI. Imagine fully conscious machines! [S1] Consciousness! (clears throat) That's the Pandora's Box. The ethical frameworks lag so far behind. We need more than just technical patches. [S2] Agreed. But the upsides! Solving scarcity, disease, maybe even mortality! (laughs) It warrants consideration! \\n [S1] Consideration now, perhaps. But the existential risks... uncontrollable superintelligence, societal upheaval. [S2] Maybe control is the wrong paradigm? Perhaps symbiosis? A partnership? It's a lot to consider. [S1] A partnership with the unknown. Let's hope it's a harmonious future we're building." --file_prefix scientists_filtered_tags_output
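For anyone who wants to script this workaround instead of editing the prompt by hand, here's a rough sketch (a hypothetical helper, not part of mlx_audio) that splits a Dia prompt on its `[S1]`/`[S2]` tags and regroups it into chunks of at most 4 turns, each of which can then be generated separately:

```python
import re

def chunk_turns(text, max_turns=4):
    """Split a Dia prompt into chunks of at most `max_turns` speaker turns.

    Splits right before each [S1]/[S2] tag so the tag stays attached to
    its turn, then regroups the turns. Each chunk can be passed to a
    separate generate call and the resulting audio concatenated.
    """
    # Lookahead keeps the speaker tag with the text that follows it.
    turns = re.split(r"(?=\[S[12]\])", text.strip())
    turns = [t.strip() for t in turns if t.strip()]
    return [" ".join(turns[i:i + max_turns])
            for i in range(0, len(turns), max_turns)]
```

A six-turn prompt, for example, would come back as two chunks of four and two turns respectively.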
That's just a problem with the model in general; the PyTorch implementation behaves the same way. With longer text, instead of increasing the output length, it just speeds up the speech. As you said, breaking it up helps. It cannot handle long text right now - see https://github.com/nari-labs/dia/issues/35
I put up a change here that works around this to some extent by splitting S1/S2 segments: https://github.com/Blaizzy/mlx-audio/pull/100
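The idea behind that change (this is just a sketch of the approach, not the PR's actual code) is to break the prompt into individual per-speaker segments first, so each one can be generated on its own:

```python
import re

def split_segments(text):
    """Split a Dia prompt into (speaker, text) pairs, one per [S1]/[S2] turn.

    Each pair can then be fed to the model separately and the resulting
    waveforms concatenated in order.
    """
    pattern = re.compile(r"\[(S[12])\]\s*([^\[]+)")
    return [(m.group(1), m.group(2).strip()) for m in pattern.finditer(text)]
```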
You also need to make sure you're passing --sample_rate 44100 to the generation; otherwise the reference audio and the audio player won't use the correct sample rate, since it isn't auto-detected from the model for now.
The model still struggles with excessively long pauses in some situations, especially the ellipsis break used in the example here: 'But the existential risks... ' -- I wonder if some post-processing of the logits to add a repetition penalty could help.
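To illustrate what that post-processing could look like, here's a minimal sketch of the standard CTRL-style repetition penalty applied to a logits vector (a hypothetical helper using NumPy for clarity; in practice it would run on the MLX arrays, per codebook, over the recently generated audio tokens):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_tokens, penalty=1.3):
    """Down-weight tokens that already appear in the generated sequence.

    Standard scheme: divide positive logits by `penalty` and multiply
    negative ones by it, so repeated tokens (e.g. silence/pause codes
    producing a long ellipsis break) become less likely to be sampled.
    """
    logits = logits.copy()
    for tok in set(generated_tokens):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits
```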
Thanks Lucas!
Indeed, make sure the sample rate is 44100, because the default 24000 sounds bad.
> The model still struggles with excessively long pauses in some situations, especially the ellipsis break used in the example here: 'But the existential risks... ' -- I wonder if some post-processing of the logits to add a repetition penalty could help.
I noticed the same; lowering the temperature and ensuring every chunk starts with [S2] improved it. But I'm open to the idea of exploring a repetition penalty.
I opened a new issue for the pauses so we can keep track of it and return to it.