StyleTTS2

Speed difference for longer input text

Open Ananya21162 opened this issue 1 year ago • 3 comments

We are noticing very slow speed for short sentences. For longer sentences, the model starts at a normal speed and then gradually speeds up until it is noticeably fast, which often sounds unnatural. What could be the cause of this? Any help would be appreciated!

Ananya21162 avatar Dec 09 '24 18:12 Ananya21162

Latency generally increases as the input sentence gets longer. However, a slowdown for short sentences is not typical and might indicate an issue. I've worked with StyleTTS2 and successfully reduced its latency by a factor of 2.5 to 3. If you can share your model file, I can investigate further to pinpoint the issue.

One possible reason for unnatural output is that StyleTTS2 is trained on audiobook datasets, where the style is tailored toward narration. This makes it perform well for longer sentences but struggle with shorter text, leading to degraded quality. Additionally, the model is trained with a high maximum sequence length, which could also explain the inconsistency when dealing with shorter inputs.

UmerrAhsan avatar Dec 12 '24 09:12 UmerrAhsan

Thank you so much for your response. I have trained the model on LibriTTS + 50 hrs of audio with max seq length=512. For very short inputs like "Slide 1", the output is very slow. For very long inputs like "The Supplier Accounts Receivable Specialist ensures the accurate submission of supplier invoices by verifying all required details, such as purchase order references and amounts, before uploading them into the system.", the output is relatively fast. I am not sure what the reason could be. Is there something we can do while training the model?
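One way to put numbers on "slow" and "fast" is to compare characters of input text per second of generated audio across short and long inputs. The snippet below is only a rough sketch: the helper name `speaking_rate` is made up for illustration, and the 24000 Hz default sample rate is an assumption that should be replaced with whatever the trained model actually outputs.

```python
def speaking_rate(text: str, num_samples: int, sample_rate: int = 24000) -> float:
    """Return non-whitespace characters of `text` per second of audio.

    `num_samples` is the length of the generated waveform.
    `sample_rate` defaults to 24000 Hz, which is an assumption here.
    """
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    duration_s = num_samples / sample_rate
    # Ignore whitespace so padding or extra spaces do not skew the rate.
    n_chars = len("".join(text.split()))
    return n_chars / duration_s


# Hypothetical example: a 2-second waveform for the short prompt.
rate_short = speaking_rate("Slide 1", num_samples=48000)
print(f"short input: {rate_short:.1f} chars/s")
```

If the rate for short prompts is consistently far below the rate for long ones, that would confirm the effect is in the predicted durations rather than in playback.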

Ananya21162 avatar Dec 20 '24 08:12 Ananya21162

Hi @Ananya21162,

Without seeing the code, I can't say much, but I would suggest performing an internal ablation study. Print the time taken by each component during inference, such as the text encoder, BERT, alignment, prosody predictor, decoder, diffusion, and any other relevant components. This way, you can identify which specific component is causing the issue. Then let me know, and we can debug it further.
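A minimal sketch of such a per-component timing breakdown, using only the Python standard library. The names `timed` and `summarize` are hypothetical helpers, and the stage names in the usage part are placeholders with `time.sleep` standing in for real model calls. If inference runs on a GPU, the kernels are launched asynchronously, so a synchronization callable (for example `torch.cuda.synchronize`) can be passed as `sync` to get meaningful wall-clock numbers.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Maps a stage name to the list of measured durations in seconds.
timings = defaultdict(list)


@contextmanager
def timed(name, sync=None):
    """Time the enclosed block and record it under `name`.

    `sync` is an optional zero-argument callable invoked before and
    after the block, e.g. a GPU synchronize function.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        timings[name].append(time.perf_counter() - start)


def summarize(timings):
    """Return (name, total_seconds, share_of_total) rows, slowest first."""
    totals = {name: sum(values) for name, values in timings.items()}
    grand_total = sum(totals.values())
    ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (name, total, total / grand_total if grand_total > 0 else 0.0)
        for name, total in ordered
    ]


# Usage with placeholder stages; replace the sleeps with the real calls.
with timed("text_encoder"):
    time.sleep(0.01)
with timed("decoder"):
    time.sleep(0.03)

for name, total, share in summarize(timings):
    print(f"{name:<14} {total * 1000:8.1f} ms  {share:6.1%}")
```

Running the same breakdown on a short prompt and on a long prompt makes it easier to see whether one stage dominates in either case.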

UmerrAhsan avatar Dec 20 '24 12:12 UmerrAhsan