
Inference latency

Open · Ananya21162 opened this issue 1 year ago · 8 comments

I was trying out the model with 439 characters of input and saw 5-6 seconds of average latency with the LibriTTS checkpoint. Is there a way to reduce the latency? The decoder takes most of the time. Also, I fine-tuned the model with a few samples from a new speaker and saw the latency increase by a further 600-700 ms; is this expected? Is latency expected to increase if the dataset is larger (English only)? Similarly, if we add more languages, will the model's inference latency increase?

Ananya21162 avatar Oct 10 '24 12:10 Ananya21162

The HiFi-GAN decoder is simply larger and heavier than the iSTFTNet one.

You need to either find another checkpoint pretrained with the iSTFTNet decoder or train a new model yourself from scratch. You can also fine-tune on top of the LJSpeech checkpoint; that is not recommended, but one of my friends managed to get reasonable results that way.

As for your other questions: no, the dataset has no impact on latency. Only your model's parameters matter, and the size of the decoder matters most.

Respaired avatar Oct 12 '24 14:10 Respaired
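
To confirm where the time actually goes, one option is to time each submodule with forward hooks. The sketch below is a minimal, self-contained example that uses a dummy model as a stand-in for the real StyleTTS2 components; the module names (text_encoder, predictor, decoder) and tensor shapes are placeholders, not the actual StyleTTS2 API.

```python
import time
import torch
import torch.nn as nn

# Dummy stand-ins for the real StyleTTS2 components (placeholder shapes).
class DummyTTS(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
        self.predictor = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
        self.decoder = nn.Sequential(*[nn.Linear(512, 512) for _ in range(8)])

    def forward(self, x):
        x = self.text_encoder(x)
        x = self.predictor(x)
        return self.decoder(x)

def attach_timers(model, timings):
    """Record wall-clock time spent in each top-level submodule."""
    def pre_hook(name):
        def fn(module, inputs):
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            timings[name] = time.perf_counter()
        return fn

    def post_hook(name):
        def fn(module, inputs, output):
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            timings[name] = time.perf_counter() - timings[name]
        return fn

    for name, module in model.named_children():
        module.register_forward_pre_hook(pre_hook(name))
        module.register_forward_hook(post_hook(name))

model = DummyTTS().eval()
timings = {}
attach_timers(model, timings)
with torch.no_grad():
    model(torch.randn(1, 200, 512))
for name, dt in timings.items():
    print(f"{name}: {dt * 1000:.1f} ms")
```

Attaching the same kind of hooks to the real model's top-level children should show whether the decoder really dominates the 5-6 s figure.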

Thanks for your reply. We have two models: one trained on LibriTTS-R (360 + 100 hrs) data and the other fine-tuned from it with 20-minute audio samples for multiple speakers. We kept max_len at 100 for the first and 400 for the second. The two models show an average latency difference of nearly 1.5 seconds. Is it because of this parameter? What would be the ideal value?

Ananya21162 avatar Oct 15 '24 13:10 Ananya21162

You're welcome. As I said, your choice of max_len or the dataset shouldn't matter; the decoder has the largest impact.

Respaired avatar Oct 17 '24 17:10 Respaired
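
Before attributing a 1.5 s gap to max_len or the data, it may be worth ruling out measurement noise: compare both checkpoints on the same text, after warm-up runs, with the GPU synchronized, and average over several iterations. A generic harness is sketched below; the synthesize callable is a placeholder for whichever inference function you use, not a StyleTTS2 API.

```python
import time
import statistics
import torch

def benchmark(synthesize, text, n_warmup=3, n_runs=10):
    """Average end-to-end latency of an inference callable on fixed input."""
    for _ in range(n_warmup):   # warm-up: CUDA kernel compilation, caches, lazy init
        synthesize(text)
    times = []
    for _ in range(n_runs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        synthesize(text)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), statistics.stdev(times)

# Usage (placeholder names): compare both checkpoints on identical input text.
# mean_a, std_a = benchmark(model_a_infer, sample_text)
# mean_b, std_b = benchmark(model_b_infer, sample_text)
```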

Understood. But in our experiment we checked the decoder size for both models mentioned above, and it was the same for both: 217 MB. Yet the two models still show a latency difference of 1.5 seconds. Do you know of any other possible cause? In fact, we compared all the model components and they are consistent across both:

bert size: 201359360 bits | 25.17 MB
bert_encoder size: 12599296 bits | 1.57 MB
predictor size: 518227584 bits | 64.78 MB
decoder size: 1737263744 bits | 217.16 MB
text_encoder size: 179404800 bits | 22.43 MB
predictor_encoder size: 444186016 bits | 55.52 MB
style_encoder size: 444186016 bits | 55.52 MB
diffusion size: 1620926464 bits | 202.62 MB
text_aligner size: 251790464 bits | 31.47 MB
pitch_extractor size: 168037024 bits | 21.00 MB
mpd size: 1315384640 bits | 164.42 MB
msd size: 8988864 bits | 1.12 MB
wd size: 37556288 bits | 4.69 MB
Total model size: 6939910560 bits | 867.49 MB

Ananya21162 avatar Oct 21 '24 10:10 Ananya21162
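
For reference, per-component numbers like the listing above can be reproduced by summing parameter sizes per module. A rough sketch is shown below; it assumes the components are held in a dict-like mapping of names to nn.Modules (as the StyleTTS2 model-building code returns), which you should verify against your own setup.

```python
import torch.nn as nn

def component_sizes(nets: "dict[str, nn.Module]") -> None:
    """Print per-component parameter size in bits and MB, plus the total."""
    total_bits = 0
    for name, module in nets.items():
        bits = sum(p.numel() * p.element_size() * 8 for p in module.parameters())
        total_bits += bits
        print(f"{name} size: {bits} bits | {bits / 8 / 1e6:.2f} MB")
    print(f"Total model size: {total_bits} bits | {total_bits / 8 / 1e6:.2f} MB")

# Usage (assumption): `model` is the mapping of component modules built by the repo.
# component_sizes(model)
```

Note that parameter count and on-disk checkpoint size measure different things: two checkpoints with identical parameter counts can still differ in file size if one stores extra training state (see the optimizer-state discussion further down).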

Also, one model is trained from scratch and the other is fine-tuned. Would that make any difference? The number of model parameters and the model size are the same for both :/

Ananya21162 avatar Oct 21 '24 12:10 Ananya21162

Unless you change the decoder, or use very short samples with LFInference, there shouldn't be much latency overhead.

Respaired avatar Nov 08 '24 09:11 Respaired

It's unusual that fine-tuning StyleTTS2 increases the checkpoint file size even though the number of parameters in the model stays the same. Has anyone identified the reason for this size increase?

UmerrAhsan avatar Dec 12 '24 09:12 UmerrAhsan

@UmerrAhsan Very late answer, but in case it helps: optimizer state is often saved into the checkpoint. Once you're done training, it makes sense to strip it again to make the checkpoint smaller. If you downloaded a pruned pre-trained checkpoint (no optimizer state included) and then fine-tuned it, you probably saved it again with the optimizer state, so it is bigger than before. I don't know whether that's the case here, though.

DeinAlptraum avatar Mar 07 '25 05:03 DeinAlptraum
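
If the goal is to shrink a fine-tuned checkpoint back down, stripping the optimizer state before re-saving looks roughly like the sketch below. It assumes the checkpoint is a plain dict with a weights entry (e.g. 'net') alongside 'optimizer' and other training bookkeeping, as in the public StyleTTS2 training scripts; check the keys of your own file first.

```python
import torch

ckpt_path = "finetuned_model.pth"        # placeholder path
out_path = "finetuned_model_pruned.pth"  # placeholder path

ckpt = torch.load(ckpt_path, map_location="cpu")
print("checkpoint keys:", list(ckpt.keys()))

# Keep only the model weights; drop optimizer state and other training bookkeeping.
# The 'net' key is an assumption based on the public StyleTTS2 training scripts.
pruned = {"net": ckpt["net"]}
torch.save(pruned, out_path)
```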