
Inference benchmarks on CPU / single-thread performance

Open snakers4 opened this issue 3 years ago • 2 comments

Hi @jaywalnut310 ,

Many thanks for your work! As usual this is very thorough, open and inspiring.

In your paper you publish the GPU speed benchmarks:

We measured the synchronized elapsed time over the entire process to generate raw waveforms from phoneme sequences with 100 sentences randomly selected from the test set of the LJ Speech dataset. We used a single NVIDIA V100 GPU with a batch size of 1.

[screenshot: inference speed comparison table from the VITS paper]
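(For reference, the synchronized timing described in the quote is typically implemented along these lines. This is only a sketch under assumptions, not the authors' script; `model.infer` is a placeholder for whatever entry point the model actually exposes.)

```python
import time
import torch

def benchmark_gpu(model, phoneme_batches):
    """Sketch of the synchronized timing described in the quote:
    phonemes -> raw waveform, batch size 1, on a single GPU.
    `model.infer` is a placeholder entry point."""
    model = model.cuda().eval()
    timings = []
    with torch.no_grad():
        for phonemes in phoneme_batches:
            x = phonemes.cuda().unsqueeze(0)   # batch size 1
            torch.cuda.synchronize()           # flush pending kernels before timing
            start = time.perf_counter()
            _ = model.infer(x)                 # generate the raw waveform
            torch.cuda.synchronize()           # wait until the audio is actually ready
            timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)         # mean seconds per sentence
```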

But somehow you do not say anything about running on CPU (a single CPU thread). Notably, this is also omitted from papers like TalkNet / FastSpeech / Glow-TTS (I believe this paper is essentially Glow-TTS meets HiFi-GAN).

The only paper saying anything about CPU speed is LightSpeech:

[screenshot: CPU inference speed results from the LightSpeech paper]

Is this because flow-based models do not lend themselves well to CPU inference?

snakers4 avatar Jun 15 '21 07:06 snakers4

Hi @snakers4. We reported inference speed tests on a GPU server rather than CPU-only environments, as it is a representative indicator for speed comparison in many papers. I think there is no reason for VITS and flow-based TTS models to be worse than other TTS models such as FastSpeech 2, but if you're focusing on CPU or on-device inference, it would be great to check each model's real-time factor.
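(For reference, real-time factor is synthesis time divided by the duration of the generated audio, so RTF < 1 means faster than real time. A minimal single-threaded CPU measurement could look like the sketch below; `model.infer` returning a raw waveform tensor and the 22050 Hz LJ Speech sample rate are assumptions:)

```python
import time
import torch

def real_time_factor_cpu(model, phonemes, sample_rate=22050):
    """Measure RTF (synthesis seconds / audio seconds) on one CPU thread.
    Assumes `model.infer` takes a (1, T) phoneme tensor and returns the
    raw waveform; 22050 Hz matches LJ Speech."""
    torch.set_num_threads(1)                    # force single-thread CPU inference
    model = model.cpu().eval()
    with torch.no_grad():
        start = time.perf_counter()
        wav = model.infer(phonemes.unsqueeze(0))
        elapsed = time.perf_counter() - start
    audio_seconds = wav.numel() / sample_rate   # duration of the generated audio
    return elapsed / audio_seconds              # RTF < 1 => faster than real time
```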

jaywalnut310 avatar Jun 15 '21 23:06 jaywalnut310

Many thanks for your reply! Did you by any chance run any benchmarks of your models in CPU-only environments that you did not include in the final paper?


We are still using plain Tacotron as our main TTS model, and we are now weighing the pros and cons of investing time in developing something akin to TalkNet / LightSpeech / FastSpeech (there is no good code available for the smaller models, though) versus your new pipeline.

TalkNet / LightSpeech / FastSpeech pros and cons:

  • Requires more annotation (i.e. alignments / durations), though these can be extracted fairly easily from Tacotron attention or CTC models (see the sketch after this list)
  • Quite straightforward, no complicated moving parts
  • Known, production-tested modules (convolutions or transformer blocks); will perform well on CPU for sure
  • The smallest reported model is as small as 1-2M params, though there is no code from the authors
  • With an extremely fast model (closer to LightSpeech), there will probably be less need for the vocoder
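To make the duration-extraction point above concrete: given a trained Tacotron's attention matrix, per-phoneme durations can be read off by assigning each decoder frame to its most-attended phoneme. A rough sketch (real pipelines usually add a monotonicity cleanup on top):

```python
import numpy as np

def durations_from_attention(attn):
    """attn: (decoder_frames, n_phonemes) attention weights from a trained
    Tacotron. Each decoder frame is assigned to its most-attended phoneme;
    a phoneme's duration is the number of frames that landed on it."""
    frame_to_phoneme = attn.argmax(axis=1)                        # hard alignment per frame
    durations = np.bincount(frame_to_phoneme, minlength=attn.shape[1])
    return durations                                              # sums to decoder_frames
```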

VITS pros and cons:

  • End-to-end, does not require alignment
  • Looks much more complicated and may require much more parameter tuning for other languages
  • On the surface looks like it requires more compute to train
  • High reported quality
  • Contains HiFi-GAN, which works

Correct me if I am wrong here, but I believe many people are doing similar mental calculations right now.

snakers4 avatar Jun 17 '21 08:06 snakers4