IMS-Toucan

Use PyTorch Lightning to speed up training?

Ca-ressemble-a-du-fake opened this issue 1 year ago • 3 comments

Hi,

In a previous answer you wrote that you were looking for ways to improve training speed even though you were already satisfied with Toucan's training performance.

Have you ever considered using PyTorch Lightning? It offers, for example, automatic optimization, early stopping (stopping training once some condition is met), mixed precision, and more, which may speed up training. With more than one GPU it could also improve training performance without you having to think too much about it.

I don't want to push you to use that framework in any way; I just want to let you know it exists. If needed, I could give you a hand converting the project to Lightning, since they say it is basically just reorganizing the existing code.
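To give a rough idea, this is the kind of Trainer setup I mean (just a sketch; the module and data-module objects are placeholders, not existing Toucan code):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,                                # data-parallel training via DDP
    precision="16-mixed",                     # automatic mixed precision
    max_epochs=100,
    callbacks=[EarlyStopping(monitor="val_loss", patience=5)],
)

# trainer.fit(lightning_module, datamodule=data_module)
# "lightning_module" and "data_module" are placeholders: wrappers around the
# existing Toucan model and dataset code would have to be written first.
```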

Regards

Ca-ressemble-a-du-fake avatar Mar 24 '23 05:03 Ca-ressemble-a-du-fake

Most of the advantages that such automatic training frameworks promise unfortunately don't apply to TTS. We have sequences of mixed lengths within a batch, which makes splitting batches for multi-GPU training extremely difficult. Early stopping doesn't make sense for TTS either, because there is no metric that accurately measures performance; it's all subjective.
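To illustrate the batching problem, here is a generic sketch (not our actual collate code) of what a TTS batch looks like:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_tts_batch(batch):
    """Illustrative collate function: pad variable-length text and spectrogram
    sequences to the longest item in the batch and keep the true lengths for masking."""
    texts, specs = zip(*batch)
    text_lens = torch.tensor([t.size(0) for t in texts])
    spec_lens = torch.tensor([s.size(0) for s in specs])
    padded_texts = pad_sequence(texts, batch_first=True)   # [B, T_max]
    padded_specs = pad_sequence(specs, batch_first=True)   # [B, S_max, mel_bins]
    return padded_texts, text_lens, padded_specs, spec_lens

# Splitting such a batch across GPUs is awkward: the padded length is dictated
# by the longest sample, so naive chunking either wastes compute on padding or
# leaves the devices with very unbalanced workloads.
```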

Speed improvements like mixed precision with gradient scaling are already used in Toucan, because they are included in basic PyTorch with no need for a separate framework. Mixed precision will however be removed in a future release, because it does not go together well with normalizing flows: even with a gradient scaler, numeric underflows happen frequently and the model collapses completely when mixed precision is used.
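For reference, the plain-PyTorch pattern I mean looks roughly like this (a generic sketch, not the actual Toucan training loop):

```python
import torch

model = torch.nn.Linear(80, 80).cuda()             # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for batch in loader:                                # "loader" is a placeholder DataLoader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                 # run the forward pass in fp16/fp32 mix
        loss = model(batch).pow(2).mean()           # stand-in loss
    scaler.scale(loss).backward()                   # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```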

Stochastic weight averaging is likewise taken from basic PyTorch, because Lightning's SWA is again incompatible with the weight normalization that is used in the normalizing flow for speech enhancement.
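Again just as a rough sketch of the plain-PyTorch SWA utilities I mean (the training function is a placeholder):

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR

model = torch.nn.Linear(80, 80)                     # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
swa_model = AveragedModel(model)                    # keeps the running weight average
swa_scheduler = SWALR(optimizer, swa_lr=5e-4)

for epoch in range(100):
    train_one_epoch(model, optimizer)               # placeholder for the actual training step
    if epoch > 75:                                  # start averaging late in training
        swa_model.update_parameters(model)
        swa_scheduler.step()

# torch.optim.swa_utils.update_bn(loader, swa_model) would recompute BatchNorm
# statistics afterwards, but that is only needed if the model contains BatchNorm layers.
```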

So overall, I'm constantly looking at new methods for speed increases, but every tool that promises to be a one-size-fits-all solution that I have seen so far has turned out to be incompatible with TTS for one reason or another.

Flux9665 avatar Apr 06 '23 12:04 Flux9665

Ok, thanks for your reply. Now I have to learn what "normalizing flows" means (especially what a flow is in TTS) 😉.

Ca-ressemble-a-du-fake avatar Apr 06 '23 17:04 Ca-ressemble-a-du-fake

A normalizing flow is just another type of generative model: you learn a mapping from one random variable to another through a stepwise transformation in one direction, and during inference you use the inverse function to go the other direction. This has been used as the speech decoder in GlowTTS, and in PortaSpeech it has been proposed as a speech-enhancement step applied after the TTS has already produced a preliminary spectrogram.

We use the normalizing flow the same way PortaSpeech does: we first produce a spectrogram, then we use spectral gating to enhance the speech a little, and then we apply another post-processing step, which is this normalizing flow, which enhances the spectrogram even further. Because the normalizing flow is trained by maximising the log-likelihood of samples under a target distribution, there are some numerical instabilities, which don't go together well with mixed precision.
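To make that a bit more concrete, here is a toy sketch of the training objective, with a single affine transform standing in for a real stacked flow like the PortaSpeech PostFlow:

```python
import torch

# Toy invertible transform f(x) = x * exp(s) + t with learnable s, t.
# Real flows stack many such coupling steps, but the objective has the same shape.
s = torch.zeros(80, requires_grad=True)
t = torch.zeros(80, requires_grad=True)

def forward_and_logdet(x):
    z = x * torch.exp(s) + t
    log_det = s.sum().expand(x.size(0))        # log|det df/dx| of the affine map
    return z, log_det

x = torch.randn(16, 80)                        # stand-in for spectrogram frames
z, log_det = forward_and_logdet(x)

# Maximise the log-likelihood under a standard-normal prior on z:
#   log p(x) = log N(f(x); 0, I) + log|det df/dx|
log_prob_z = -0.5 * (z ** 2 + torch.log(torch.tensor(2 * torch.pi))).sum(dim=1)
nll = -(log_prob_z + log_det).mean()
nll.backward()

# Inference goes the other way: x = (z - t) * exp(-s).
# The exp/log terms in this objective are exactly where fp16 underflows
# and training collapses when mixed precision is used.
```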

Flux9665 avatar Apr 13 '23 12:04 Flux9665