Different Audio Outputs with the Same Model
I trained the first and second stages on a device with a GPU, and the results were quite successful. I then copied the generated epoch, config, and first-stage training files to another device that only has a CPU. However, the audio generated on the CPU device sounds different from the one generated on the GPU device. Isn’t it supposed to be enough to transfer just the epoch, config, and first-stage training files to run the model? Is it possible for the same model to produce different audio outputs on a GPU and a CPU?
There's some information missing here - what kind of model did you train (single language, multilingual; single speaker, multi-speaker)?
From my experience, if you're moving the data between machines, you'll need the config file from the Models folder, not the one from the Configs folder (unless you used a fixed sigma). Sigma gets auto-generated during training and written into the config in the Models folder, so if you use the config from the Configs folder you won't have the correct sigma and will probably see lots of artifacts in the inferred output.
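A quick way to check this is to diff the two config files and see which fields exist only in the Models copy. The snippet below is a minimal sketch; the file names and the `sigma:` line are hypothetical stand-ins, so point it at the actual configs from your run.

```python
# Sketch: diff the config in Models/ against the one in Configs/ to spot
# fields (such as an auto-generated sigma) that only exist in the Models copy.
from pathlib import Path

def config_diff(path_a: str, path_b: str) -> list[str]:
    """Return config lines present in file A but missing from file B."""
    a = set(Path(path_a).read_text().splitlines())
    b = set(Path(path_b).read_text().splitlines())
    return sorted(a - b)

# Tiny illustration with stand-in files (replace with your real paths):
Path("models_config.yml").write_text("sigma: 1.23\nbatch_size: 8\n")
Path("configs_config.yml").write_text("batch_size: 8\n")
print(config_diff("models_config.yml", "configs_config.yml"))
```

If a line like `sigma: 1.23` shows up only in the Models copy, inferring with the Configs copy would run with the wrong sigma.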
Then, you'll need the 2nd stage model, and if you plan to train this further, then also the final model from the 1st stage.
I've not tried GPU vs CPU but I did try different machines (albeit with same GPU types) and this worked fine for me.
Hi Martin, thank you for getting back to me. I trained the model on a single language (English) with voice recordings from a single speaker, and the results were quite satisfactory. However, as I mentioned in my previous message, when I transferred the config and the second-stage model from the "Models" directory to another computer, I wasn't able to get the same voice output. We don't have enough GPU-equipped devices available, so I wanted to run the trained model on a CPU instead, but the audio outputs from the two machines turned out to be different.
One more thing that comes to mind - the audio will always be a bit different due to randomness. If you need to have the same audio each time, you'd need to seed the random generator (see https://github.com/yl4579/StyleTTS2/issues/292). But I'm not sure if this is the issue if the quality is so much worse.
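For reference, seeding usually means covering every RNG the pipeline touches. Below is a minimal sketch (not StyleTTS2's own code) of the kind of helper discussed in that issue; note that this makes repeated runs reproducible on one machine, but CPU and GPU kernels can still produce slightly different floating-point results from each other.

```python
# Sketch: seed all the RNGs a PyTorch TTS pipeline typically uses,
# so repeated runs on the same machine produce the same audio.
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)  # all GPUs; no-op without CUDA
    # Optional: force deterministic cuDNN kernels at some speed cost.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)
```

Call it once at the top of your inference script, before building the model or sampling any noise.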
I also considered that possibility, but the difference seems to be more about audio quality rather than minor variations caused by randomness. When I get the chance to test it on another GPU-equipped machine, I'll share the results and close the ticket. Thanks again for your support.