Matcha-TTS
Starting from N(mu, I) or N(0, I): which is better?
Dear Author,
Hello, I recently made some modifications to your architecture. While studying related work, I noticed that Grad-TTS starts sampling from N(mu, I). When I trained with N(mu, I) in the Matcha-TTS framework, the model produced only noise and could not synthesize speech. Have you ever tried starting from N(mu, I) for synthesis?
Best regards.
In practice, we concatenate mu and the random noise at the input of the U-Net https://github.com/shivammehta25/Matcha-TTS/blob/108906c603fad5055f2649b3fd71d2bbdf222eac/matcha/models/components/decoder.py#L384 so the decoder sees mu either way. The conditional flow matching framework does not depend on the choice of initial distribution, as long as you can sample from it. When I started from $\mathcal{N}(\mu, I)$, I did not notice any major change; it worked much like the current setup, perhaps because of the concatenation. Since it worked for me, my guess is that there is a bug somewhere in your sampling code.
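To make the point about the initial distribution concrete, here is a minimal, dependency-free sketch of conditional flow matching with a switchable prior. It is not the Matcha-TTS code: the names `sample_prior`, `cfm_pair`, `euler_sample`, and the value of `SIGMA_MIN` are assumptions chosen for illustration, and a fixed "oracle" vector field stands in for a trained network. One thing it may help to check when debugging is that the same prior is used at training time and at sampling time.

```python
import random

SIGMA_MIN = 1e-4  # assumed small constant; not taken from the thread


def sample_prior(mu, rng, start_from_mu):
    """Draw x0 from N(mu, I) if start_from_mu is True, else from N(0, I)."""
    return [(m if start_from_mu else 0.0) + rng.gauss(0.0, 1.0) for m in mu]


def cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return the interpolated point x_t and the regression target u_t
    for a straight-line (optimal-transport style) conditional path."""
    x_t = [(1.0 - (1.0 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def euler_sample(vector_field, x0, n_steps=10):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / n_steps
    for i in range(n_steps):
        v = vector_field(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
mu = [0.5, -1.0, 2.0]   # hypothetical encoder output for three mel bins
x1 = [0.4, -0.8, 2.3]   # hypothetical target mel values

# Training side: the target u_t is built from the SAME x0 that the prior produced.
x0 = sample_prior(mu, rng, start_from_mu=True)
x_t, u_t = cfm_pair(x0, x1, t=0.3)

# Sampling side: start from the same kind of prior and integrate the field.
# The constant oracle field u_t replaces a trained network in this sketch.
x_end = euler_sample(lambda x, t: u_t, x0)
```

With the oracle field, `x_end` lands at `x1` up to a `SIGMA_MIN`-scaled remainder of `x0`, whichever prior was chosen. If training draws `x0` from one distribution and sampling starts from another, the learned field is evaluated on inputs it never saw, which is one plausible (but unconfirmed) way to end up with noise.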
Hope this helps.