Matcha-TTS: starting from N(mu, I) or starting from N(0, I), which is better?


Open: zhaojingxin123 opened this issue 8 months ago • 1 comment

Dear Author,

Hello, I recently made some modifications to your architecture. While studying it, I noticed that Grad-TTS samples its starting point from N(mu, I). When I trained with N(mu, I) as the initial distribution in the Matcha-TTS framework, the model produced only noise and could not synthesize speech. Have you ever tried starting from N(mu, I) for synthesis?

Best regards.

Mar 25 '25 01:03 zhaojingxin123

In practice, we concatenate both mu and the random noise as input to the U-Net: https://github.com/shivammehta25/Matcha-TTS/blob/108906c603fad5055f2649b3fd71d2bbdf222eac/matcha/models/components/decoder.py#L384 The conditional flow matching framework does not depend on the initial distribution, as long as you can sample from it. When I started from $\mathcal{N}(\mu, I)$, I did not notice any major change; it worked similarly to the current setup, perhaps because of the concatenation. Since it worked for me, my guess is that there is a bug somewhere in your sampling code.
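One way such a bug could arise is a mismatch between the prior used in training and the one used at sampling time. The toy sketch below illustrates this with the straight-line path often used in conditional flow matching. It is not the Matcha-TTS code: `make_x0`, `cfm_pair`, and `euler_integrate` are hypothetical helper names, the vectors are made-up stand-ins for mu and a mel frame, and an exact "oracle" velocity replaces the trained network.

```python
import random

def make_x0(eps, mu, start_from_mu):
    """Build the starting point: mu + eps for N(mu, I), or eps alone for N(0, I)."""
    return [(m if start_from_mu else 0.0) + e for m, e in zip(mu, eps)]

def cfm_pair(x0, x1, t, sigma_min=0.0):
    """Point on the straight-line path at time t and its target velocity."""
    x_t = [(1 - (1 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t

def euler_integrate(x_start, velocity, n_steps=10):
    """Fixed-step Euler solve from t=0 to t=1 using a constant (oracle) velocity."""
    x = list(x_start)
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x

rng = random.Random(0)
mu = [2.0, -1.0, 0.5]   # hypothetical stand-in for the encoder output
x1 = [1.5, -0.5, 1.0]   # hypothetical stand-in for a target mel frame
eps = [rng.gauss(0.0, 1.0) for _ in mu]

# Velocity target built with the N(mu, I) prior, as in the question.
x0_mu = make_x0(eps, mu, start_from_mu=True)
_, u_mu = cfm_pair(x0_mu, x1, t=0.5)

# Matched: sampling also starts from N(mu, I).
matched = euler_integrate(x0_mu, u_mu)

# Mismatched: sampling starts from N(0, I) while the velocity assumes N(mu, I).
x0_zero = make_x0(eps, mu, start_from_mu=False)
mismatched = euler_integrate(x0_zero, u_mu)

print("target     ", x1)
print("matched    ", matched)
print("mismatched ", mismatched)
```

In this toy setting the matched run ends at `x1`, while the mismatched run ends at `x1 - mu`. A real network's velocity depends on the current state, so the actual failure may look different, but checking that training and sampling draw from the same initial distribution seems a reasonable first step.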

Hope this helps.

Apr 24 '25 09:04 shivammehta25
