Rishikesh (ऋषिकेश)
The speed of a GPU mostly depends on program optimisation, GPU architecture, memory clock, type of memory (not memory size), memory bandwidth, PCIe bandwidth, number of CUDA cores for parallel...
Hi @keonlee9420, DelightfulTTS is similar to [Phone Level Mixture Density Network](https://github.com/rishikksh20/Phone-Level-Mixture-Density-Network-for-TTS), but here, instead of using a complicated GMM-based model, the authors directly used a latent representation for the Prosody Predictor and...
DelightfulTTS learns phoneme-level prosody implicitly, whereas `Emphasis control for parallel neural TTS` learns the same explicitly by extracting features with this [repo](https://github.com/asuni/wavelet_prosody_toolkit).
I think DelightfulTTS is an all-in-one solution: it uses a non-autoregressive architecture with conformer blocks, as well as both an utterance-level and a phoneme-level predictor.
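For intuition, a phoneme-level prosody predictor in this family of models can be sketched roughly as below. This is a minimal PyTorch sketch under my own assumptions (layer sizes, kernel width, and the class/parameter names are illustrative, not the paper's exact configuration): it maps encoder outputs to a per-phoneme latent prosody vector.

```python
# Hypothetical sketch of a phoneme-level prosody predictor, assuming the
# common conv-stack-plus-projection design; all hyperparameters here are
# illustrative choices, not taken from the DelightfulTTS paper.
import torch
import torch.nn as nn


class PhonemeProsodyPredictor(nn.Module):
    """Predicts one latent prosody vector per phoneme position."""

    def __init__(self, d_model: int = 256, d_latent: int = 16, kernel: int = 5):
        super().__init__()
        pad = kernel // 2
        self.conv1 = nn.Conv1d(d_model, d_model, kernel, padding=pad)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel, padding=pad)
        self.proj = nn.Linear(d_model, d_latent)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, phoneme_len, d_model) text-encoder output
        h = x.transpose(1, 2)                 # (batch, d_model, phoneme_len)
        h = torch.relu(self.conv1(h))
        h = torch.relu(self.conv2(h))
        return self.proj(h.transpose(1, 2))   # (batch, phoneme_len, d_latent)


predictor = PhonemeProsodyPredictor()
latents = predictor(torch.randn(2, 7, 256))   # shape (2, 7, 16)
```

An utterance-level predictor would look similar but pool over the time axis to produce a single vector per utterance.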
@keonlee9420 Hi, were you able to train DelightfulTTS successfully?
Did you train the predictor and extractor simultaneously, or did you train the extractor for 100k steps first, then pause it and start predictor training with teacher forcing, as mentioned in AdaSpeech...
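The two-stage schedule asked about above can be sketched as a simple step-dependent switch. This is only a hedged illustration of the schedule described in the comment (the function and flag names are my own); real training loops would gate the losses and optimizer parameter groups accordingly.

```python
# Illustrative sketch of the AdaSpeech-style two-stage schedule: the
# extractor trains alone for the first 100k steps, after which it is
# paused and the predictor trains against the extractor's latents
# (teacher forcing). Names and structure are assumptions.

EXTRACTOR_WARMUP_STEPS = 100_000  # stage boundary mentioned in the comment


def training_stage(step: int) -> dict:
    """Return which components receive gradients at a given global step."""
    if step < EXTRACTOR_WARMUP_STEPS:
        # Stage 1: only the extractor learns; the predictor is idle.
        return {"train_extractor": True,
                "train_predictor": False,
                "teacher_forcing": False}
    # Stage 2: extractor paused; the predictor learns to match the
    # extractor's detached latents, i.e. teacher forcing.
    return {"train_extractor": False,
            "train_predictor": True,
            "teacher_forcing": True}


print(training_stage(0))        # stage 1: extractor only
print(training_stage(100_000))  # stage 2: predictor with teacher forcing
```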
Because in my case I made some modifications to the architecture: I used the same extractors as mentioned in the DelightfulTTS paper, but I am not using any predictor at the utterance level...
I suggest option 1.
@keonlee9420 In your experience, which performs better when you have only 20 hours of speech data: a normal Transformer encoder or a Conformer?
As per this [article](https://www.microsoft.com/en-us/research/blog/azure-ai-milestone-new-neural-text-to-speech-models-more-closely-mirror-natural-speech/), the Microsoft TTS API is built on DelightfulTTS.