Alexey322
Hey. I trained the Tacotron 2 synthesizer from Rayhane-mamah; the synthesized spectrograms sound good when reconstructed with the Griffin-Lim algorithm. Unfortunately, the vocoder in his repository learns with an...
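For reference, a minimal sketch of listening to a predicted mel spectrogram via Griffin-Lim with librosa; the file name and STFT parameters are assumptions and must match whatever the synthesizer was trained with:

```python
# Hypothetical check: reconstruct audio from a predicted mel with Griffin-Lim.
# The mel is assumed to be on a linear power scale (mel_to_audio's default);
# log-mels would need to be exponentiated first.
import numpy as np
import librosa
import soundfile as sf

mel = np.load("predicted_mel.npy")   # assumed shape: (n_mels, frames)
wav = librosa.feature.inverse.mel_to_audio(
    M=mel,
    sr=22050,        # sample rate used during training (assumed)
    n_fft=1024,
    hop_length=256,
    win_length=1024,
    n_iter=60,       # more Griffin-Lim iterations -> fewer phase artifacts
)
sf.write("griffin_lim.wav", wav, 22050)
```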
Why do we need to pad the audio fragment when computing its mel spectrogram? `y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')`
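A small sketch of what this padding achieves, with assumed hyperparameter values: reflection-padding by (n_fft - hop_size) / 2 on each side before an uncentered STFT keeps the number of frames equal to len(y) // hop_size, so the mel spectrogram stays sample-aligned with the audio:

```python
# Hypothetical values; the point is only the frame-count bookkeeping.
import torch

n_fft, hop_size, win_size = 1024, 256, 1024
y = torch.randn(1, 8192)  # dummy mono audio, batch of 1

pad = (n_fft - hop_size) // 2
y_padded = torch.nn.functional.pad(y.unsqueeze(1), (pad, pad), mode="reflect").squeeze(1)

spec = torch.stft(
    y_padded,
    n_fft=n_fft,
    hop_length=hop_size,
    win_length=win_size,
    window=torch.hann_window(win_size),
    center=False,
    return_complex=True,
)
print(spec.shape[-1], y.shape[-1] // hop_size)  # frame counts match: 32 32
```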
@jik876 Hi. I would like to know why you are not using the same parameters (for the V1 configuration) as indicated in the paper. Your code sets the following parameters: "resblock_kernel_sizes":...
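As a hedged aside, one way to compare is to dump the relevant fields from the shipped config and check them against the paper; the file name and key names below are assumptions based on the repo layout:

```python
# Print the ResBlock/upsampling fields of the V1 config for a side-by-side
# comparison with the values reported in the paper.
import json

with open("config_v1.json") as f:
    h = json.load(f)

for key in ("resblock", "resblock_kernel_sizes", "resblock_dilation_sizes",
            "upsample_rates", "upsample_kernel_sizes", "upsample_initial_channel"):
    print(f"{key}: {h.get(key)}")
```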
Hi, @jik876. Can you give some advice on how to correctly change the model for a 44100 Hz sample rate? I don't mean the hyperparameters in the config. For example, how did you...
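A sketch of the one structural constraint involved, with illustrative numbers rather than values from the repo: the product of the generator's upsample rates has to equal the hop size, since each mel frame is expanded into hop_size audio samples:

```python
# Hypothetical 44.1 kHz setup; only the multiplicative constraint is the point.
import math

sampling_rate = 44100
hop_size = 512                   # e.g. ~11.6 ms frames at 44.1 kHz (assumed)
upsample_rates = [8, 8, 4, 2]    # hypothetical; 8 * 8 * 4 * 2 = 512

assert math.prod(upsample_rates) == hop_size, \
    "total upsampling must match hop_size, otherwise output length is wrong"

# kernel sizes are commonly chosen as twice the corresponding upsample rate
upsample_kernel_sizes = [2 * r for r in upsample_rates]
print(upsample_kernel_sizes)     # [16, 16, 8, 4]
```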
Hi. I started training the model from scratch and found that the optimizer uses a dynamically decaying learning rate. If I train the model for 2.5 million steps, then according to...
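Assuming the schedule is a per-epoch exponential decay (lr_epoch = lr_0 * lr_decay ** epoch), a back-of-the-envelope sketch of where the learning rate ends up after long training; the concrete numbers below are illustrative, not taken from the config:

```python
# Illustrative values only; the real epoch count depends on dataset and batch size.
lr_0 = 2e-4
lr_decay = 0.999
steps_per_epoch = 1000
total_steps = 2_500_000

epochs = total_steps // steps_per_epoch
lr_final = lr_0 * lr_decay ** epochs
print(f"after {epochs} epochs the learning rate is ~{lr_final:.3e}")
```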
Hi. I trained Flowtron on two speakers, 50 hours in total, 25 for each. After that, I wanted to train the model on 10 speakers with 20-30...
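One possible (hypothetical) way to warm-start the 10-speaker run from the 2-speaker checkpoint is to copy all matching weights and resize only the speaker embedding table; the checkpoint layout and the key name used below are assumptions and may differ in the actual model:

```python
# Expand a 2-speaker embedding table to 10 speakers before warm-starting.
import torch

ckpt = torch.load("flowtron_2spk.pt", map_location="cpu")
state = ckpt["state_dict"] if "state_dict" in ckpt else ckpt

old_emb = state["speaker_embedding.weight"]        # assumed shape: (2, emb_dim)
n_new_speakers, emb_dim = 10, old_emb.shape[1]

new_emb = torch.randn(n_new_speakers, emb_dim) * old_emb.std()
new_emb[: old_emb.shape[0]] = old_emb              # keep the two trained speakers
state["speaker_embedding.weight"] = new_emb

torch.save({"state_dict": state}, "flowtron_10spk_warmstart.pt")
```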
Hello. Why doesn't the attention use speaker embeddings to find the alignment between text and mel spectrograms? The alignment can vary greatly between speakers speaking in different styles.
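A schematic sketch (all module and variable names are hypothetical) of the modification being suggested: concatenating a speaker embedding to every encoder timestep so the attention memory becomes speaker-aware:

```python
# Make the attention memory speaker-conditioned by broadcasting the speaker
# embedding over the text axis and projecting back to the original width.
import torch
import torch.nn as nn

class SpeakerAwareMemory(nn.Module):
    def __init__(self, text_dim=512, spk_dim=128):
        super().__init__()
        self.proj = nn.Linear(text_dim + spk_dim, text_dim)

    def forward(self, encoder_outputs, speaker_emb):
        # encoder_outputs: (B, T_text, text_dim), speaker_emb: (B, spk_dim)
        spk = speaker_emb.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)
        return self.proj(torch.cat([encoder_outputs, spk], dim=-1))

memory = SpeakerAwareMemory()(torch.randn(2, 40, 512), torch.randn(2, 128))
print(memory.shape)  # torch.Size([2, 40, 512])
```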