
Trouble with long texts

MaratZakirov opened this issue 3 years ago · 6 comments

I am trying long sentences (~240 English characters) and found that the trouble starts at > 80 characters. There are two main problems:

  1. flowtron (python inference.py) keeps only a small portion of the text at default settings.
  2. If I increase n_frames from 400 to 4000 (what n_frames is and why it helps is a separate question), the model no longer throws away a big portion of my text, but it starts to repeat itself instead (possibly an RNN limitation; inputs longer than ~200 symbols may be too long for it to work properly).

What are your thoughts, suggestions?

MaratZakirov avatar Feb 27 '21 12:02 MaratZakirov

does it repeat itself in the middle of the sentence or just at the start? if it's just at the start then it's a gating issue; possibly the gate loss suggests overfitting. you can improve it by re-training (fine-tuning) just the gate layer from scratch. otherwise it's an attention issue.
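
roughly, re-training just the gate layer would look something like this (a sketch, not the repo's training script; the model.flows / flow.gate_layer attribute names are assumptions, check model.py for the real ones):

import torch

# re-initialize the gate projection and freeze everything else,
# so the optimizer only touches the gate parameters.
def retrain_gate_only(model, lr=1e-4):
    for p in model.parameters():
        p.requires_grad = False

    gate_params = []
    for flow in model.flows:
        gate = getattr(flow, 'gate_layer', None)  # assumed attribute name
        if gate is None:
            continue
        for p in gate.parameters():
            if p.dim() == 1:
                torch.nn.init.zeros_(p)            # bias
            else:
                torch.nn.init.xavier_uniform_(p)   # weight
            p.requires_grad = True
            gate_params.append(p)
    return torch.optim.Adam(gate_params, lr=lr)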

rafaelvalle avatar Feb 28 '21 00:02 rafaelvalle

I found that not all voices are equally stable on long sentences. These settings currently work OK for me:

    parser.add_argument('-i', '--id', default=5393, help='Speaker id', type=int)
    parser.add_argument('-n', '--n_frames', help='Number of frames',
                        default=1400, type=int)

n_frames definitely needs to be bigger than 400; otherwise the speaker will "eat" words.
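
My guess (an assumption on my side, not from the docs) is that n_frames caps the length of the generated mel spectrogram, which would explain the word-eating. A quick sanity check, assuming the default 22050 Hz sampling rate and 256-sample hop length:

    # n_frames caps how many mel frames, and therefore how many seconds of
    # audio, inference will generate (sampling rate / hop length assumed).
    def seconds_for_n_frames(n_frames, sampling_rate=22050, hop_length=256):
        return n_frames * hop_length / sampling_rate

    print(seconds_for_n_frames(400))   # ~4.6 s  -> long sentences get cut off
    print(seconds_for_n_frames(1400))  # ~16.3 s -> room for ~240 characters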

The model I use is the LibriTTS one:

    parser.add_argument('-f', '--flowtron_path', default='models/flowtron_libritts.pt',
                        help='Path to flowtron state dict', type=str)
    parser.add_argument('-w', '--waveglow_path', default='models/waveglow_256channels_universal_v5.pt',
                        help='Path to waveglow state dict', type=str)

I also had to fix config.json

n_speakers = 123

        "training_files": "filelists/libritts_train_clean_100_audiopath_text_sid_shorterthan10s_atleast5min_train_filelist.txt",
        "validation_files": "libritts_train_clean_100_audiopath_text_sid_atleast5min_val_filelist.txt",

It seems strange to me that I actually need the training filelists for inference.

With these settings the model can process complex sentences such as: "I would argue that it is shortsighted to consider any country a constant ally or perpetual enemy of England."

Longer ones tend to produce errors.

@rafaelvalle What are your expectations about the amount of data I would need to train just the 128-dimensional speaker embedding (with the rest of the model weights frozen)?
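
To be concrete, the setup I have in mind is roughly this (a sketch; the speaker_embedding attribute name is my assumption, the real name is in model.py):

    import torch

    # freeze the whole model, then train only the speaker embedding rows
    # (attribute name assumed; check flowtron/model.py).
    def train_speaker_embedding_only(model, lr=1e-3):
        for p in model.parameters():
            p.requires_grad = False
        for p in model.speaker_embedding.parameters():
            p.requires_grad = True
        return torch.optim.Adam(model.speaker_embedding.parameters(), lr=lr)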

MaratZakirov avatar Mar 09 '21 09:03 MaratZakirov

@rafaelvalle is it possible to improve Flowtron's attention? I noticed the use_cumm_attention parameter, which enables AttentionConditioningLayer, but it's disabled by default. Could this potentially improve attention alignment? There are also the Graves and Dynamic Convolution Attention methods for Tacotron 2; I don't know if something similar could be adapted for Flowtron.

nicemanis avatar Mar 12 '21 16:03 nicemanis

you can fine-tune the model with use_cumm_attention and check if you get better results. another option is to always use the attention prior, i.e. during training and inference.

other attention models can be adapted to flowtron. please do send a pull request if you decide to implement them.

rafaelvalle avatar Mar 16 '21 23:03 rafaelvalle

@rafaelvalle Thank you for the answer. I am interested in trying out both GMM and DCA attention with Flowtron, although I'm not exactly sure about the adaptation part. Are there any significant differences between Tacotron 2 and Flowtron in regards to how the attention module works between the encoder and decoder? Do you have any tips on how to adapt an attention model that works for Tacotron 2 to Flowtron?

From what I found when comparing the two, they seem quite similar. Both use queries, inputs/values/memory, and processed_inputs/keys/processed_memory. One difference that I spotted was in the energy calculation:

Tacotron 2:

energies = self.v(torch.tanh(processed_query + processed_attention_weights + processed_memory))

Flowtron:

attn = self.v(torch.tanh((queries[:, :, None] + keys[:, None])))
# where keys == text * attn_cond_vector

Instead of adding the cumulative attention weights, like Tacotron 2 does, Flowtron multiplies the keys (processed_memory) by the attention conditioning vector. Do both achieve the same thing, or are they deliberately implemented differently?
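
To make the comparison concrete, here is a toy sketch of the two formulations with made-up shapes (not the repo code; the conditioning vector here is just a random stand-in):

import torch

B, T_in, D = 2, 7, 16                               # illustrative sizes only
v = torch.nn.Linear(D, 1, bias=False)

query = torch.randn(B, 1, D)                        # processed decoder state (one step)
keys = torch.randn(B, T_in, D)                      # processed encoder outputs
loc_feats = torch.randn(B, T_in, D)                 # processed cumulative attention weights
attn_cond = torch.sigmoid(torch.randn(B, T_in, D))  # stand-in conditioning vector

# Tacotron 2: the location features enter additively inside the tanh.
taco2_energy = v(torch.tanh(query + loc_feats + keys)).squeeze(-1)      # (B, T_in)

# Flowtron (with use_cumm_attention): the conditioning vector scales the keys
# before the usual additive query + key energy.
flowtron_energy = v(torch.tanh(query + keys * attn_cond)).squeeze(-1)   # (B, T_in)

print(taco2_energy.shape, flowtron_energy.shape)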

nicemanis avatar Apr 06 '21 12:04 nicemanis

@rafaelvalle I also noticed that Flowtron does not concatenate the attention context to the decoder input when passing it to the attention RNN, like Tacotron 2 does.

Tacotron 2:

# Fragment from Tacotron decoder forward method
cell_input = torch.cat((decoder_input, self.attention_context), -1)
self.attention_hidden, self.attention_cell = self.attention_rnn(
    cell_input, (self.attention_hidden, self.attention_cell))
self.attention_hidden = F.dropout(
    self.attention_hidden, self.p_attention_dropout, self.training)

# self.attention()
self.attention_context = self.attention_layer(
    self.attention_hidden, self.memory, self.processed_memory, self.mask)

Here self.attention_context is updated while iterating over the decoder inputs, and the attention RNN cell input is the concatenation of the individual decoder input and the current attention context.

Flowtron:

# forward method
mel0 = torch.cat([dummy, mel[:-1, :, :]], 0)
attention_hidden = self.attention_lstm(mel0)[0]
attention_context, attention_weights = self.attention_layer(
    attention_hidden, text, text, mask=mask, attn_prior=attn_prior)

# infer method
attention_hidden, (h, c) = self.attention_lstm(output, (h, c))
attention_context, attention_weight = self.attention_layer(
    attention_hidden, text * attn_cond_vector, text, attn=attn,
    attn_prior=attn_prior)

Here, in the infer method, there is a for loop that iterates over n_frames, and h (attention_hidden) and c (attention_cell) are carried over from step to step, similar to the Tacotron 2 approach, but the attention context is not used in the attention RNN cell input. Also, it seems that the forward method does not iterate over the inputs one by one but processes them all together, and, instead of the attention context, a dummy vector is used.

Why does Flowtron's attention RNN take only the mel input, without the attention context? Both Tacotron 2 and Flowtron use attention_hidden and attention_context as decoder RNN inputs afterward.
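
For reference, a toy illustration of the teacher-forced shift I'm describing in forward() (shapes are illustrative only, not the repo's real dimensions):

import torch

T, B, n_mel = 6, 2, 80                       # illustrative shapes
mel = torch.randn(T, B, n_mel)
dummy = torch.zeros(1, B, n_mel)             # "no previous frame" for step 0
mel0 = torch.cat([dummy, mel[:-1]], dim=0)   # frame t sees mel[t-1] as input

lstm = torch.nn.LSTM(n_mel, 128)             # stand-in for the attention LSTM
attention_hidden, _ = lstm(mel0)             # (T, B, 128), all steps in parallel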

nicemanis avatar Apr 07 '21 15:04 nicemanis