
Can't train with output of Tacotron 2

Open tuong-olli opened this issue 3 years ago • 7 comments

The mel output of Tacotron 2 is larger than the mel extracted from the audio, and training fails with the error below:

 File "train.py", line 113, in train
    for i, batch in enumerate(train_loader):
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 8192 and 8119 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:612
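For context, default_collate stacks the per-sample tensors into a single batch tensor, so every item must have exactly the same shape; a minimal repro of the failure (a sketch, not code from the repo):

import torch

a = torch.zeros(1, 8192)  # a full segment
b = torch.zeros(1, 8119)  # a clip that was not padded to segment_size
torch.stack([a, b], 0)    # raises the size-mismatch RuntimeError above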

tuong-olli avatar Jan 25 '21 10:01 tuong-olli

Got 8192 and 8119 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:612

I think this is more likely an incorrectly padded audio tensor; there shouldn't ever be a spectrogram that long being collated.

That said, I'm looking at the code and I can't figure out how this issue can occur. If you can provide more details, that would help a lot.

Which tacotron2 repo are you using, and what hop_length are you using for both tacotron2 and hifigan?
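As a quick way to check for a hop mismatch, here is a minimal sketch (the file paths and the .npy mel format are assumptions):

import numpy as np
from scipy.io import wavfile

hop_size = 256  # must match both the tacotron2 and hifi-gan configs

sr, audio = wavfile.read('wavs/sample.wav')  # hypothetical path
mel = np.load('mels/sample.npy')             # hypothetical path, shape (80, T)

expected_frames = len(audio) // hop_size
print(f'audio implies {expected_frames} frames, mel has {mel.shape[-1]}')
# These should agree to within a frame or so; a larger gap points to a
# hop_size or padding mismatch between the two repos.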

CookiePPP avatar Jan 25 '21 15:01 CookiePPP

I used the parameters below to train successfully on my audio data.

"segment_size": 8192,
    "num_mels": 80,
    "num_freq": 513,
    "n_fft": 1024,
    "hop_size": 256,
    "win_size": 1024,

After that, I synthesized mel spectrograms from a Tacotron 2 model trained on my dataset to train with --fine_tuning True and the same parameters, but it fails with:

Using a target size (torch.Size([1, 80, 759])) that is different to the input size (torch.Size([1, 80, 815])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()

To fix this error I added padding in train.py:

if y_mel.size(2) > y_g_hat_mel.size(2):
    y_g_hat_mel = torch.nn.functional.pad(y_g_hat_mel, (0, y_mel.size(2) - y_g_hat_mel.size(2)), 'constant')
elif y_mel.size(2) < y_g_hat_mel.size(2):
    y_mel = torch.nn.functional.pad(y_mel, (0, y_g_hat_mel.size(2) - y_mel.size(2)), 'constant')

And it then fails with the collate error shown above.

tuong-olli avatar Jan 26 '21 01:01 tuong-olli

I have a very similar issue. I generated the mel files using tacotron2.

train.py:199: UserWarning: Using a target size (torch.Size([1, 80, 240])) that is different to the input size (torch.Size([1, 80, 235])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()
Traceback (most recent call last):
  File "train.py", line 271, in <module>
    main()
  File "train.py", line 267, in main
    train(0, a, h)
  File "train.py", line 199, in train
    val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()
  File "C:\Anaconda3\lib\site-packages\torch\nn\functional.py", line 2633, in l1_loss
    expanded_input, expanded_target = torch.broadcast_tensors(input, target)
  File "C:\Anaconda3\lib\site-packages\torch\functional.py", line 71, in broadcast_tensors
    return _VF.broadcast_tensors(tensors)  # type: ignore
RuntimeError: The size of tensor a (235) must match the size of tensor b (240) at non-singleton dimension 2

ghost avatar Feb 24 '21 15:02 ghost

You can try the fix from this fork (line 245).

Replace line 199: val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()

with: val_err_tot += F.l1_loss(y_mel, y_g_hat_mel[:,:,:y_mel.size(2)]).item()
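For completeness, a symmetric variant (a sketch around train.py line 199; variable names as in that file) trims both tensors to their common length instead of assuming which one is longer:

# Trim both spectrograms to the shorter of the two lengths before the
# validation L1 loss.
min_len = min(y_mel.size(2), y_g_hat_mel.size(2))
val_err_tot += F.l1_loss(y_mel[:, :, :min_len],
                         y_g_hat_mel[:, :, :min_len]).item()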

longjoke avatar Jul 30 '21 14:07 longjoke

val_err_tot

This seems to solve the size mismatch between the two mel spectrograms, but when I compare the difference between y_mel and y_g_hat_mel, it is larger than when using mels from audio (rather than mels from Tacotron 2 teacher forcing).

Can you share whether your results are okay?

v-nhandt21 avatar Oct 12 '21 18:10 v-nhandt21

After that, I synthesized mel spectrograms from a Tacotron 2 model trained on my dataset to train with --fine_tuning True and the same parameters, but it fails

Did you get the mel from Tacotron 2 at the training step or at the inference step? As far as I know, it should come from teacher forcing at training time.
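For what it's worth, a sketch of how the teacher-forced mels are typically extracted, assuming NVIDIA's tacotron2 repo (parse_batch and the four-tuple forward output are from that codebase; model and batch are assumed to be a trained model and a batch from its data loader):

import torch

model.eval()
with torch.no_grad():
    x, _ = model.parse_batch(batch)  # batch from the Tacotron 2 loader
    _, mel_postnet, _, _ = model(x)  # the forward pass is teacher-forced
# Under teacher forcing, mel_postnet has the same frame count as the
# ground-truth mel, so the paired audio stays aligned for fine-tuning.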

v-nhandt21 avatar Oct 12 '21 18:10 v-nhandt21

val_err_tot

This seems to solve the size mismatch between the two mel spectrograms, but when I compare the difference between y_mel and y_g_hat_mel, it is larger than when using mels from audio (rather than mels from Tacotron 2 teacher forcing).

Can you share whether your results are okay?

Did you resolve this issue?

Body123 avatar Dec 25 '22 14:12 Body123