hifi-gan
Can't train with output of Tacotron 2
The mel spectrograms output by Tacotron 2 are longer than the mels extracted from the audio, and training fails with this error:
File "train.py", line 113, in train
for i, batch in enumerate(train_loader):
File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
return [default_collate(samples) for samples in transposed]
File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 8192 and 8119 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:612
I think this is more likely to be an incorrectly padded audio tensor. There shouldn't ever be a spectrogram that long being collated.
That said, I'm looking at the code and I can't figure out how this issue can occur. If you're able to provide more details, that would help a lot.
Which tacotron2 repo are you using, and what hop_length are you using for both tacotron2 and hifi-gan?
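If it helps to narrow this down, here is a minimal diagnostic sketch (the wav/npy paths and the mel file layout are assumptions, adjust them to however you dumped the Tacotron 2 mels) that compares a generated mel's frame count against the frame count implied by the audio length and the hop size; a difference of more than one or two frames usually points at mismatched hop/window settings between the two repos, or at mels that were not generated with teacher forcing:

import numpy as np
from scipy.io.wavfile import read

hop_size = 256  # must match the hop used by BOTH tacotron2 and hifi-gan

# hypothetical paths for one utterance -- adjust to your dataset layout
sr, audio = read("wavs/LJ001-0001.wav")
mel = np.load("mels/LJ001-0001.npy")

# accept either a (80, T) or a (T, 80) layout
n_frames = mel.shape[-1] if mel.shape[0] == 80 else mel.shape[0]
implied = len(audio) // hop_size

print("mel frames:", n_frames, "frames implied by audio:", implied,
      "diff:", n_frames - implied)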
I used the parameters below to train successfully on my audio data (see the quick consistency check after the list):
"segment_size": 8192,
"num_mels": 80,
"num_freq": 513,
"n_fft": 1024,
"hop_size": 256,
"win_size": 1024,
After that, I synthesized mel spectrograms from the Tacotron 2 model on my dataset to train with --fine_tuning True and the same parameters, but it fails with:
Using a target size (torch.Size([1, 80, 759])) that is different to the input size (torch.Size([1, 80, 815])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()
To fix this error I added padding in train.py:
# zero-pad the shorter mel along the frame axis so both have the same length
if y_mel.size(2) > y_g_hat_mel.size(2):
    y_g_hat_mel = torch.nn.functional.pad(y_g_hat_mel, (0, y_mel.size(2) - y_g_hat_mel.size(2)), 'constant')
elif y_mel.size(2) < y_g_hat_mel.size(2):
    y_mel = torch.nn.functional.pad(y_mel, (0, y_g_hat_mel.size(2) - y_mel.size(2)), 'constant')
And then it fails with the collate error shown at the top of this issue.
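For what it's worth, the 8192 vs 8119 collate error is what you would expect when a Tacotron 2 mel is longer than the audio implies: the dataset picks a random segment by mel-frame index, and the matching audio slice then runs past the end of the wav and comes out short. One workaround is to force every (wav, mel) pair to be consistent before fine-tuning. A rough sketch, assuming the wavs and the Tacotron 2 mels live in hypothetical wavs/ and mels/ directories, share basenames, and the mels are stored as (80, T):

import glob
import os

import numpy as np
from scipy.io.wavfile import read, write

hop_size = 256
os.makedirs("wavs_aligned", exist_ok=True)

for mel_path in glob.glob("mels/*.npy"):
    name = os.path.splitext(os.path.basename(mel_path))[0]
    mel = np.load(mel_path)
    sr, audio = read(os.path.join("wavs", name + ".wav"))

    target_len = mel.shape[-1] * hop_size   # samples the mel actually accounts for
    if len(audio) < target_len:
        # pad the wav so every mel frame has audio underneath it
        audio = np.pad(audio, (0, target_len - len(audio)))
    else:
        # trim the extra samples the mel does not cover
        audio = audio[:target_len]
    write(os.path.join("wavs_aligned", name + ".wav"), sr, audio)

Then train against the aligned wavs instead of the originals.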
I have a very similar issue. I generated the mel files using tacotron2.
train.py:199: UserWarning: Using a target size (torch.Size([1, 80, 240])) that is different to the input size (torch.Size([1, 80, 235])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()
Traceback (most recent call last):
File "train.py", line 271, in <module>
main()
File "train.py", line 267, in main
train(0, a, h)
File "train.py", line 199, in train
val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()
File "C:\Anaconda3\lib\site-packages\torch\nn\functional.py", line 2633, in l1_loss
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
File "C:\Anaconda3\lib\site-packages\torch\functional.py", line 71, in broadcast_tensors
return _VF.broadcast_tensors(tensors) # type: ignore
RuntimeError: The size of tensor a (235) must match the size of tensor b (240) at non-singleton dimension 2
You can try the fix from this fork (line 245).
In line 199, replace:
val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()
with:
val_err_tot += F.l1_loss(y_mel, y_g_hat_mel[:,:,:y_mel.size(2)]).item()
This seems to solve the size mismatch between the two mel spectrograms, but when I compare the difference between y_mel and y_g_hat_mel, it is larger than when using mels extracted from the audio (rather than teacher-forced mels from Tacotron 2).
Can you share whether your results turned out okay or not?
I synthesized mel spectrograms from the Tacotron 2 model on my dataset to train with --fine_tuning True and the same parameters, but it fails with an error.
Is the mel you get from Tacotron 2 from the training step or the inference step? As far as I know, it should come from teacher forcing during training.
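For reference, a minimal sketch of dumping teacher-forced mels; it assumes the NVIDIA tacotron2 repo's interface (parse_batch and a forward pass that returns the postnet mel), so adjust it to whichever repo you actually trained with. The point is to run the training-style forward pass, which conditions the decoder on the ground-truth mel, rather than model.inference:

import numpy as np
import torch

# assumes `model` is a trained NVIDIA-style Tacotron 2 and `loader` iterates the
# training set with batch_size=1 (so the output is not padded to a batch maximum)
model.eval()
with torch.no_grad():
    for i, batch in enumerate(loader):
        x, y = model.parse_batch(batch)      # x carries text AND the ground-truth mel
        _, mel_postnet, _, _ = model(x)      # teacher-forced forward pass
        # in practice, save under the utterance's original basename so hifi-gan
        # can pair each mel with its wav; the running index is just a placeholder
        np.save(f"mels/{i}.npy", mel_postnet[0].cpu().numpy())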
Did you resolve this issue?