FastSpeech2
The duration ground-truth data shouldn't be used for synthesis in the validation step. `duration_target` is not `None` even when the model is called in the validation step, which causes the model to use the ground truth directly instead of its own predicted durations:
```python
if duration_target is not None:
    x, mel_len = self.length_regulator(x, duration_target, max_len)
    duration_rounded = duration_target
else:
    duration_rounded = torch.clamp(
        (torch.round(torch.exp(log_duration_prediction) - 1) * d_control),
        min=0,
    )
    x, mel_len = self.length_regulator(x, duration_rounded, max_len)
    mel_mask = get_mask_from_lengths(mel_len)
```
https://github.com/ming024/FastSpeech2/blob/d4e79eb52e8b01d24703b2dfc0385544092958f3/model/modules.py#L128
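For reference, here is a minimal, self-contained sketch of what the `else` branch above computes at inference time: predicted log-durations are converted back into non-negative integer frame counts. The tensor values below are made up purely for illustration.

```python
import torch

# Hypothetical predicted log-durations for a 5-phoneme input (values are made up).
log_duration_prediction = torch.tensor([1.2, 0.3, -0.5, 2.0, 0.0])
d_control = 1.0  # duration control factor; >1.0 slows speech down, <1.0 speeds it up

# Same transform as the else-branch above: exp() undoes the log taken during
# training, -1 undoes the +1 offset, rounding gives integer frame counts,
# and clamp(min=0) prevents negative durations.
duration_rounded = torch.clamp(
    torch.round(torch.exp(log_duration_prediction) - 1) * d_control,
    min=0,
)
print(duration_rounded)  # e.g. tensor([2., 0., 0., 6., 0.])
```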
We should slice out the `duration_target` data by changing `output = model(*(batch[2:]))` into `output = model(*(batch[2:11]))` here:
https://github.com/ming024/FastSpeech2/blob/d4e79eb52e8b01d24703b2dfc0385544092958f3/evaluate.py#L43
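A rough sketch of how the proposed change would look in the evaluation loop. The loop below is simplified, and the batch layout (duration targets stored as the last element, so that slicing up to index 11 drops them) is an assumption based on the repo's dataset collation; double-check the indices against your copy of `evaluate.py`.

```python
# Simplified evaluation loop (names follow evaluate.py; details may differ).
for batchs in loader:
    for batch in batchs:
        batch = to_device(batch, device)
        with torch.no_grad():
            # Current call: forwards everything after the text fields,
            # including the ground-truth durations, so d_targets is not None.
            # output = model(*(batch[2:]))

            # Proposed call: stop before the duration targets so that
            # d_targets keeps its default of None and the model falls back
            # to its own duration predictor during validation.
            output = model(*(batch[2:11]))

            losses = Loss(batch, output)
```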
I think this is not a bug. During evaluation we still want to calculate the loss on the mel spectrograms, so we should use the ground-truth durations to make sure the model generates predicted mel spectrograms with the same lengths as the ground-truth ones. Otherwise the predictions may have different lengths from the corresponding ground truths, which makes the loss calculation harder.
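To make the length argument concrete: the mel loss is an element-wise comparison, so the predicted and ground-truth spectrograms must share the same number of frames. A minimal sketch below; the shapes and the choice of L1 loss are illustrative assumptions, not the repo's exact loss code.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, mel_frames, n_mels)
mel_target = torch.randn(1, 100, 80)

# With ground-truth durations, the predicted mel has the same number of frames,
# so an element-wise loss is well defined.
mel_pred_teacher_forced = torch.randn(1, 100, 80)
loss = F.l1_loss(mel_pred_teacher_forced, mel_target)  # OK

# With predicted durations the frame count generally differs (e.g. 93 frames),
# and the element-wise loss can no longer be computed directly without first
# truncating, padding, or aligning the two sequences.
mel_pred_free_running = torch.randn(1, 93, 80)
# F.l1_loss(mel_pred_free_running, mel_target)  # shape mismatch error
```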