FastSpeech2

The ground-truth duration data shouldn't be used for synthesis in the validation step.

chenming6615 opened this issue 3 years ago · 1 comment

duration_target is not None even when the model is called in the validation step, so the model uses the ground-truth durations directly instead of its own predictions during validation.

if duration_target is not None:
    x, mel_len = self.length_regulator(x, duration_target, max_len)
    duration_rounded = duration_target
else:
    duration_rounded = torch.clamp(
        (torch.round(torch.exp(log_duration_prediction) - 1) * d_control),
        min=0,
    )
    x, mel_len = self.length_regulator(x, duration_rounded, max_len)
    mel_mask = get_mask_from_lengths(mel_len)

https://github.com/ming024/FastSpeech2/blob/d4e79eb52e8b01d24703b2dfc0385544092958f3/model/modules.py#L128
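
For context, here is a rough sketch of how the positional arguments line up when the whole batch is forwarded. The collated batch layout below is my reading of dataset.py, so treat the exact field order as an assumption to double-check:

# Assumed collated batch layout (from dataset.py, please verify):
# batch = (ids, raw_texts, speakers, texts, text_lens, max_text_len,
#          mels, mel_lens, max_mel_len, pitches, energies, durations)
output = model(*(batch[2:]))
# expands positionally to roughly:
# model(speakers, texts, text_lens, max_text_len,
#       mels, mel_lens, max_mel_len,
#       pitches, energies, durations)
# The last positional argument is the ground-truth durations, so duration_target
# inside the variance adaptor is not None and the teacher-forced branch above is taken.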

We should slice off the duration_target data by changing output = model(*(batch[2:])) to output = model(*(batch[2:11])) here.

https://github.com/ming024/FastSpeech2/blob/d4e79eb52e8b01d24703b2dfc0385544092958f3/evaluate.py#L43
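
A minimal sketch of what the change would look like in the evaluation loop. The surrounding loop is paraphrased from evaluate.py, and index 11 for the durations is an assumption based on the batch layout above:

for batchs in loader:
    for batch in batchs:
        batch = to_device(batch, device)
        with torch.no_grad():
            # Drop the ground-truth durations (assumed to be batch[11]) so that
            # the duration target defaults to None and the model falls back to
            # its predicted durations.
            output = model(*(batch[2:11]))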

chenming6615 · Feb 08 '22


I think this is not a bug. During evaluation, we still want to calculate the loss on the mel spectrograms, so we should use the ground-truth durations to make sure the model generates predicted mel spectrograms with the same lengths as the ground-truth mel spectrograms. Otherwise the predictions may have different lengths from their corresponding ground truths, which makes the loss calculation harder.
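
To illustrate that point with a toy example (hypothetical shapes, not code from this repo): an element-wise mel loss only works when the predicted and target spectrograms have the same number of frames, which teacher-forced durations guarantee:

import torch
import torch.nn.functional as F

mel_target = torch.randn(1, 120, 80)        # ground-truth mel: 120 frames, 80 bins

# With ground-truth durations the prediction has exactly 120 frames.
mel_pred_teacher = torch.randn(1, 120, 80)
loss = F.l1_loss(mel_pred_teacher, mel_target)   # works

# With predicted durations the length can differ, e.g. 113 frames.
mel_pred_free = torch.randn(1, 113, 80)
# F.l1_loss(mel_pred_free, mel_target)  # shape mismatch -> runtime error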

unrea1-sama · May 06 '22