Why do you use interpolation in the pitch feature?
Hey, I didn't find any answer to this question. I have seen a few implementations of the pitch feature (with/without quantisation, with/without an unvoiced predictor), but all of them use interpolation at some point and I don't understand what for. If anyone knows, I would love to understand.
I looked at the implementation here and at the FastPitch paper, and I don't understand how the use of interpolation doesn't ruin the pitch prediction for the unvoiced parts of the audio. It seems that FastSpeech 2 with FastPitch's pitch predictor doesn't care whether the value of the target pitch was 0 before the interpolation, although that is very important for the audio.
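For reference, my understanding of what the interpolation does is roughly this: unvoiced frames (where the pitch tracker returns 0) are filled by linearly interpolating between the surrounding voiced frames, so the target contour becomes continuous. This is only an illustrative NumPy sketch, not the actual espnet code:

```python
import numpy as np

def to_continuous_f0(f0: np.ndarray) -> np.ndarray:
    """Fill unvoiced (zero) frames by linear interpolation between voiced frames.

    Illustrative only; the real extractor may handle edge cases differently.
    """
    f0 = f0.copy()
    voiced = f0 > 0
    if not voiced.any():
        return f0  # fully unvoiced utterance: nothing to interpolate
    idx = np.arange(len(f0))
    # Gaps between voiced frames are filled linearly; leading/trailing zeros
    # simply copy the first/last voiced value.
    f0[~voiced] = np.interp(idx[~voiced], idx[voiced], f0[voiced])
    return f0

# Example: the zeros in the middle are replaced by a straight line
# between 120 Hz and 180 Hz, and the leading zero copies the first voiced value.
print(to_continuous_f0(np.array([0.0, 120.0, 0.0, 0.0, 180.0, 170.0])))
# -> [120. 120. 140. 160. 180. 170.]
```

With such a target, the pitch predictor never sees a 0 for unvoiced regions, which is exactly what worries me.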
Actually, I've never strictly compared raw f0 with continuous f0 in the token-averaged case. There is an option to change the f0 type, so you can try it.
https://github.com/espnet/espnet/blob/6a46986cab33e401aaf730f066aabfbf4c719090/espnet2/tts/feats_extract/dio.py#L47
If you report the effectiveness, that would be great.
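For example, switching off continuous f0 when instantiating the Dio extractor would look roughly like this. This is a sketch assuming the constructor arguments shown below; please double-check the exact names against the linked line:

```python
from espnet2.tts.feats_extract.dio import Dio

# Assumed parameter names; verify against the constructor in dio.py before use.
pitch_extractor = Dio(
    fs=22050,
    n_fft=1024,
    hop_length=256,
    f0min=80,
    f0max=400,
    use_token_averaged_f0=True,   # average f0 over each token's duration
    use_continuous_f0=False,      # keep 0 for unvoiced frames instead of interpolating
    use_log_f0=True,              # use log-f0 for voiced frames
)
```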
I'm working on a different repo, so it will be hard for me to report, but I still don't understand: isn't it a problem to use continuous f0? When does the model learn to distinguish between unvoiced and voiced parts?
Since we use data-driven durations derived from attention, the correspondence between the voiced/unvoiced decision and the unvoiced consonants is not perfect. Therefore, even if we exclude the unvoiced part when calculating token-averaged f0, as in FastPitch, I think the value for unvoiced consonants will not be 0. For that reason, I think there is no significant difference between continuous f0 and non-continuous f0. Instead of F0, the phoneme representation may inform the model of the difference between unvoiced and voiced parts.
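To illustrate the point about token averaging: when the attention-derived duration boundaries do not line up exactly with the voiced/unvoiced boundaries, a token that is nominally an unvoiced consonant usually still covers a few voiced (or interpolated) frames, so its average is not 0 anyway. A rough sketch, assuming per-token durations given in frames:

```python
import numpy as np

def token_averaged_f0(f0: np.ndarray, durations: list[int]) -> np.ndarray:
    """Average frame-level f0 over each token's duration (FastPitch-style).

    Zero (unvoiced) frames are excluded from each token's average; a token
    with no voiced frames gets 0. Illustrative sketch only.
    """
    averages = []
    start = 0
    for d in durations:
        seg = f0[start:start + d]
        voiced = seg[seg > 0]
        averages.append(voiced.mean() if len(voiced) else 0.0)
        start += d
    return np.array(averages)

# An "unvoiced" token whose attention-derived duration leaks into neighbouring
# voiced frames still receives a nonzero average.
f0 = np.array([150.0, 155.0, 0.0, 0.0, 0.0, 160.0, 165.0, 170.0])
durations = [2, 4, 2]  # e.g. vowel, unvoiced consonant (imprecise boundaries), vowel
print(token_averaged_f0(f0, durations))
# -> [152.5 160.  167.5]
```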