speedyspeech icon indicating copy to clipboard operation
speedyspeech copied to clipboard

Duplicated padding when calculating STFT transformation.

Open iclementine opened this issue 3 years ago • 3 comments

I went through the code about spectrogram extraction in the code base and found some duplication at padding.

For a simple case, applying stft to an audio, it is common to use center and reflect padding to simplify the time grid of sample index and frame index. (It's also the default behavior of librosa).

But for a batch of audios. An batched STFT implemented with Convolution, there is not straight way to handle reflect padding for each audio since they are already padded. Since we would have to find the ending of each audio to correctly apply reflect padding at the end.

I noticed that in your code, each audio is first center reflect padded. And then they are padded to the max length and batched. (https://github.com/janvainer/speedyspeech/blob/516eb1c6a992dacf5808e416c389e5322d9e80bf/code/stft.py#L90)

But when passing this batched audio to the STFT module, they are padded again.(https://github.com/janvainer/speedyspeech/blob/516eb1c6a992dacf5808e416c389e5322d9e80bf/code/utils/torch_stft.py#L103)

I think when the audios are mannually reflect padded, they should not be padded again when applying batched STFT. Librosa's stft has an option center=False to disable internal padding.

iclementine avatar Apr 16 '21 07:04 iclementine

Hm it seems like you are right. This may be a bug. Did you compare the results with and without it?

janvainer avatar Apr 16 '21 14:04 janvainer

This may not affect the quality much since the length of padding is very short (about 23ms). I haven't compared the results.

iclementine avatar Apr 16 '21 19:04 iclementine

This could maybe have some impact on the audio quality on the edges of the utterance. I noticed there some edge related problems before. Thanks for pointing this out!

janvainer avatar Apr 18 '21 07:04 janvainer