
Pad audio fragment

Open Alexey322 opened this issue 3 years ago • 2 comments

Why do we need to pad the audio fragment when computing its mel spectrogram?

y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')

Alexey322 avatar Jul 23 '21 15:07 Alexey322

Hi @Alexey322

I think the author used padding so that the STFT (short-time Fourier transform) covers all frames of the input audio segment.

spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[str(y.device)],
                      center=center, pad_mode='reflect', normalized=False, onesided=True)

You can check the torch.stft entry in the PyTorch API docs for more details.
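To make the effect of the padding concrete, here is a rough sketch of the frame arithmetic. It assumes `center=False` and the values n_fft=1024, hop_size=256, and a segment of 8192 samples, which I believe match the v1 config but are assumptions here. The helper names are made up for illustration and are not part of the repo.

```python
# Frame-count arithmetic for an STFT with center=False.
# Assumed values (believed to match config_v1.json, not confirmed in this thread):
N_FFT = 1024
HOP_SIZE = 256
SEGMENT_SIZE = 8192


def reflect_pad_per_side(n_fft: int, hop_size: int) -> int:
    """Padding added to each side, same expression as in the quoted pad call."""
    return int((n_fft - hop_size) / 2)


def num_frames(num_samples: int, n_fft: int, hop_size: int) -> int:
    """Number of STFT frames when no centering is applied (hypothetical helper)."""
    if num_samples < n_fft:
        return 0
    return 1 + (num_samples - n_fft) // hop_size


pad = reflect_pad_per_side(N_FFT, HOP_SIZE)
frames_unpadded = num_frames(SEGMENT_SIZE, N_FFT, HOP_SIZE)
frames_padded = num_frames(SEGMENT_SIZE + 2 * pad, N_FFT, HOP_SIZE)

print(pad)              # 384
print(frames_unpadded)  # 29
print(frames_padded)    # 32, i.e. SEGMENT_SIZE // HOP_SIZE
```

Under these assumptions the padded segment gives exactly one mel frame per hop (8192 / 256 = 32), while the unpadded segment gives only 29.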

leminhnguyen avatar Jul 24 '21 11:07 leminhnguyen

Hi @leminhnguyen.

Thanks for your reply. Why can't we just align the fragment size with convolutions? With the v1 configuration, 29 mel frames correspond to 8192 samples, so what's the point of adding redundant data?

Alexey322 avatar Jul 27 '21 12:07 Alexey322