hifi-gan Pad audio fragment

Pad audio fragment

Open Alexey322 opened this issue 3 years ago • 2 comments

Why do we need to pad the audio fragment when computing its mel spectrogram?

y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')

Jul 23 '21 15:07 Alexey322

Hi @Alexey322

I think the author pads the audio so that the STFT (short-time Fourier transform) covers all frames of the input audio segment.

spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[str(y.device)],
                      center=center, pad_mode='reflect', normalized=False, onesided=True)

You can check the torch.stft function from the API doc for more details.

Jul 24 '21 11:07 leminhnguyen

Hi @leminhnguyen.

Thanks for your reply. Why can't we just align the fragment size with the convolutions? With the v1 configuration, 29 mel frames correspond to 8192 samples, so what's the point of adding redundant data?

Jul 27 '21 12:07 Alexey322
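
For reference, the frame counts discussed in this thread can be checked with a little arithmetic. This is only a minimal sketch, not code from the repository: `num_frames` is a hypothetical helper, and `n_fft = 1024` and `hop_size = 256` are assumed values, chosen because they are consistent with the 29-frame figure quoted above. It also assumes the STFT is computed with `center=False`, so the only padding is the explicit reflect pad from the first comment.

```python
def num_frames(num_samples: int, n_fft: int, hop_size: int, pad: int = 0) -> int:
    """Count STFT frames when no centering is applied (center=False).

    `pad` is the number of samples added on each side before framing.
    """
    padded = num_samples + 2 * pad
    if padded < n_fft:
        return 0
    return 1 + (padded - n_fft) // hop_size


# Assumed values (not stated in the thread), consistent with
# "29 mels correspond to 8192 samples".
n_fft, hop_size, segment = 1024, 256, 8192

# Same amount per side as in the pad call quoted in the issue.
pad = (n_fft - hop_size) // 2

unpadded = num_frames(segment, n_fft, hop_size)
padded = num_frames(segment, n_fft, hop_size, pad)

print(unpadded, unpadded * hop_size)  # frames without the pad, and frames * hop
print(padded, padded * hop_size)      # frames with the pad, and frames * hop
```

Under these assumptions the unpadded segment gives 29 frames (29 * 256 = 7424), while the padded one gives 32 frames (32 * 256 = 8192). So if a model turns each mel frame into `hop_size` output samples, the pad is what makes the output length match the 8192-sample segment.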
