hifi-gan
Pad audio fragment
Why do we need to pad the audio fragment when computing its mel spectrogram?
y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
Hi @Alexey322
I think the author used the padding so that the STFT (short-time Fourier transform) covers all frames of the input audio segment, i.e. so the number of spectrogram frames lines up with the segment length divided by the hop size:
spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[str(y.device)],
center=center, pad_mode='reflect', normalized=False, onesided=True)
You can check the torch.stft function in the PyTorch API docs for more details.
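Here is a minimal sketch of what that padding does to the frame count. I'm assuming v1-style values (n_fft=1024, hop_size=256, win_size=1024, segment length 8192); these are assumptions for illustration, not quoted from the config:

import torch

n_fft, hop_size, win_size = 1024, 256, 1024   # assumed v1-style values
y = torch.randn(1, 8192)                       # one 8192-sample audio segment
window = torch.hann_window(win_size)

# Without padding and with center=False: (8192 - 1024) // 256 + 1 = 29 frames
spec_no_pad = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size,
                         window=window, center=False, return_complex=True)

# With (n_fft - hop_size) / 2 reflect padding on both sides: 8192 // 256 = 32 frames
pad = (n_fft - hop_size) // 2
y_padded = torch.nn.functional.pad(y.unsqueeze(1), (pad, pad), mode='reflect').squeeze(1)
spec_pad = torch.stft(y_padded, n_fft, hop_length=hop_size, win_length=win_size,
                      window=window, center=False, return_complex=True)

print(spec_no_pad.shape[-1], spec_pad.shape[-1])  # 29 vs 32

So without the padding the last samples of the segment never get a full analysis window, and you end up with fewer spectrogram frames than samples / hop_size.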
Hi @leminhnguyen.
Thanks for your reply. Why can't we just align the fragment size with the convolutions? With the v1 configuration, 8192 samples correspond to 29 mel frames without padding, so what's the point of adding redundant data?
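For reference, a quick back-of-the-envelope check of those numbers, assuming v1-style values (n_fft=1024, hop_size=256, segment_size=8192, upsample_rates=[8, 8, 2, 2]); these are assumptions for illustration, not quoted from the config:

# Assumed v1-style config values, used only to check the frame counts above.
n_fft, hop_size, segment_size = 1024, 256, 8192
upsample_rates = [8, 8, 2, 2]

# The generator expands each mel frame by the product of its upsample rates.
total_upsampling = 1
for r in upsample_rates:
    total_upsampling *= r
assert total_upsampling == hop_size  # 8 * 8 * 2 * 2 == 256

# Without padding (center=False), an 8192-sample segment gives 29 frames,
# which the generator would turn into 29 * 256 = 7424 samples, not 8192.
frames_unpadded = (segment_size - n_fft) // hop_size + 1                      # 29
# With (n_fft - hop_size) / 2 reflect padding on both sides it gives 32 frames,
# and 32 * 256 == 8192, matching the length of the ground-truth waveform.
frames_padded = (segment_size + (n_fft - hop_size) - n_fft) // hop_size + 1   # 32
print(frames_unpadded * total_upsampling, frames_padded * total_upsampling)   # 7424 8192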