Output audio duration does not exactly match input audio.

Open · shawnbzhang opened this issue Dec 20 '20 · 9 comments

Running inference with your pre-trained models, I found that the generated audio does not exactly match the input audio in duration. For example,

wav, sr = load_wav(os.path.join(a.input_wavs_dir, filname))
wav = wav / MAX_WAV_VALUE
wav = torch.FloatTensor(wav).to(device)  # wav shape is torch.Size([71334])
x = get_mel(wav.unsqueeze(0))  # x shape is torch.Size([1, 80, 278])
y_g_hat = generator(x)  # y_g_hat shape is torch.Size([1, 1, 71168])

As you can see, there is a mismatch between the 71334 input samples and the 71168 output samples. What is happening, and why is this the case? Is there a way I can change it so that the input and output lengths match?

Thank you.

Edit: I was checking training, and if the target segment_size is a multiple of 256 (hop_size), then y_g_hat = generator(x) has exactly the same number of samples as the input.

shawnbzhang avatar Dec 20 '20 22:12 shawnbzhang
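
A minimal sketch of the fix noted in the edit above, assuming hop_size = 256 (config_v1.json) and the load_wav / get_mel helpers from the repo's inference.py: trimming the waveform to a whole multiple of hop_size makes the generator output length match the trimmed input exactly.

# Hedged sketch (not from the thread): trim the input so its length is a
# multiple of hop_size; then the generator output matches it exactly.
hop_size = 256                               # from config_v1.json
wav, sr = load_wav(os.path.join(a.input_wavs_dir, filname))
wav = wav / MAX_WAV_VALUE
wav = torch.FloatTensor(wav).to(device)
n = (wav.shape[0] // hop_size) * hop_size    # 71334 -> 71168
wav = wav[:n]                                # drop the trailing partial frame
x = get_mel(wav.unsqueeze(0))                # [1, 80, 278]
y_g_hat = generator(x)                       # [1, 1, 71168], matches len(wav)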

I bring this issue up because, given the model's real-time capabilities, the mismatch may pose a problem when streaming the input.

shawnbzhang avatar Dec 20 '20 22:12 shawnbzhang

This mismatch is caused by the padding and the transposed convolutions. You should set segment_size % hop_size == 0: an input of segment_size + (n_fft - hop_size) samples yields exactly segment_size / hop_size mel-spectrogram frames. In other words, one frame represents hop_size sampling points.

Miralan avatar Dec 21 '20 01:12 Miralan
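
To make that arithmetic concrete, here is a self-contained sketch of the length bookkeeping; n_fft = 1024 and the reflect padding of (n_fft - hop_size) // 2 per side are assumptions taken from the repo's meldataset.py and config_v1.json.

# Hedged sketch of the length bookkeeping for the example above.
n_fft, hop_size = 1024, 256
L = 71334                                    # input samples from the example
pad = (n_fft - hop_size) // 2                # 384 per side, as in meldataset.py
frames = (L + 2 * pad - n_fft) // hop_size + 1   # 278 frames (= L // hop_size)
out = frames * hop_size                      # 71168: one frame -> hop_size samples
print(frames, out, L - out)                  # 278 71168 166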

@Miralan Thank you for the response, but I'm a bit confused. I understand that segment_size % hop_size == 0 will make the input and generated output waveforms match lengths. Is there a way to do this in the general inference.py, or should I just zero-pad the input so that segment_size % hop_size == 0?

shawnbzhang avatar Dec 21 '20 01:12 shawnbzhang

When you run inference, whether from wav or from mel, you do not need segment_size % hop_size == 0; it does not matter. Zero-padding so that segment_size is a multiple of hop_size is also fine when training. But I think a segment of 71168 samples is too big; it may take a lot of GPU memory, which forces a smaller batch_size.

Miralan avatar Dec 21 '20 02:12 Miralan
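
For the zero-padding option, a short sketch using standard torch.nn.functional (not repo code): pad the waveform at the end so its length becomes a multiple of hop_size.

# Hedged sketch: zero-pad (rather than trim) up to the next frame boundary.
import torch.nn.functional as F
hop_size = 256
rem = wav.shape[0] % hop_size                # wav as in the snippet above
if rem != 0:
    wav = F.pad(wav, (0, hop_size - rem))    # append zeros at the end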

If by streaming you mean immediately feeding a portion of the first-stage model's output to HiFi-GAN, you could add padding to match the length of the output audio, but the synthesized audio will then contain a break where the padding was. I would recommend instead cutting the first-stage model's output mel-spectrogram to match the desired output audio length before feeding it to HiFi-GAN.

jik876 avatar Dec 21 '20 06:12 jik876
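
A rough sketch of that recommendation; mel_chunk and target_samples are hypothetical names, and hop_size = 256 is assumed: truncate the first-stage mel to the whole frames that fit the desired length before vocoding.

# Hedged sketch: cut the first-stage mel so the vocoder output has the
# desired number of samples (mel_chunk and target_samples are hypothetical).
hop_size = 256
n_frames = target_samples // hop_size        # keep whole frames only
y = generator(mel_chunk[:, :, :n_frames])    # -> [1, 1, n_frames * hop_size]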

@jik876 For some more context, I am doing research on neural voice conversion, which is why I was really impressed with your non-autoregressive vocoder. In a streaming context, ideally a 10 ms chunk of my source speaker's audio would translate to a 10 ms chunk of the generated speaker's audio. Therefore, I guess it makes sense for me to stream in source inputs with chunk_size % hop_size == 0 to get the corresponding outputs. What do you think?

Is that right? And again, thank you for your work and insight.

shawnbzhang avatar Dec 21 '20 07:12 shawnbzhang

Thank you. It is correct to adjust chunk_size so that it is divisible by hop_size. I don't know what sample rate you're using, but a 10 ms chunk seems too short to generate high-quality audio, considering the receptive field of the generator.

jik876 avatar Dec 22 '20 06:12 jik876
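
A naive chunked-inference sketch, for illustration only: it keeps each chunk a multiple of hop_size but ignores the receptive-field issue raised above, so chunk boundaries may be audible without overlap or state caching; mel and generator are assumed from the repo's inference context.

# Hedged sketch of naive streaming: feed the vocoder mel chunks whose
# sample counts are multiples of hop_size. Boundary artifacts are likely
# without overlap handling; chunk_frames is an illustrative value.
hop_size = 256
chunk_frames = 32                            # ~0.37 s at 22050 Hz, well above 10 ms
out_chunks = []
with torch.no_grad():
    for start in range(0, mel.shape[2], chunk_frames):
        m = mel[:, :, start:start + chunk_frames]
        out_chunks.append(generator(m))      # each -> chunk_frames * hop_size samples
audio = torch.cat(out_chunks, dim=2)         # concatenated output stream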

@shawnbzhang have you solved it? If so, how?

yBeOne avatar Dec 23 '20 01:12 yBeOne

It seems impossible to use a 10 ms chunk. To my knowledge, if we use sr = 22050, then 10 ms is only about 220 samples, which is even smaller than the window length.

v-nhandt21 avatar Nov 11 '21 04:11 v-nhandt21