
There is a contraction in WaveRNNInferenceWrapper during inference when batched

Open · yangarbiter opened this issue 4 years ago · 1 comment

[Image: Screen Shot 2021-08-27 at 10 16 12 AM]

In the screenshot, the upper waveform is the ground truth and the lower one is the waveform generated with `batched` set to true.

See this colab example (internal) for more information.

yangarbiter, Sep 01 '21 00:09

Looking at the code of WaveRNNInferenceWrapper, the same overlap value is used for folding and unfolding.

https://github.com/pytorch/audio/blob/19f53cf249d69c81be3005d582d37530f0a3aef7/examples/pipeline_wavernn/wavernn_inference_wrapper.py#L193-L194

https://github.com/pytorch/audio/blob/19f53cf249d69c81be3005d582d37530f0a3aef7/examples/pipeline_wavernn/wavernn_inference_wrapper.py#L204-L205

But the former operates on a spectrogram while the latter operates on a waveform, and the two do not have the same number of samples along the time axis, even though they should cover a similar wall-time range. I think that is the cause.

In the example notebook, the shapes of the tensors look like the following:

  1. Input spectrogram [1, 80, 164]
  2. Spectrogram after xfold [2, 80, 110]
  3. Output waveform [2, 1, 29150]
  4. Waveform after unfolding [1, 58295]
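
The shapes above can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope check, not the wrapper's code: it assumes unfolding joins consecutive folds so that each pair shares `overlap` samples, and it estimates samples per frame by dividing the fold lengths, ignoring any frames the model may trim at the edges. The variable names are made up for illustration.

```python
# Time-axis sizes reported in the list above.
num_folds = 2
folded_frames = 110        # spectrogram frames per fold
fold_samples = 29150       # waveform samples generated per fold
unfolded_samples = 58295   # waveform samples after unfolding

# Assumption: joining N folds of length L with a shared overlap o gives
# N * L - (N - 1) * o samples, so the overlap used can be solved for.
implied_overlap = (num_folds * fold_samples - unfolded_samples) // (num_folds - 1)

# Rough number of waveform samples per spectrogram frame (an estimate).
samples_per_frame = fold_samples / folded_frames

# If the overlap value counts spectrogram frames when folding, the matching
# overlap on the waveform side would be roughly this many samples.
overlap_in_samples = implied_overlap * samples_per_frame

print(implied_overlap, samples_per_frame, overlap_in_samples)
```

Under these assumptions, the unfolding step appears to overlap the two folds by only a handful of samples, while the same number of spectrogram frames spans a few hundred times more samples. That would be consistent with the mismatch described above.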

mthrok, Oct 09 '21 01:10