There is a contraction in WaveRNNInferenceWrapper during inference when batched
 
The waveform above is the ground truth and the one below is the generated waveform with batched set to true.
See this colab example (internal) for more information.
Looking at the code of WaveRNNInferenceWrapper, the same overlap value is used for folding and unfolding.
https://github.com/pytorch/audio/blob/19f53cf249d69c81be3005d582d37530f0a3aef7/examples/pipeline_wavernn/wavernn_inference_wrapper.py#L193-L194
https://github.com/pytorch/audio/blob/19f53cf249d69c81be3005d582d37530f0a3aef7/examples/pipeline_wavernn/wavernn_inference_wrapper.py#L204-L205
But the former is a spectrogram while the latter is a waveform, and they do not have the same number of samples along the time axis, even though they should cover a similar wall-time range. I think that is the cause.
In the example notebook, the shapes of the tensors look like the following:
- Input spectrogram [1, 80, 164]
- Spectrogram after folding [2, 80, 110]
- Output waveform [2, 1, 29150]
- Waveform after unfolding [1, 58295]
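
To make the mismatch concrete, here is a rough sketch of the length arithmetic using only the shapes listed above. The helper `unfolded_length` is hypothetical and not part of the wrapper. The overlap of 5 is an assumption, chosen because it reproduces the reported 58295, and the samples-per-frame ratio is only estimated from the fold shapes, so the model's real hop length may differ.

```python
def unfolded_length(num_folds: int, fold_len: int, overlap: int) -> int:
    """Length after unfolding when adjacent folds share `overlap` steps.

    Hypothetical helper for illustration; not the wrapper's own code.
    """
    return num_folds * (fold_len - overlap) + overlap


# Shapes reported in the notebook.
spec_fold_len = 110            # frames per folded spectrogram
num_folds, wav_fold_len = 2, 29150  # folds and samples per generated fold

# Assumption: the overlap is 5, expressed in spectrogram frames.
overlap_frames = 5

# Reusing the frame-based overlap on the waveform gives the reported length.
current = unfolded_length(num_folds, wav_fold_len, overlap_frames)

# Estimate samples per frame from the fold shapes (an approximation),
# then express the same overlap in waveform samples.
samples_per_frame = wav_fold_len / spec_fold_len
overlap_samples = round(overlap_frames * samples_per_frame)
rescaled = unfolded_length(num_folds, wav_fold_len, overlap_samples)

print(current, overlap_samples, rescaled)
```

Under these assumptions, the waveform folds are joined with an overlap of 5 samples, while the spectrogram folds overlapped by 5 frames, which corresponds to roughly 1325 samples.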