
Using WaveformToFbankConverter with variable sample rates is impossible

avidale opened this issue on Feb 19, 2024

Describe the bug: The converter seems to stick to the first sample rate that it is fed, and refuses to convert audio with any other sample rate.

Describe how to reproduce:

import torch
from fairseq2.data.audio import WaveformToFbankConverter

# Because the two converters are initialized identically, I expect them to behave identically
converter1 = WaveformToFbankConverter()
converter2 = WaveformToFbankConverter()

# Define two equivalent audio inputs; the second is the first, downsampled 3x.
input1 = {
    "waveform": torch.randn([2, 90_000]),
    "sample_rate": 48000,
    "format": -1,
}
input2 = {
    "waveform": input1['waveform'][:, ::3],
    "sample_rate": 16000,
    "format": -1,
}

converted1_1 = converter1(input1)
converted2_2 = converter2(input2)
# the above conversions work fine, just as expected

# expect the same output as converted2_2
converted1_2 = converter1(input2) 
# ValueError: The input waveform must have a sample rate of 48000, but has a sample rate of 16000 instead.

# expect the same output as converted1_1
converted2_1 = converter2(input1) 
# ValueError: The input waveform must have a sample rate of 16000, but has a sample rate of 48000 instead.

Describe the expected behavior: This implicit dependence on the first input is unexpected; a more appropriate behavior would be either to specify the desired sample rate explicitly when initializing the converter, or to support inputs with any sample rate.
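In the meantime, a user-side workaround sketch: keep one converter per sample rate. This relies only on the behavior observed above (each instance locks onto the first rate it sees); _converters and convert_any_rate are names I made up for illustration, not fairseq2 API.

from fairseq2.data.audio import WaveformToFbankConverter

# Hypothetical workaround: cache one converter per distinct sample
# rate, since each instance appears to lock onto the first rate fed to it.
_converters = {}

def convert_any_rate(example):
    # "example" uses the same dict layout as in the repro above.
    rate = example["sample_rate"]
    if rate not in _converters:
        _converters[rate] = WaveformToFbankConverter()
    return _converters[rate](example)

# Both repro inputs now go through without the ValueError:
# convert_any_rate(input1)
# convert_any_rate(input2)

This produces the same features as converted1_1 and converted2_2 above, at the cost of one converter object per distinct sample rate.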

Environment: Python 3.8, fairseq2==0.2.0, PyTorch 2.1.1+cu118. But I believe this is irrelevant.

