fairseq2
Using WaveformToFbankConverter with variable sample rates is impossible
Describe the bug: The converter locks onto the first sample rate it is fed and refuses to convert audio with any other sample rate.
Describe how to reproduce:
import torch
from fairseq2.data.audio import WaveformToFbankConverter
# Because the two converters are initialized identically, I expect them to behave identically
converter1 = WaveformToFbankConverter()
converter2 = WaveformToFbankConverter()
# Define two equivalent audios; the second is the first, downsampled.
input1 = {
"waveform": torch.randn([2, 90_000]),
"sample_rate": 48000,
"format": -1,
}
input2 = {
"waveform": input1['waveform'][:, ::3],
"sample_rate": 16000,
"format": -1,
}
converted1_1 = converter1(input1)
converted2_2 = converter2(input2)
# the above conversions work fine, just as expected
# expect the same output as converted2_2
converted1_2 = converter1(input2)
# ValueError: The input waveform must have a sample rate of 48000, but has a sample rate of 16000 instead.
# expect the same output as converted1_1
converted2_1 = converter2(input1)
# ValueError: The input waveform must have a sample rate of 16000, but has a sample rate of 48000 instead.
Describe the expected behavior: This implicit dependence on the first input is unexpected; a more appropriate behavior would be either to require the desired sample rate explicitly when initializing the converter, or to accept inputs with any sample rate.
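A possible workaround until this is resolved: keep one converter per sample rate, since each instance locks onto the first rate it sees. Below is a minimal sketch of this pattern; `PerRateConverter` and `DummyConverter` are hypothetical names, and the factory would be `WaveformToFbankConverter` in real use (a stand-in is used here so the sketch runs without fairseq2).

```python
# Hypothetical workaround: cache one converter instance per sample rate,
# since each instance locks onto the first sample rate it is given.
class PerRateConverter:
    def __init__(self, converter_factory):
        # converter_factory builds a fresh converter,
        # e.g. WaveformToFbankConverter in real use.
        self._factory = converter_factory
        self._converters = {}

    def __call__(self, data):
        rate = data["sample_rate"]
        if rate not in self._converters:
            self._converters[rate] = self._factory()
        return self._converters[rate](data)

# Stand-in converter that mimics the reported behavior: it remembers the
# first sample rate it sees and rejects any other.
class DummyConverter:
    def __init__(self):
        self.locked_rate = None

    def __call__(self, data):
        if self.locked_rate is None:
            self.locked_rate = data["sample_rate"]
        if self.locked_rate != data["sample_rate"]:
            raise ValueError("sample rate mismatch")
        return data["sample_rate"]

wrapper = PerRateConverter(DummyConverter)
print(wrapper({"sample_rate": 48000}))  # 48000
print(wrapper({"sample_rate": 16000}))  # 16000, no ValueError
```

This avoids the error at the cost of holding one converter per distinct sample rate in the dataset.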
Environment: Python 3.8, fairseq2==0.2.0, PyTorch 2.1.1+cu118. I believe the environment is irrelevant to this bug.