
controllable `WaveformToFbankConverter` multithreading

Open · artemru opened this issue 7 months ago · 0 comments

Describe the bug: WaveformToFbankConverter runs in parallel across multiple threads. This method (and possibly some others) uses a parallel_for construct internally, and there is currently no obvious way to control the number of threads it uses. This can lead to performance problems such as thread/CPU oversubscription. Moreover, it turns out that even when used inside DataPipeline.map(...), it does not respect the requested number of parallel calls.

Describe how to reproduce:

import torch

from fairseq2.data.audio import WaveformToFbankConverter
from fairseq2.data.data_pipeline import read_sequence

_convert_to_fbank = WaveformToFbankConverter(
    num_mel_bins=80,
    waveform_scale=2**15,
    channel_last=True,
    standardize=True,
    device=torch.device("cpu"),
    dtype=torch.float16,
)

def convert_to_fbank(wav):
    return _convert_to_fbank(
        {"waveform": torch.unsqueeze(wav, 1), "sample_rate": 16_000}
    )["fbank"].shape

xx = [torch.rand(10**5) for i in range(100)]
data_pipeline = read_sequence(xx).map(convert_to_fbank, num_parallel_calls=1).and_return()
list(iter(data_pipeline))  # despite num_parallel_calls=1, this typically uses half of the available CPUs
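Until the thread count is controllable from the library itself, one blunt workaround (a sketch, Linux-only, and independent of fairseq2 internals) is to restrict the CPUs the whole process may run on, so that however many threads parallel_for spawns, they cannot oversubscribe the machine. The `limit_cpus` helper below is hypothetical, not part of any library:

```python
import os

def limit_cpus(n: int) -> None:
    """Restrict this process to at most n CPUs (no-op where unsupported)."""
    # sched_setaffinity is Linux-only; on other platforms we do nothing.
    if hasattr(os, "sched_setaffinity"):
        available = sorted(os.sched_getaffinity(0))  # CPUs we may run on now
        os.sched_setaffinity(0, set(available[: max(1, n)]))

limit_cpus(2)  # internal worker threads now share at most 2 CPUs
```

This caps CPU usage rather than thread creation, so oversubscribed threads still exist and may add scheduling overhead; it only bounds the damage.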

Describe the expected behavior:

  • a context manager to control the number of threads the method uses (e.g. with fairseq2_nb_threads(2): ...)
  • make WaveformToFbankConverter respect num_parallel_calls in data pipelining

Environment: fairseq2==0.1.1+cu118 fairseq2n==0.1.1+cu118

artemru · Nov 16 '23 13:11