Add a DNN beamformer training pipeline to demonstrate usage of torchaudio.transforms.MVDR
🚀 The feature
torchaudio recently added a mask-based MVDR beamforming module, which takes a multi-channel noisy STFT and estimated time-frequency masks as input and produces a single-channel enhanced STFT as output. Thanks to PyTorch's complex tensor support, it can be integrated into mask-based multi-channel speech enhancement models to enable end-to-end training.
Adding an end-to-end training pipeline will help users understand how to use the MVDR module, and will provide a baseline framework that researchers can easily compare their novel methods against.
Motivation, pitch
To make the training pipeline runnable for every user, an open-source dataset is a fair choice. We propose to use the dataset from Task 1 of the L3DAS challenge.
The training pipeline is as follows:
Spectrogram() -> MaskGenerator in ConvTasNet -> MVDR() -> InverseSpectrogram() -> Loss function()
The inputs are the multi-channel noisy waveforms and the output is the single-channel enhanced waveform.
The loss function can be scale-invariant signal-to-distortion ratio (SI-SDR), or the recently proposed convolutive transfer function invariant SDR (CI-SDR) when the target clean speech is not aligned with the far-field noisy speech.
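For reference, SI-SDR can be written in a few lines of PyTorch. This is a textbook-style sketch of the metric as a loss (negated so that minimizing it maximizes SI-SDR), not the exact implementation the pipeline would ship:

```python
import torch

def si_sdr_loss(estimate: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    """Negative SI-SDR, averaged over the batch. Shapes: (..., time)."""
    # Remove DC offset so the projection is well defined
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    reference = reference - reference.mean(dim=-1, keepdim=True)

    # Project the estimate onto the reference (optimal scaling factor)
    alpha = (estimate * reference).sum(dim=-1, keepdim=True) / \
            reference.pow(2).sum(dim=-1, keepdim=True)
    target = alpha * reference
    noise = estimate - target

    si_sdr = 10 * torch.log10(
        target.pow(2).sum(dim=-1) / noise.pow(2).sum(dim=-1)
    )
    return -si_sdr.mean()
```

By construction the loss is invariant to rescaling the estimate, which is the property that motivates using SI-SDR over plain SDR.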
Many thanks to @sw005320, @Emrys365, @popcornell for their research advice and great help!