
Add a DNN beamformer training pipeline to demonstrate usage of torchaudio.transforms.MVDR

nateanl opened this issue on Nov 09, 2021 · 0 comments

🚀 The feature

torchaudio recently added a mask-based MVDR beamforming module, which takes the multi-channel noisy STFT and the estimated time-frequency masks as input and generates the single-channel enhanced STFT as output. Thanks to PyTorch's complex tensor support, it can be integrated into mask-based multi-channel speech enhancement models to enable end-to-end training.
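
For illustration, a minimal sketch of how the module is called (assuming torchaudio ≥ 0.10, where `MVDR` and complex `Spectrogram` output are available; the waveform and masks below are dummy data standing in for a real recording and a mask estimation network):

```python
import torch
import torchaudio

# Complex STFT (power=None keeps the complex-valued spectrogram).
stft = torchaudio.transforms.Spectrogram(n_fft=1024, hop_length=256, power=None)
mvdr = torchaudio.transforms.MVDR(ref_channel=0, solution="ref_channel")

waveform = torch.randn(6, 16000)      # dummy 6-channel, 1-second noisy recording
specgram = stft(waveform)             # complex STFT: (channel, freq, time)

# In practice the masks come from a neural network; random masks are used here
# only to show the expected (freq, time) shapes.
mask_speech = torch.rand(specgram.shape[-2:])
mask_noise = torch.rand(specgram.shape[-2:])

enhanced_specgram = mvdr(specgram, mask_speech, mask_noise)  # (freq, time), complex
```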

Adding an end-to-end training pipeline will help users understand how to use the MVDR module, and will provide a baseline framework that researchers can easily compare their new methods against.

Motivation, pitch

To make the training pipeline runnable for every user, an open-source dataset is a fair choice. We propose using the Task 1 dataset of the L3DAS challenge.

The training pipeline is as follows:

Spectrogram() -> MaskGenerator in ConvTasNet -> MVDR() -> InverseSpectrogram() -> Loss function()

The inputs are the multi-channel noisy waveforms and the output is the single-channel enhanced waveform, as sketched below.
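
A rough sketch of how these pieces could be wired together (the `mask_net` module and its two-mask output interface are placeholders for the ConvTasNet-style mask generator, not an existing torchaudio API):

```python
import torch
import torchaudio


class MVDRBeamformer(torch.nn.Module):
    """Spectrogram -> mask estimator -> MVDR -> InverseSpectrogram."""

    def __init__(self, mask_net: torch.nn.Module, n_fft: int = 1024, hop_length: int = 256):
        super().__init__()
        self.stft = torchaudio.transforms.Spectrogram(
            n_fft=n_fft, hop_length=hop_length, power=None
        )
        self.istft = torchaudio.transforms.InverseSpectrogram(
            n_fft=n_fft, hop_length=hop_length
        )
        self.mask_net = mask_net  # placeholder: returns (speech mask, noise mask)
        self.mvdr = torchaudio.transforms.MVDR(ref_channel=0, solution="ref_channel")

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, channel, time) multi-channel noisy speech
        specgram = self.stft(waveform)  # (batch, channel, freq, time), complex
        # Estimate masks from the magnitude of the reference channel.
        mask_speech, mask_noise = self.mask_net(specgram[:, 0].abs())
        enhanced = self.mvdr(specgram, mask_speech, mask_noise)  # (batch, freq, time)
        return self.istft(enhanced, waveform.shape[-1])          # (batch, time)
```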

The loss function can be the Scale-Invariant Signal-to-Distortion Ratio (Si-SDR), or the recently proposed Convolutive transfer function Invariant Signal-to-Distortion Ratio (Ci-SDR) if the target clean speech is not aligned with the far-field noisy speech.
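
For reference, the Si-SDR loss is simple enough to sketch directly (a minimal implementation assuming time-domain `(batch, time)` waveforms; Ci-SDR additionally estimates a short convolutive filter toward the reference, which is not shown here):

```python
import torch


def si_sdr_loss(estimate: torch.Tensor, reference: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative Si-SDR, averaged over the batch; inputs are (batch, time) waveforms."""
    # Remove the mean so the measure is invariant to DC offsets.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    reference = reference - reference.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference (scale-invariant target component).
    scale = (estimate * reference).sum(dim=-1, keepdim=True) / (
        reference.pow(2).sum(dim=-1, keepdim=True) + eps
    )
    target = scale * reference
    noise = estimate - target
    si_sdr = 10 * torch.log10(
        target.pow(2).sum(dim=-1) / (noise.pow(2).sum(dim=-1) + eps) + eps
    )
    return -si_sdr.mean()
```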

Many thanks to @sw005320, @Emrys365, @popcornell for their research advice and great help!
