speech_separation_PIT
A simple project to separate mixed voices, using a Permutation Invariant Training (PIT) loss and a pairwise negative SI-SDR loss.
Speech Separation
A simple project that separates a mixture of two clean voices into two individual voices.
Result example (click to hear the voices): mix || predicted voice 1 || predicted voice 2
Mix spectrogram
Predicted voice 1 spectrogram
Predicted voice 2 spectrogram
1. Quick train
Step 1:
Download LibriMixSmall, extract it and move it to the root of the project.
Step 2:
./train.sh
Training takes only about 2-3 hours on an ordinary GPU. After each epoch, predictions are written to the ./viz_outout folder.
2. Quick inference
./inference.sh
The results will be written to the ./viz_outout folder.
3. More details
- Input: the complex spectrogram, computed from the raw mixed audio signal.
- Output: a complex ratio mask (cRM) ---> complex spectrogram ---> separated voices.
- Model: a simplified version of this implementation, described in the paper Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation.
- Loss function: Permutation Invariant Training (PIT) loss with a pairwise negative SI-SDR loss (a stronger, more recent objective); see the sketch after this list.
- Dataset: a small version of the LibriMix dataset, taken from LibriMixSmall.
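To make the pipeline above concrete, here is a minimal, self-contained PyTorch sketch. It is not the repository's actual code: the STFT parameters, tensor shapes, and the random stand-in for the model's cRM prediction are assumptions for illustration; only the overall flow (complex spectrogram ---> cRM ---> separated voices, scored with a two-speaker PIT negative SI-SDR loss) mirrors the description above.

```python
import torch

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB; est and ref have shape (..., time)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    ref_energy = (ref ** 2).sum(dim=-1, keepdim=True) + eps
    proj = ((est * ref).sum(dim=-1, keepdim=True) / ref_energy) * ref
    noise = est - proj
    return 10 * torch.log10(((proj ** 2).sum(-1) + eps) / ((noise ** 2).sum(-1) + eps))

def pit_neg_sisdr(est, ref):
    """Two-speaker PIT: score both speaker orderings, keep the better one.
    est, ref: (batch, 2, time); returns a scalar negative SI-SDR loss."""
    same = si_sdr(est, ref).mean(dim=-1)                    # (est1,ref1), (est2,ref2)
    swapped = si_sdr(est, ref.flip(dims=[1])).mean(dim=-1)  # (est1,ref2), (est2,ref1)
    return -torch.maximum(same, swapped).mean()

# Mixture -> complex spectrogram -> cRM -> masked spectrogram -> waveforms.
batch, n_fft, hop, samples = 4, 512, 128, 32000
mix = torch.randn(batch, samples)                  # stand-in for real 2 s mixes
window = torch.hann_window(n_fft)
spec = torch.stft(mix, n_fft, hop, window=window, return_complex=True)

# A real model would predict one complex ratio mask per speaker from `spec`;
# a random complex tensor stands in here so the sketch runs end to end.
crm = torch.randn(batch, 2, *spec.shape[-2:], dtype=spec.dtype)
masked = crm * spec.unsqueeze(1)                   # (batch, 2, freq, frames)
voices = torch.istft(masked.flatten(0, 1), n_fft, hop, window=window,
                     length=samples).view(batch, 2, samples)

refs = torch.randn(batch, 2, samples)              # ground-truth clean voices
loss = pit_neg_sisdr(voices, refs)
```

For two speakers, PIT reduces to scoring both speaker orderings and keeping the better one, which is what pit_neg_sisdr does explicitly; for more speakers you would search over all permutations.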
4. Current problem
Because the dataset was kept small for fast training, the model overfits the training set somewhat. Using a bigger dataset should help to overcome that. Some suggestions:
- Use the original LibriMix dataset, which is much bigger (around 60 times larger than what I trained on).
- Use this work to download a much larger in-the-wild dataset and use datasets/VoiceMixtureDataset.py instead of the Libri dataset I am using (see the sketch below). P.S. I have trained this way and it works too.
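For the in-the-wild route, the usual approach is to pair random clean utterances and sum them into mixtures on the fly. The sketch below illustrates that idea only; it is hypothetical and does not reflect the actual datasets/VoiceMixtureDataset.py API (class and parameter names are invented).

```python
import random
import torch
import torchaudio
from torch.utils.data import Dataset

class MixtureDataset(Dataset):
    """Hypothetical on-the-fly two-speaker mixture dataset: picks two random
    clean utterances from a file list and sums them into a mix."""
    def __init__(self, wav_paths, segment_len=32000):
        self.wav_paths = wav_paths
        self.segment_len = segment_len

    def __len__(self):
        return len(self.wav_paths)

    def _load_segment(self, path):
        wav, _sr = torchaudio.load(path)              # (channels, time)
        wav = wav.mean(dim=0)                         # downmix to mono
        if wav.shape[-1] < self.segment_len:          # pad clips that are too short
            wav = torch.nn.functional.pad(wav, (0, self.segment_len - wav.shape[-1]))
        start = random.randint(0, wav.shape[-1] - self.segment_len)
        return wav[start:start + self.segment_len]

    def __getitem__(self, idx):
        v1 = self._load_segment(self.wav_paths[idx])
        v2 = self._load_segment(random.choice(self.wav_paths))
        mix = v1 + v2
        return mix, torch.stack([v1, v2])             # mix, (2, time) references
```

A DataLoader over such a dataset yields (mix, references) pairs that plug directly into a PIT loss like the sketch in section 3.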