CTCNet
                                
An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits
Kai Li, Fenghua Xie, Hang Chen, Kexin Yuan, and Xiaolin Hu | Tsinghua University
PyTorch Implementation of CTCNet (TPAMI 2024): An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits.
Audio-visual demos
https://user-images.githubusercontent.com/33806018/208616615-dab6ab87-def1-405a-897e-a3c1decb790a.mp4
Key points
- Greatly improves the performance of multimodal (audio-visual) speech separation.
- Incorporates brain-inspired design, modeled on cortico-thalamo-cortical circuits, into the network to improve model performance.
- Still achieves better results on real-world scenes.
Quick Start
Datasets and Pretrained Models
The multimodal speech separation datasets are built from the LRS2, LRS3, and VoxCeleb2 datasets. The Datasets/ folder in this repository contains the files and code needed to construct them.
The generated datasets (LRS2-2Mix, LRS3-2Mix, and VoxCeleb2-2Mix) and the corresponding pretrained models are listed below. The dataset download links have been removed for copyright reasons.
| Datasets | Links | Pretrained Models | 
|---|---|---|
| LRS2-2Mix | Removed for copyright | Google Drive |
| LRS3-2Mix | Removed for copyright | Google Drive |
| VoxCeleb2-2Mix | Removed for copyright | Google Drive |
Video Pretrained Model
This pretrained lip-reading model is trained on video only and achieves 84% accuracy on the LRW dataset.
| Datasets | Links | Pretrained Models | 
|---|---|---|
| LRS2-2Mix | Removed for copyright | Google Drive |
Dependencies
- torch 1.13.1+cu116
- torchaudio 0.13.1+cu116
- torchvision 0.14.1+cu116
- pytorch-lightning 1.8.4.post0
- torch-mir-eval 0.4
- torch-optimizer 0.3.0
- fast-bss-eval 0.1.4
- pandas 1.5.1
- rich 10.16.2
- opencv-python 4.6.0.66
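Because the versions above are pinned, it can help to confirm what is actually installed before training. The following is a minimal sketch, not part of the repository: it compares the installed distributions against the list above using the standard-library `importlib.metadata`. The helper names `installed_version` and `check_versions` are hypothetical, and the sketch assumes the list entries match the package names used by pip.

```python
from importlib import metadata

# Pinned versions copied from the dependency list above.
# Local build tags such as "+cu116" are stripped before comparison.
REQUIRED = {
    "torch": "1.13.1",
    "torchaudio": "0.13.1",
    "torchvision": "0.14.1",
    "pytorch-lightning": "1.8.4.post0",
    "torch-mir-eval": "0.4",
    "torch-optimizer": "0.3.0",
    "fast-bss-eval": "0.1.4",
    "pandas": "1.5.1",
    "rich": "10.16.2",
    "opencv-python": "4.6.0.66",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_versions(required=REQUIRED, lookup=installed_version):
    """Map each package whose version differs to (expected, found)."""
    mismatches = {}
    for name, expected in required.items():
        found = lookup(name)
        # Drop the local version label (e.g. "+cu116") before comparing.
        base = found.split("+")[0] if found else None
        if base != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_versions().items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every listed package is present at the pinned version. A different CUDA build tag is not reported as a mismatch, since only the base version is compared.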
Preprocess
python preprocess_lrs2.py --in_audio_dir audio/wav16k/min --in_mouth_dir mouths --out_dir data
Training Pipeline
Training on the LRS2
python train.py -c local/lrs2_conf_64_64_3_adamw_1e-1_blocks16_pretrain.yml
Training on the LRS3
python train.py -c local/lrs3_conf_64_64_3_adamw_1e-1_blocks16_pretrain.yml
Training on the VoxCeleb2
python train.py -c local/vox2_conf_64_64_3_adamw_1e-1_blocks16_pretrain.yml
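To train on all three datasets one after another, the commands above can be wrapped in a small driver script. This is a sketch under assumptions: the helpers `train_command` and `train_all` are hypothetical and not part of the repository, and the script is assumed to run from the repository root so that `train.py` and the `local/` configs resolve.

```python
import subprocess
import sys

# Config paths copied from the training commands above.
CONFIGS = {
    "LRS2": "local/lrs2_conf_64_64_3_adamw_1e-1_blocks16_pretrain.yml",
    "LRS3": "local/lrs3_conf_64_64_3_adamw_1e-1_blocks16_pretrain.yml",
    "VoxCeleb2": "local/vox2_conf_64_64_3_adamw_1e-1_blocks16_pretrain.yml",
}


def train_command(dataset):
    """Build the argument list for one training run."""
    return [sys.executable, "train.py", "-c", CONFIGS[dataset]]


def train_all(datasets=tuple(CONFIGS), run=subprocess.run):
    """Run training for each dataset in turn."""
    for dataset in datasets:
        # check=True raises CalledProcessError on a non-zero exit,
        # so later runs do not start after a failed one.
        run(train_command(dataset), check=True)


if __name__ == "__main__":
    train_all()
```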
Testing Pipeline
python eval.py --test=local/data/tt --conf_dir=exp/lrs2_64_64_3_adamw_1e-1_blocks8_pretrain/conf.yml
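This section does not state which metrics `eval.py` reports. Speech separation quality is commonly measured with the scale-invariant signal-to-noise ratio (SI-SNR) and its improvement over the unprocessed mixture, so the sketch below shows that computation for reference. It is written in plain Python for readability and is an illustration, not the repository's evaluation code; the function names are hypothetical.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    # Remove the mean of each signal.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to get the scaled reference.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    s_target = [dot / t_energy * b for b in t]
    # Everything outside the projection counts as noise.
    noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(x * x for x in noise) + eps
    return 10 * math.log10(num / den + eps)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNR gain of the separated estimate over the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is what makes the metric scale-invariant.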
Testing Your Own Videos
ffmpeg -i ./test_videos/interview.mp4 -filter:v fps=fps=25 ./test_videos/interview25fps.mp4
mv ./test_videos/interview25fps.mp4 ./test_videos/interview.mp4
python ./utils/detectFaces.py --video_input_path ./test_videos/interview.mp4 --output_path ./test_videos/interview/ --number_of_speakers 2 --scalar_face_detection 1.5 --detect_every_N_frame 8
ffmpeg -i ./test_videos/interview.mp4 -vn -ar 16000 -ac 1 -ab 192k -f wav ./test_videos/interview/interview.wav
python ./utils/crop_mouth_from_video.py --video-direc ./test_videos/interview/faces/ --landmark-direc ./test_videos/interview/landmark/ --save-direc ./test_videos/interview/mouthroi/ --convert-gray --filename-path ./test_videos/interview/filename_input/interview.csv
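The five commands above resample the video to 25 fps, replace the original file, detect faces, extract 16 kHz mono audio, and crop the mouth regions. As a convenience, they can be generated for any video name with a short script. This is a sketch, not part of the repository: `build_pipeline` and `run_pipeline` are hypothetical helpers, the flags are copied verbatim from the commands above, and the script assumes `ffmpeg`, `mv`, and `python` are on the PATH and that it runs from the repository root.

```python
import subprocess


def build_pipeline(name, root="./test_videos", speakers=2):
    """Return the preprocessing commands for <root>/<name>.mp4, in order."""
    video = f"{root}/{name}.mp4"
    resampled = f"{root}/{name}25fps.mp4"
    out = f"{root}/{name}"
    return [
        # 1. Resample the video to 25 frames per second.
        ["ffmpeg", "-i", video, "-filter:v", "fps=fps=25", resampled],
        # 2. Replace the original file with the resampled one.
        ["mv", resampled, video],
        # 3. Detect and track the speakers' faces.
        ["python", "./utils/detectFaces.py",
         "--video_input_path", video,
         "--output_path", f"{out}/",
         "--number_of_speakers", str(speakers),
         "--scalar_face_detection", "1.5",
         "--detect_every_N_frame", "8"],
        # 4. Extract 16 kHz mono audio as WAV.
        ["ffmpeg", "-i", video, "-vn", "-ar", "16000", "-ac", "1",
         "-ab", "192k", "-f", "wav", f"{out}/{name}.wav"],
        # 5. Crop grayscale mouth regions from the detected faces.
        ["python", "./utils/crop_mouth_from_video.py",
         "--video-direc", f"{out}/faces/",
         "--landmark-direc", f"{out}/landmark/",
         "--save-direc", f"{out}/mouthroi/",
         "--convert-gray",
         "--filename-path", f"{out}/filename_input/{name}.csv"],
    ]


def run_pipeline(name, run=subprocess.run):
    """Execute each step, stopping at the first failure."""
    for cmd in build_pipeline(name):
        run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline("interview")
```

Calling `build_pipeline("interview")` reproduces the commands listed above. Passing a different name or `speakers` value adapts them to another video.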
Acknowledgements
This implementation uses parts of the code from the following GitHub repository, as noted in our code: Asteroid.
Citations
If you find this code useful in your research, please cite our work:
@article{li2024audio,
  title={An audio-visual speech separation model inspired by cortico-thalamo-cortical circuits},
  author={Li, Kai and Xie, Fenghua and Chen, Hang and Yuan, Kexin and Hu, Xiaolin},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  publisher={IEEE}
}