Visual Speech Recognition for Multiple Languages

📘Introduction | 📝License | 🛠️Installation | 👄Recognition | 🐯Model Zoo | 📧Contact
Authors
Pingchuan Ma, Stavros Petridis, Maja Pantic.
Introduction
This is the repository of Visual Speech Recognition for Multiple Languages, which is the successor of End-to-End Audio-Visual Speech Recognition with Conformers. The repository is mainly based on ESPnet. We provide state-of-the-art algorithms for end-to-end visual speech recognition in the wild.
Major features
- Modular Design: the repository is composed of face tracking, pre-processing, and acoustic/visual encoder backbones.
- Support of Benchmarks for Speech Recognition: our models provide state-of-the-art performance on speech recognition datasets.
- Support of Extraction of Representations or Mouth Regions of Interest: our models directly support extraction of speech representations or mouth regions of interest (ROIs).
- Support of Recognition of Your Own Videos: we provide support for performing visual speech recognition on your own videos.
Demo
Demo videos: English -> Mandarin -> Spanish | French -> Portuguese -> Italian
Installation
How to Install the Environment
- Clone the repository into a directory. We refer to that directory as ${lipreading_root}.
git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
- Install PyTorch (>=1.8.0).
- Install other packages.
pip install -r requirements.txt
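For reference, a minimal end-to-end setup sketch, assuming a fresh conda environment (the environment name and Python version below are arbitrary choices, and the exact PyTorch install command depends on your CUDA/CPU setup; see pytorch.org for the right one):
conda create -y -n lipreading python=3.8
conda activate lipreading
git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
cd Visual_Speech_Recognition_for_Multiple_Languages
pip install torch torchvision
pip install -r requirements.txt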
How to Prepare Models and Landmarks
- Model. Download a model from Model Zoo (a sketch of the unpacking steps is shown after this list).
  - For models trained on the CMU-MOSEAS dataset, which contains multiple languages, please unzip them into ${lipreading_root}/models/${dataset}/${language_code} (e.g. ${lipreading_root}/models/CMUMOSEAS/pt).
  - For models trained on a dataset with one language, please unzip them into ${lipreading_root}/models/${dataset}.
- Language Model. The performance can be improved in most cases by incorporating an external language model. Please download a language model from Model Zoo.
  - For a language model trained on the CMU-MOSEAS dataset, please unzip it into ${lipreading_root}/language_models/${dataset}/${language_code}.
  - For a language model trained on a dataset with one language, please unzip it into ${lipreading_root}/language_models/${dataset}.
- Tracker [optional]. If you intend to test your own videos, additional packages for face detection and face alignment need to be pre-installed; they are provided in the tools folder.
- Landmarks [optional]. If you want to evaluate on benchmarks, there is no need to install the tracker. Please download pre-computed landmarks from Model Zoo and unzip them into ${lipreading_root}/landmarks/${dataset}.
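As an illustration of the expected layout, unpacking a CMU-MOSEAS Portuguese model, its language model, and the corresponding landmarks might look like the sketch below (the archive names are placeholders; use the files you actually downloaded from Model Zoo):
mkdir -p ${lipreading_root}/models/CMUMOSEAS/pt
unzip model_CMUMOSEAS_pt.zip -d ${lipreading_root}/models/CMUMOSEAS/pt
mkdir -p ${lipreading_root}/language_models/CMUMOSEAS/pt
unzip lm_CMUMOSEAS_pt.zip -d ${lipreading_root}/language_models/CMUMOSEAS/pt
mkdir -p ${lipreading_root}/landmarks/CMUMOSEAS
unzip CMUMOSEAS_landmarks.zip -d ${lipreading_root}/landmarks/CMUMOSEAS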
Recognition
Generic Options
- We refer to a path name (.ini) that includes configuration information as <CONFIG-FILENAME-PATH>. We put configuration files in ${lipreading_root}/configs by default.
- We refer to a path name (.ref) that includes labels information as <LABELS-FILENAME-PATH>.
  - For the CMU-MOSEAS dataset and the Multilingual TEDx dataset, which include multiple languages, we put labels files (.ref) in ${lipreading_root}/labels/${dataset}/${language_code}.
  - For datasets with one language, we put labels files in ${lipreading_root}/labels/${dataset}.
- We refer to the original dataset directory as <DATA-DIRECTORY-PATH>, and to the path name of a single original video as <DATA-FILENAME-PATH>.
- We refer to the landmarks directory as <LANDMARKS-DIRECTORY-PATH>. We assume the default directory is ${lipreading_root}/landmarks/${dataset}/${dataset}_landmarks.
- We use CPU for inference by default. If you want to speed up the decoding process, please consider
  - adding a command-line argument for the GPU option (e.g. --gpu-idx <GPU_ID>), where <GPU_ID> is the ID of your selected GPU, a 0-based integer (see the example after this list);
  - setting beam_size in the configuration file (.ini) <CONFIG-FILENAME-PATH> to a small value (e.g. 5) in case your maximum GPU memory is exceeded.
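For example, to run the benchmark evaluation on the first GPU, append the GPU flag to the usual command:
python main.py --config-filename <CONFIG-FILENAME-PATH> \
               --labels-filename <LABELS-FILENAME-PATH> \
               --data-dir <DATA-DIRECTORY-PATH> \
               --landmarks-dir <LANDMARKS-DIRECTORY-PATH> \
               --gpu-idx 0
If the run exceeds your GPU memory, lower beam_size in <CONFIG-FILENAME-PATH> (e.g. to 5) as noted above.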
How to Test
- We assume original videos from the desired dataset have been downloaded to the dataset directory <DATA-DIRECTORY-PATH> and landmarks have been unzipped to the landmarks directory ${lipreading_root}/landmarks/${dataset}.
- The frame rate (fps) of your video should match v_fps in the configuration file (one way to check and adjust it is sketched after this list).
- To evaluate the performance on the desired dataset:
python main.py --config-filename <CONFIG-FILENAME-PATH> \
--labels-filename <LABELS-FILENAME-PATH> \
--data-dir <DATA-DIRECTORY-PATH> \
--landmarks-dir <LANDMARKS-DIRECTORY-PATH>
- To lip read from a single video file:
python main.py --config-filename <CONFIG-FILENAME-PATH> \
--data-filename <DATA-FILENAME-PATH>
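One way to check and, if necessary, adjust the frame rate of a video before recognition is with ffprobe/ffmpeg. These tools are not part of this repository, the output file name below is a placeholder, and the 25 fps target is only an example; use the v_fps value from your configuration file:
ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate \
        -of default=noprint_wrappers=1:nokey=1 <DATA-FILENAME-PATH>
ffmpeg -i <DATA-FILENAME-PATH> -filter:v fps=25 resampled_video.mp4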
How to Extract Mouth ROIs
- Mouth ROIs can be extracted by setting <FEATS-POSITION> to mouth. The mouth ROIs will be saved to <OUTPUT-FILENAME-PATH> with the .avi file extension.
- The ${lipreading_root}/outputs folder can be used to save the mouth ROIs.
- To extract mouth ROIs from the desired dataset:
python main.py --labels-filename <LABELS-FILENAME-PATH> \
--data-dir <DATA-DIRECTORY-PATH> \
--landmarks-dir <LANDMARKS-DIRECTORY-PATH> \
--dst-dir <OUTPUT-DIRECTORY-PATH> \
--feats-position <FEATS-POSITION>
- To extract mouth ROIs from a single video file:
python main.py --data-filename <DATA-FILENAME-PATH> \
--dst-filename <OUTPUT-FILENAME-PATH> \
--feats-position <FEATS-POSITION>
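As a concrete sketch (the input and output file names below are placeholders), extracting the mouth ROI from a single clip into the outputs folder could look like:
python main.py --data-filename clips/example.mp4 \
               --dst-filename outputs/example_mouth_roi.avi \
               --feats-position mouth
The resulting .avi holds the cropped mouth region and can be opened with any standard video player.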
How to Extract Speech Representations
- Speech representations can be extracted from the top of the ResNet-18 (512-D) or the Conformer (256-D) by setting <FEATS-POSITION> to resnet or conformer, respectively. The representations will be saved to <OUTPUT-DIRECTORY-PATH> or <OUTPUT-FILENAME-PATH> with the .npz file extension.
- The ${lipreading_root}/outputs folder can be used to save the speech representations.
- To extract speech representations from the desired dataset:
python main.py --config-filename <CONFIG-FILENAME-PATH> \
--labels-filename <LABELS-FILENAME-PATH> \
--data-dir <DATA-DIRECTORY-PATH> \
--landmarks-dir <LANDMARKS-DIRECTORY-PATH> \
--dst-dir <OUTPUT-DIRECTORY-PATH> \
--feats-position <FEATS-POSITION>
- To extract speech representations from a single video file:
python main.py --config-filename <CONFIG-FILENAME-PATH> \
--data-filename <DATA-FILENAME-PATH> \
--dst-filename <OUTPUT-FILENAME-PATH> \
--feats-position <FEATS-POSITION>
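To sanity-check the saved representations, you can list the arrays stored in the .npz with NumPy; the key names inside the archive are not documented here, so the one-liner below simply prints whatever keys and shapes are present (the output file name is a placeholder):
python -c "import numpy as np; data = np.load('outputs/example.npz'); print({k: data[k].shape for k in data.files})"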
Model Zoo
Overview
We support a number of datasets for speech recognition:
- [x] Lip Reading Sentences 2 (LRS2)
- [x] Lip Reading Sentences 3 (LRS3)
- [x] Chinese Mandarin Lip Reading (CMLR)
- [x] CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)
- [x] GRID
- [x] Lombard GRID
- [x] TCD-TIMIT
Evaluation
We provide landmarks, language models, and models for each dataset. Please see the models page for details.
Citation
If you find this code useful in your research, please consider citing the following papers:
@article{ma2022visual,
title={{Visual Speech Recognition for Multiple Languages in the Wild}},
author={Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
journal={{arXiv Preprint: 2202.13084}},
year={2022}
}
License
The code can only be used for comparative or benchmarking purposes. Code supplied under the License may be used for non-commercial purposes only.
Contact
[Pingchuan Ma](pingchuan.ma16[at]imperial.ac.uk)

