ASR_SemanticMask

This repo contains our code for "Semantic Mask for Transformer based End-to-End Speech Recognition".
Preparation
We provide a runnable Docker image. Run the following command to download the image and start a container:
docker run -it --volume-driver=nfs --shm-size=64G j4ckl1u/espnet-py36-img:latest /bin/bash
For data preparation, please follow the ESPnet instructions. Note that the default ESPnet recipe does not apply speed perturbation; we strongly recommend enabling it, since it gives better performance on the dev-other and test-other sets.
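As a rough illustration of the usual 3-way speed perturbation (factors 0.9/1.0/1.1), the sketch below perturbs a single waveform with torchaudio's sox effects. The actual recipe perturbs whole Kaldi-style data directories, so treat this only as a sketch of the transform, not the recipe's mechanism; the factors and function name are illustrative.

```python
import torchaudio

def speed_perturb(wav_path, factors=(0.9, 1.0, 1.1)):
    """Return 3-way speed-perturbed copies of one waveform (illustrative only)."""
    waveform, sample_rate = torchaudio.load(wav_path)
    perturbed = []
    for factor in factors:
        # "speed" changes tempo and pitch; "rate" resamples back to the original rate.
        effects = [["speed", str(factor)], ["rate", str(sample_rate)]]
        out, _ = torchaudio.sox_effects.apply_effects_tensor(waveform, sample_rate, effects)
        perturbed.append(out)
    return perturbed
```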
Word Alignment
To enable semantic mask training, you first have to align the audio with its word transcription.
In our work, we use the alignment results released by this repo, which were obtained with the Montreal Forced Aligner. The extracted information is stored in the data directory: start.txt and end.txt record, for each word in each utterance, the start and end positions of its alignment in frames.
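A minimal sketch of how such word-level alignments can drive semantic masking is shown below. It assumes the word boundaries have already been parsed from start.txt/end.txt into (start_frame, end_frame) pairs; the 15% masking probability and filling masked spans with the utterance mean are illustrative choices here, not necessarily the repo's exact settings.

```python
import numpy as np

def apply_semantic_mask(features, word_spans, mask_prob=0.15, rng=None):
    """Mask the frames of randomly chosen words in a (time, dim) feature matrix.

    word_spans: list of (start_frame, end_frame) pairs, one per word, as read
    from start.txt / end.txt. mask_prob and the fill value (utterance mean)
    are illustrative, not the repo's exact configuration.
    """
    rng = rng or np.random.default_rng()
    masked = features.copy()
    fill = features.mean(axis=0)          # fill masked spans with the utterance mean
    for start, end in word_spans:
        if rng.random() < mask_prob:      # each word is masked independently
            masked[start:end] = fill
    return masked
```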
Training and decoding
For training, we upload our training configs to the configs folder, covering a base setting and a large setting. Our architecture is similar to ESPnet's, but replaces the positional embedding with a CNN in both the encoder and the decoder. The specific code changes can be found here.
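To make the architectural change concrete, here is a minimal sketch of the general idea, a temporal convolution whose output is added to the input in place of sinusoidal positional embeddings. The kernel size, depthwise grouping, and activation below are illustrative assumptions and are not taken from the repo's implementation.

```python
import torch
import torch.nn as nn

class ConvPositionalEncoding(nn.Module):
    """Inject positional information via a depthwise temporal convolution.

    Kernel size and activation are illustrative; the repo's exact layer
    configuration may differ (e.g. causal convolutions in the decoder).
    """
    def __init__(self, d_model, kernel_size=15, dropout=0.1):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                 # x: (batch, time, d_model)
        pos = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.dropout(x + torch.relu(pos))
```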
For decoding, please first download the ESPnet pre-trained RNN language model, and then run our decoding script to obtain the model output.
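During beam search, the RNN language model is combined with the ASR model by shallow fusion, i.e. a weighted sum of log-probabilities. The sketch below shows one expansion step; lm_weight and beam_size are illustrative values, the real ones come from the decoding config.

```python
import torch

def shallow_fusion_step(asr_logprobs, lm_logprobs, lm_weight=0.6, beam_size=10):
    """One beam-search expansion step with shallow LM fusion.

    asr_logprobs / lm_logprobs: (vocab,) log-probability vectors for the next
    token from the ASR decoder and the RNN LM. lm_weight and beam_size are
    illustrative, not the repo's exact decoding settings.
    """
    combined = asr_logprobs + lm_weight * lm_logprobs
    topk_scores, topk_tokens = combined.topk(beam_size)
    return topk_scores, topk_tokens
```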
Pre-trained Models
We release a base model (12 encoder layers and 6 decoder layers) and a large model (24 encoder layers and 12 decoder layers). They achieve the following word error rates (%) on the LibriSpeech dev and test sets with shallow language model fusion.
| Model | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|
| Base | 2.07 | 5.06 | 2.31 | 5.21 |
| Large | 2.02 | 4.91 | 2.19 | 5.19 |