ASR_SemanticMask

This repo contains our code for "Semantic Mask for Transformer based End-to-End Speech Recognition".
Preparation
We provide a runnable Docker image. Run the following command to download the image and start a container:
docker run -it --volume-driver=nfs --shm-size=64G j4ckl1u/espnet-py36-img:latest /bin/bash
For data preparation, please follow the ESPnet instructions. Note that the default ESPnet recipe does not apply speed perturbation; we strongly recommend enabling it, since it gives better performance on the dev-other and test-other sets.
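As a rough illustration of the usual 3-way speed perturbation (factors 0.9/1.0/1.1), the sketch below perturbs a single waveform with torchaudio's sox effects. The actual recipe perturbs whole Kaldi-style data directories, so treat this only as a sketch of the transform, not the recipe's mechanism; the factors and function name are illustrative.

```python
import torchaudio

def speed_perturb(wav_path, factors=(0.9, 1.0, 1.1)):
    """Return 3-way speed-perturbed copies of one waveform (illustrative only)."""
    waveform, sample_rate = torchaudio.load(wav_path)
    perturbed = []
    for factor in factors:
        # "speed" changes tempo and pitch; "rate" resamples back to the original rate.
        effects = [["speed", str(factor)], ["rate", str(sample_rate)]]
        out, _ = torchaudio.sox_effects.apply_effects_tensor(waveform, sample_rate, effects)
        perturbed.append(out)
    return perturbed
```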
Word Alignment
To enable semantic mask training, you first have to align the audio with its word transcription.
In our work, we use the alignment results released by this repo, which were obtained with the Montreal Forced Aligner. The extracted information is stored in the data directory: start.txt and end.txt record, for each word in each utterance, the start and end positions of its alignment in frames.
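A minimal sketch of how such word-level alignments can drive semantic masking is shown below. It assumes the word boundaries have already been parsed from start.txt/end.txt into (start_frame, end_frame) pairs; the 15% masking probability and filling masked spans with the utterance mean are illustrative choices here, not necessarily the repo's exact settings.

```python
import numpy as np

def apply_semantic_mask(features, word_spans, mask_prob=0.15, rng=None):
    """Mask the frames of randomly chosen words in a (time, dim) feature matrix.

    word_spans: list of (start_frame, end_frame) pairs, one per word, as read
    from start.txt / end.txt. mask_prob and the fill value (utterance mean)
    are illustrative, not the repo's exact configuration.
    """
    rng = rng or np.random.default_rng()
    masked = features.copy()
    fill = features.mean(axis=0)          # fill masked spans with the utterance mean
    for start, end in word_spans:
        if rng.random() < mask_prob:      # each word is masked independently
            masked[start:end] = fill
    return masked
```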
Training and decoding
For training, we upload our training configs to the configs folder, covering a base setting and a large setting. Our architecture is similar to ESPnet's, but replaces the positional embedding with a CNN in both the encoder and the decoder. The specific code changes can be found here.
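To make the architectural change concrete, here is a minimal sketch of the general idea, a temporal convolution whose output is added to the input in place of sinusoidal positional embeddings. The kernel size, depthwise grouping, and activation below are illustrative assumptions and are not taken from the repo's implementation.

```python
import torch
import torch.nn as nn

class ConvPositionalEncoding(nn.Module):
    """Inject positional information via a depthwise temporal convolution.

    Kernel size and activation are illustrative; the repo's exact layer
    configuration may differ (e.g. causal convolutions in the decoder).
    """
    def __init__(self, d_model, kernel_size=15, dropout=0.1):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                 # x: (batch, time, d_model)
        pos = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.dropout(x + torch.relu(pos))
```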
For decoding, please first download the ESPnet pre-trained RNN language model, and then run our decoding script to obtain the model output.
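During beam search, the RNN language model is combined with the ASR model by shallow fusion, i.e. a weighted sum of log-probabilities. The sketch below shows one expansion step; lm_weight and beam_size are illustrative values, the real ones come from the decoding config.

```python
import torch

def shallow_fusion_step(asr_logprobs, lm_logprobs, lm_weight=0.6, beam_size=10):
    """One beam-search expansion step with shallow LM fusion.

    asr_logprobs / lm_logprobs: (vocab,) log-probability vectors for the next
    token from the ASR decoder and the RNN LM. lm_weight and beam_size are
    illustrative, not the repo's exact decoding settings.
    """
    combined = asr_logprobs + lm_weight * lm_logprobs
    topk_scores, topk_tokens = combined.topk(beam_size)
    return topk_scores, topk_tokens
```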
Pre-trained Models
We release a base model (12 encoder layers and 6 decoder layers) and a large model (24 encoder layers and 12 decoder layers). They achieve the following word error rates (%) on the LibriSpeech dev and test sets with shallow language model fusion.
| Model | dev-clean | dev-other | test-clean | test-other |
|---|---|---|---|---|
| Base | 2.07 | 5.06 | 2.31 | 5.21 |
| Large | 2.02 | 4.91 | 2.19 | 5.19 |