CommonVoice-TH Recipe
A commonvoice-th recipe for training an ASR engine using Kaldi. This recipe follows the commonvoice
recipe with slight modifications.
Installation
The author uses Docker to run the container. A GPU is required to train the tdnn_chain model; without one, the script can train only up to tri3b.
Downloading Commonvoice Corpus
We will need a commonvoice corpus for training the ASR engine. We use Commonvoice Corpus 7.0 in the Thai language, which can be downloaded here. Once downloaded, extract it; we will mount the dataset to the docker container later.
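For reference, a minimal extraction sketch (the archive name below is an assumption; your download may be named differently):
$ tar -xzf cv-corpus-7.0-2021-07-21-th.tar.gz   # extracts to cv-corpus-7.0-2021-07-21/th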
Downloading SRILM
Before building the docker image, the SRILM source file needs to be downloaded. You can download it from here. Once the file is downloaded, remove the version from its name (e.g. rename srilm-1.7.3.tar.gz to srilm.tar.gz) and place it inside the docker directory. Your docker directory should then contain 2 files: dockerfile and srilm.tar.gz.
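For example, assuming the file was downloaded to the repo root as srilm-1.7.3.tar.gz:
$ mv srilm-1.7.3.tar.gz docker/srilm.tar.gz
$ ls docker   # should list: dockerfile  srilm.tar.gz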
Building Docker for Training with Kaldi
Once you have prepared the SRILM file, you are ready to build the docker image for training this recipe. The build automatically installs the project's dependencies and stores them in an image. To build a docker image, run:
$ cd docker
$ docker build -t <docker-name> kaldi
Run docker and attach command line
Once the image has been built, all you have to do is attach interactively to its bash terminal via the following command:
$ docker run -it -v <path-to-repo>:/opt/kaldi/egs/commonvoice-th \
-v <path-to-repo>/labels:/mnt/labels \
-v <path-to-cv-corpus>:/mnt \
--gpus all --name <container-name> <built-docker-name> bash
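As a concrete example (all names and paths here are illustrative; substitute your own), running from the repo root with the corpus extracted to /data/cv-corpus-7.0-2021-07-21:
$ docker run -it -v $(pwd):/opt/kaldi/egs/commonvoice-th \
             -v $(pwd)/labels:/mnt/labels \
             -v /data/cv-corpus-7.0-2021-07-21:/mnt \
             --gpus all --name cv-th-train commonvoice-th bash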
After this step, you should be inside the docker container's bash terminal.
Building Docker for Inference via Vosk
We also provide an example of how to run inference on a trained Kaldi model using Vosk. Before we begin, let's build the Vosk docker image:
$ cd docker
$ docker build -t <docker-name> vosk-inference
$ cd .. # back to root directory
Preparing Directories for Vosk Inferencing
The first step is to download the provided Vosk-format model from this github's releases and unzip it into the vosk-inference directory, or you can just run the following commands:
$ cd vosk-inference
$ wget https://github.com/vistec-AI/commonvoice-th/releases/download/vosk-v1/model.zip
$ unzip model.zip
Run docker and test inference script
To prevent dependency problems, the Vosk inference python script must be run inside the docker image that we just built. First, let's start a container:
$ docker run -it -v <path-to-repo>:/workspace \
--name <container-name> \
-p 8000:8000 \
<build-docker-name> bash
You will then be attached to a linux terminal inside the container. To transcribe an audio file, run:
$ cd vosk-inference
$ python3.8 inference.py --wav-path <path-to-wav> # test it with test.wav
Note that the audio file must have a 16kHz sampling rate and a mono channel!
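If your audio does not meet these requirements, you can convert it first, e.g. with ffmpeg (assuming ffmpeg is available on your machine; file names are illustrative):
$ ffmpeg -i input.wav -ar 16000 -ac 1 test.wav   # resample to 16 kHz, downmix to mono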
Instantiating a Vosk Server to Process Audio Files
We also provide a fastapi server that allows users to transcribe their own audio files via a RESTful API. To instantiate the server, run this command inside the docker shell:
$ cd vosk-inference
$ uvicorn server:app --host 0.0.0.0 --reload
The server will now be running at http://localhost:8000. To see if the server is correctly instantiated, try browsing http://localhost:8000/healthcheck. If the webpage loads, then we are good to go!
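You can also check from the command line:
$ curl http://localhost:8000/healthcheck   # a successful response means the server is up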
API Endpoint
The endpoint accepts form-data requests where each file is attached to a form field named audios. See the python example below:
import requests

# each audio file is attached to the form field named "audios"
url = "http://localhost:8000/transcribe"
files = [
    ('audios', (<file-name>, open(<file-path>, 'rb'), 'audio/wav')),
    ...
]
response = requests.post(url, files=files)
print(response.text)
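An equivalent request with curl (the file names here are illustrative):
$ curl -X POST http://localhost:8000/transcribe \
       -F 'audios=@first.wav;type=audio/wav' \
       -F 'audios=@second.wav;type=audio/wav'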
Online Decoding with WebRTC Protocol
Read more at this repository. It contains an easy way to deploy a Kaldi tdnn-chain model to a webRTC server.
Usage
To run the training pipeline, go to the recipe directory and run the run.sh script:
$ cd /opt/kaldi/egs/commonvoice-th/s5
$ ./run.sh --stage 0
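The --stage argument lets you resume an interrupted run from a later stage instead of starting over; for example (the stage number below is illustrative; see run.sh for the actual stage boundaries):
$ ./run.sh --stage 4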
Experiment Results
Here are some experiment results evaluated on the dev set:
Model | dev WER | dev CER | dev-unique WER | dev-unique CER
---|---|---|---|---
mono | 79.13% | 57.31% | 77.79% | 48.97%
tri1 | 56.55% | 37.88% | 53.26% | 27.99%
tri2b | 50.64% | 32.85% | 47.38% | 21.89%
tri3b | 50.52% | 32.70% | 47.06% | 21.67%
tri4b | 46.81% | 29.47% | 43.18% | 18.05%
tdnn-chain | 29.15% | 14.96% | 30.84% | 8.75%
tdnn-chain-online | 29.02% | 14.64% | 30.41% | 8.28%
Here is the final test set result, evaluated on the tdnn-chain model and compared against other ASR systems:
Model | test WER | test CER | test-unique WER | test-unique CER
---|---|---|---|---
tdnn-chain-online | 9.71% | 3.12% | 23.04% | 7.57%
airesearch/wav2vec2-xlsr-53-th | - | - | 13.63% | 2.81%
Google Web Speech API | - | - | 13.71% | 7.36%
Microsoft Bing Search API | - | - | 12.58% | 5.01%
Amazon Transcribe | - | - | 21.86% | 7.08%
Author
Chompakorn Chaksangchaichot