
Preparation Script for Training on Mozilla CommonVoice

RuntimeRacer opened this pull request 1 year ago · 17 comments

This PR provides an end-to-end preparation script for Mozilla CommonVoice.

I built it by copying over the scripts from AIShell and combining them with the preparation scripts for CommonVoice found in icefall, which also uses Lhotse. A minimal usage sketch follows the references below. References:

  • https://github.com/k2-fsa/icefall/tree/master/egs/commonvoice/ASR
  • https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/commonvoice.py
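
For orientation, here is a minimal sketch of how the Lhotse recipe linked above is typically invoked. It is not the exact code in this PR, and the parameter names (corpus_dir, output_dir, languages, num_jobs) follow lhotse.recipes.commonvoice at the time of writing and may differ between versions; the paths and language subset are placeholders.

```python
# Minimal sketch, not the PR script itself: build CommonVoice recording
# and supervision manifests per language with Lhotse's recipe.
# Paths and the language subset below are illustrative placeholders.
from lhotse.recipes import prepare_commonvoice

manifests = prepare_commonvoice(
    corpus_dir="data/cv-corpus",   # extracted CommonVoice archives
    output_dir="data/manifests",   # where the manifest files are written
    languages=["de", "en", "fr"],  # the PR script covers 24 languages
    num_jobs=4,                    # parallelize parsing of the TSV metadata
)
# `manifests` is roughly a dict keyed by language and split, holding
# RecordingSet / SupervisionSet pairs for downstream cut creation.
```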

Some additional info and stats:

  • The data for the 24 languages included in the script (the full CV corpus offers even more) comes to 432 GB; downloading and extracting the archives took about 12 hours on my 200 Mbps connection, using a RAID 0 volume of two PCIe x4 M.2 SSDs.
  • Preparation and tokenization took about another day.
  • I had to cut down the train/dev datasets of all the downloaded languages to 400 samples each from their dev and train subsets; otherwise they would have become too big and the validation step would stall in a loop. Even with the resulting 9,600 cuts per dev/train set, it takes about 30 seconds to compute the validation loss. If you want to train on a smaller subset of languages, you may want to increase that number or use the complete train/dev sets of those languages (see the subsetting sketch after this list).
  • I was able to run training fine with up to 5 GPUs; however, there still seems to be a bug in the validation calculation (https://github.com/lifeiteng/vall-e/issues/86), which forced me to use only 1 card for now, and I hit an OOM error (https://github.com/lifeiteng/vall-e/issues/110) after ~164k steps, probably because a max-duration of 80 is too high for this dataset (running on an RTX 3090 24GB). Lowering it is also shown in the sketch after this list.
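
Below is a minimal sketch of the two mitigations mentioned above: trimming a per-language manifest to 400 cuts, and capping the total audio per batch via max_duration. This is not code from this PR; the manifest filename is a hypothetical placeholder, and it assumes Lhotse's CutSet.subset and DynamicBucketingSampler APIs as found in recent Lhotse versions.

```python
# Sketch only, not the PR's script: trim a per-language dev manifest to
# 400 cuts and build a sampler with a lower max_duration to reduce OOM risk.
# "cuts_dev_de.jsonl.gz" is a hypothetical manifest filename.
from lhotse import CutSet
from lhotse.dataset.sampling import DynamicBucketingSampler

dev_cuts = CutSet.from_file("data/manifests/cuts_dev_de.jsonl.gz")
dev_cuts = dev_cuts.subset(first=400)  # keep 400 cuts, as done per language here

sampler = DynamicBucketingSampler(
    dev_cuts,
    max_duration=60.0,  # seconds of audio per batch; below the 80 that hit OOM
    shuffle=False,
)
```

Whatever value works will depend on GPU memory; the point is that max_duration bounds the total seconds of audio per batch, so lowering it is the first knob to try against OOM.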

Since I have not finished training yet, I cannot provide any sample models, results, or stats at this point.

RuntimeRacer · May 01 '23 17:05