icefall
Questions about modifying prepare.sh for training ASR model on custom data
Hi, opening a new issue since the old one has been closed.
We are currently writing our own prepare.sh to train an ASR model on our own Chinese audio data, following the example of Aishell's prepare.sh. Given our lack of experience, we are unsure about some parts of it. Our questions are below:
- What role does `vocab_sizes` play, and how should we decide what value to assign to it? Do we need it at all?
- Looking at stage 5 to stage 8 of Aishell's `prepare.sh`, from what I can tell, we need to replace `aishell_transcript_v0.8.txt` (line 151) with our own `text` file, correct? Other than that, is there anything else we need to modify to prepare our own data during these stages?
- We currently have only a few hundred audio files for training. How do you suggest we divide the data into training and test sets? I'm thinking of using most or possibly all of them for training, and few or even none of them for the test set.
- Just to confirm, can we remove the part related to Whisper large-v3 at the end of `prepare.sh`, since we are not using Whisper?
- We plan to use the `lexicon.txt` file from Aishell, but we noticed that certain words that are important to us are missing from it. For example, we want to add the word "对的". Is it necessary to add it to `lexicon.txt`, given that the Aishell `lexicon.txt` already contains the following entries for the characters that make up "对的"?
对 d ui4
的 d e5
的 d i2
的 d i4
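
To make the last question concrete, here is a minimal sketch of the check we have in mind: it reports whether a word has its own lexicon entry, or is only covered character by character. The helper names `load_lexicon` and `coverage` are our own (hypothetical, not part of icefall), and the sample lines are just the four entries quoted above. Whether character-level coverage is actually sufficient for the recipe is exactly what we are unsure about.

```python
def load_lexicon(lines):
    """Map each lexicon entry to its list of pronunciations.

    Assumes the format shown above: an entry followed by
    whitespace-separated phones, one pronunciation per line.
    """
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        entry, phones = parts[0], parts[1:]
        lexicon.setdefault(entry, []).append(phones)
    return lexicon


def coverage(word, lexicon):
    """Return 'word' if the word has its own entry, 'chars' if every
    character has an entry, else 'missing:' plus the uncovered characters."""
    if word in lexicon:
        return "word"
    missing = [ch for ch in word if ch not in lexicon]
    if not missing:
        return "chars"
    return "missing:" + "".join(missing)


# The four entries quoted above, standing in for the full lexicon.txt.
sample = ["对 d ui4", "的 d e5", "的 d i2", "的 d i4"]
lexicon = load_lexicon(sample)
status = coverage("对的", lexicon)
print(status)
```

With the full file, the list `sample` would be replaced by the lines read from `lexicon.txt` opened with UTF-8 encoding.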
Thanks in advance.