icefall

Questions about modifying prepare.sh for training ASR model on custom data

Open daocunyang opened this issue 1 year ago • 2 comments

Hi, opening a new issue since the old one has been closed.

We are writing our own prepare.sh to train an ASR model on our own Chinese audio data, following the example of aishell's prepare.sh. Given our lack of experience, we are unsure about some of its contents. Our questions are below:

  1. What role does vocab_sizes play, and how should we decide what value to assign to it? Do we need it at all?

  2. Looking at stage 5 to stage 8 of Aishell's prepare.sh, from what I can tell, we need to replace aishell_transcript_v0.8.txt (line 151) with our own text file, correct? Other than that, is there anything else we need to modify to prepare our own data during these stages?

  3. We currently have only a few hundred audio files for training. How do you suggest we divide the data between the training and test sets? I'm thinking of using most or possibly all of them for training, and few or even none for the test set.

  4. Just to confirm: can we remove the part related to Whisper large-v3 at the end of prepare.sh, since we are not using Whisper?

  5. We plan to use the lexicon.txt file from Aishell, but we notice that certain words that are important to us are missing from it. For example, we want to add the word "对的". Is it necessary to add it, given that the lexicon.txt from Aishell already contains the following entries for the characters that make up "对的"?

对 d ui4
的 d e5
的 d i2
的 d i4
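
To make question 5 concrete, here is a minimal sketch of the check we have in mind. It assumes lexicon.txt has one entry per line in the form "word phone1 phone2 ...", as in the excerpt above; the function names `parse_lexicon` and `coverage` are our own, not part of icefall. It only reports whether a word is listed as a whole entry or only character by character, and does not tell us whether the recipe actually needs the whole-word entry.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Map each word to the list of its pronunciations (lists of phones)."""
    lexicon = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        lexicon[parts[0]].append(parts[1:])
    return lexicon


def coverage(word, lexicon):
    """Return 'word' if the whole word has an entry, 'chars' if every
    character has its own entry, and 'missing' otherwise."""
    if word in lexicon:
        return "word"
    if all(ch in lexicon for ch in word):
        return "chars"
    return "missing"


# The four entries quoted above, used here as sample input.
sample = ["对 d ui4", "的 d e5", "的 d i2", "的 d i4"]
lex = parse_lexicon(sample)
print(coverage("对的", lex))
```

With the four entries above this reports "chars" for "对的": both characters are present, but the word itself is not, and "的" has three candidate pronunciations.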

Thanks in advance.

daocunyang · May 23 '24 04:05