lid_kaldi

Training & languages

Open Mlallena opened this issue 3 years ago • 12 comments

If I wanted to add new languages to this program, or train the ones already present, how would I have to do it?

Also, you should update the link in the first instruction: I had to replace "latest" with "1.0.1" before I could download it.

Mlallena avatar Jul 08 '21 08:07 Mlallena

Thank you. Unfortunately, you have to train your own model for a new language. Alternatively, you can try https://huggingface.co/TalTechNLP/voxlingua107-epaca-tdnn

igorsitdikov avatar Jul 08 '21 20:07 igorsitdikov

What did you use to train your own model? I'm asking because (unless I missed something) this repository doesn't have any code that is clearly used for training.

Mlallena avatar Jul 09 '21 09:07 Mlallena

Have a look at #1.

igorsitdikov avatar Jul 09 '21 11:07 igorsitdikov

Thanks, I'll have a look.

Mlallena avatar Jul 09 '21 11:07 Mlallena

OK, I have been checking, and it could work. The thing is, from what you said in #1, the only modification you made was to the utt2spk file, but where is this file stored? I'm going to go out on a limb and say that it is stored in a data folder within v2, but the main problem is that the run.sh file doesn't refer to that file. I'd also have to change which corpus it targets, since my audio files are in a different folder.

Any help you can give me would be welcome.

Mlallena avatar Jul 12 '21 12:07 Mlallena
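For readers following along, here is a minimal sketch of what a language-labelled utt2spk could look like. It assumes that the change discussed in #1 amounts to putting the language label in the speaker column, which this thread does not confirm. Kaldi's utt2spk format is one `<utterance-id> <speaker-id>` pair per line, and recipes conventionally keep it at data/<set>/utt2spk. The function name, the example utterance IDs and the output path are hypothetical.

```python
from typing import Dict, List


def utt2spk_lines(utt2lang: Dict[str, str]) -> List[str]:
    """Return utt2spk lines with the language label in the speaker column.

    Assumption: the language label replaces the speaker ID. Lines are
    sorted by utterance ID, since Kaldi expects sorted data files.
    """
    return [f"{utt} {lang}" for utt, lang in sorted(utt2lang.items())]


if __name__ == "__main__":
    # Hypothetical utterance IDs, prefixed with the language label.
    mapping = {"ru-utt2": "ru", "en-utt1": "en", "pl-utt3": "pl"}
    lines = utt2spk_lines(mapping)
    print("\n".join(lines))
    # To write the file (hypothetical path):
    # with open("data/train/utt2spk", "w") as f:
    #     f.write("\n".join(lines) + "\n")
```

Prefixing each utterance ID with its label keeps utterance and "speaker" sort orders consistent, which Kaldi's data validation generally expects.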

Hi Igor, I am training a Kaldi recipe on VoxLingua data for a language identification task, but I could not find the trials file. Could you please share the trials file with me? Many thanks.

asadullah797 avatar Mar 27 '22 09:03 asadullah797

Hello @asadullah797. You can generate the file on your own. It will look something like this:

lang-id-A utt-id-A target
lang-id-A utt-id-B nontarget
lang-id-A utt-id-C nontarget
lang-id-B utt-id-A nontarget
lang-id-B utt-id-B target

For 3 files and 3 languages:

en utt-en target
en utt-ru nontarget
en utt-pl nontarget
ru utt-en nontarget
ru utt-ru target
ru utt-pl nontarget
pl utt-en nontarget
pl utt-ru nontarget
pl utt-pl target

Sorry, I don't remember exactly; columns 1 and 2 may need to be swapped.

igorsitdikov avatar Mar 28 '22 05:03 igorsitdikov
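The listing above can be generated mechanically. The following is a minimal sketch, assuming you already have a mapping from utterance ID to its single language; `make_trials` and the `lang_first` flag are hypothetical names introduced here, and the flag exists only because the column order is uncertain in this thread.

```python
from typing import Dict, List


def make_trials(utt2lang: Dict[str, str], lang_first: bool = True) -> List[str]:
    """Build one trial line per (language, utterance) pair.

    A pair is "target" when the utterance's own language matches the
    language in the trial, and "nontarget" otherwise.
    """
    languages = sorted(set(utt2lang.values()))
    lines = []
    for utt in sorted(utt2lang):
        for lang in languages:
            label = "target" if utt2lang[utt] == lang else "nontarget"
            # Column order is an assumption; flip lang_first if scoring fails.
            first, second = (lang, utt) if lang_first else (utt, lang)
            lines.append(f"{first} {second} {label}")
    return lines


if __name__ == "__main__":
    utt2lang = {"utt-en": "en", "utt-ru": "ru", "utt-pl": "pl"}
    print("\n".join(make_trials(utt2lang)))
```

With 3 utterances and 3 languages this yields 9 lines, 3 of them target. Passing `lang_first=False` produces the swapped layout.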

For the language ID task, how do you define a line such as

lang-id-A utt-id-B nontarget

I mean, how do you decide whether a given utterance is target or nontarget? Thanks.

asadullah797 avatar Mar 28 '22 06:03 asadullah797

You have a dataset with 3 languages, and each wav file contains only one language. You should have a mapping from wav file to language, so that language is the target for the file. The other 2 languages are nontarget for that file.

igorsitdikov avatar Mar 28 '22 06:03 igorsitdikov

Just to confirm (wav1 -> en, wav2 -> es, wav3 -> de):

en wav1 target
es wav1 nontarget
de wav1 nontarget

and so on for the other cases as well.

asadullah797 avatar Mar 28 '22 06:03 asadullah797

I think so. But as I wrote before, if it does not work, try swapping columns 1 and 2, like this. Sorry, I really don't remember.

wav1 en target
wav1 es nontarget
wav1 de nontarget

igorsitdikov avatar Mar 28 '22 06:03 igorsitdikov

Hi Igor, I have prepared the trials file using https://github.com/kaldi-asr/kaldi/blob/master/egs/aishell/v1/local/produce_trials.py but at the end of the script I get this error:

Key de__071xs-uBRZo__U__S10---0150.960-0167.120 not present in training iVectors

The key in the message above is the utterance ID. Please note that I created the trials file from the test data utt2spk.

asadullah73-ce avatar Mar 31 '22 17:03 asadullah73-ce
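One way to narrow down an error like this is to compare the IDs in each trials column against the keys of the table the scoring step reads. The sketch below assumes both files are plain whitespace-separated text, with the key as the first field of each table line; `read_keys` and `missing_ids` are hypothetical helpers, not part of Kaldi. One possible reading of the message, not confirmed in this thread, is that an utterance ID sits in the column that is looked up among the training iVectors, which would fit the column-swap caveat mentioned earlier.

```python
from typing import Iterable, List, Set


def read_keys(table_lines: Iterable[str]) -> Set[str]:
    """Collect the first whitespace-separated field (the key) of each line."""
    return {line.split()[0] for line in table_lines if line.strip()}


def missing_ids(trial_lines: Iterable[str], keys: Set[str], col: int = 0) -> List[str]:
    """Return IDs from trials column `col` that are absent from `keys`."""
    missing: List[str] = []
    for line in trial_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed trial lines
        if fields[col] not in keys and fields[col] not in missing:
            missing.append(fields[col])
    return missing


if __name__ == "__main__":
    # In-memory stand-ins for a key table and a trials file (made-up data).
    table = ["en /path/to/en.vec", "ru /path/to/ru.vec"]
    trials = ["utt-1 en target", "utt-1 ru nontarget"]
    keys = read_keys(table)
    print("column 0 missing:", missing_ids(trials, keys, col=0))
    print("column 1 missing:", missing_ids(trials, keys, col=1))
```

If column 0 reports every utterance ID as missing while column 1 reports nothing, the two columns are probably in the opposite order to what the scoring step expects.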