
Spanish Gigaword text based POCOLM and RNNLM training recipe

Open saikiranvalluri opened this issue 6 years ago • 9 comments

We introduce the following features into the existing fisher_spanish recipe:

  • Optional text-processing scripts for the Spanish Gigaword text corpus.
  • A 3-gram POCOLM trained on the Fisher train and Gigaword texts.
  • The POCOLM wordlist is derived from the relative frequency of words in each corpus, weighted by the metaparameter weight of each text corpus (see the first sketch below).
  • OOVs from the POCOLM wordlist are added to the ASR lexicon and the RNNLM wordlist using a seq2seq transformer-based G2P model (see the second sketch below) - https://github.com/cmusphinx/g2p-seq2seq
  • An optional RNNLM is trained for 5 epochs, using the two text corpora as training sets and the Fisher dev2 partition as the dev set.
  • After the chain model is trained, the decoding graph for the test sets is built from the extended ASR lexicon above, and the decoded lattices are rescored with the trained Gigaword RNNLM (see the rescoring sketch after the results below).
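A minimal sketch of the weighted wordlist derivation (first sketch); the file names and the two weight values are placeholders, and in the recipe the weights come from the trained POCOLM metaparameters:

```bash
#!/usr/bin/env bash
# Sketch: build a wordlist from per-corpus relative word frequencies,
# weighted by each corpus's POCOLM metaparameter weight. The paths and
# the weight values are assumed for illustration.
fisher_weight=0.7; giga_weight=0.3
num_words=40000

# Each counts file holds "count word" lines (e.g. from `uniq -c` on tokens).
weighted_freqs() {  # usage: weighted_freqs <counts-file> <weight>
  local total
  total=$(awk '{n += $1} END {print n}' "$1")
  awk -v t="$total" -v w="$2" '{printf "%s %.10g\n", $2, w * $1 / t}' "$1"
}

cat <(weighted_freqs data/fisher_counts.txt "$fisher_weight") \
    <(weighted_freqs data/giga_counts.txt "$giga_weight") \
  | awk '{f[$1] += $2} END {for (w in f) printf "%.10g %s\n", f[w], w}' \
  | sort -gr | head -n "$num_words" | awk '{print $2}' > data/wordlist.txt
```

The G2P step for the OOVs (second sketch) might look roughly like this, following the g2p-seq2seq README; the model directory and file names are placeholders:

```bash
# Train a G2P model on the existing lexicon (or reuse a pretrained one);
# all paths are placeholders.
g2p-seq2seq --train data/local/dict/lexicon.dic --model_dir exp/g2p

# Generate pronunciations for the OOV words and extend the ASR lexicon.
g2p-seq2seq --decode data/local/oov_words.txt --model_dir exp/g2p \
  --output data/local/oov_lexicon.txt
cat data/local/oov_lexicon.txt >> data/local/dict/lexicon.txt
sort -u data/local/dict/lexicon.txt -o data/local/dict/lexicon.txt
```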

Using the Gigaword text-based RNNLM to rescore the baseline 3-gram LM decoded lattices, we achieved 20.84% WER on the Fisher Spanish test partition and 24.67% WER on the Fisher dev partition.
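For reference, the rescoring step uses Kaldi's pruned lattice rescoring, roughly along these lines; all directory names and the interpolation weight below are placeholders:

```bash
# Rescore the baseline 3-gram decoded lattices with the trained RNNLM.
# Directory names and the --weight value are placeholders.
rnnlm/lmrescore_pruned.sh --cmd "$decode_cmd" --weight 0.5 --max-ngram-order 4 \
  data/lang_test exp/rnnlm data/test \
  exp/chain/tdnn/decode_test exp/chain/tdnn/decode_test_rnnlm
```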

saikiranvalluri avatar Mar 18 '19 14:03 saikiranvalluri

Is there a reason it doesn't make sense to just replace the current example with this? I doubt too many people were using the old example.

Are you using a graphemic or phonemic lexicon? A graphemic lexicon might be a reasonable choice in Spanish, for simplification.

danpovey avatar Mar 19 '19 18:03 danpovey

Is there a reason it doesn't make sense to just replace the current example with this? I doubt too many people were using the old example.

I included the end-to-end process, from processing the downloaded Spanish Gigaword corpus to training the RNNLM on that data, in stages 0 and 1 of run.sh. Also, we see more than 0.4% absolute WER improvement on the test partitions after adding the Spanish Gigaword text to the RNNLM training data. The Gigaword-based RNNLM might prove more significant for WER improvement in the extended-lexicon scenario and on more general test sets.
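Those stages follow the usual Kaldi run.sh gating pattern; the local script names below are placeholders for the actual stage contents:

```bash
stage=0
. ./path.sh
. utils/parse_options.sh

if [ $stage -le 0 ]; then
  # Optional: clean and normalize the downloaded Spanish Gigaword text.
  local/process_gigaword.sh "$GIGAWORD_DIR" data/gigaword   # placeholder name
fi

if [ $stage -le 1 ]; then
  # Optional: train the RNNLM on the Fisher + Gigaword text.
  local/train_gigaword_rnnlm.sh data/gigaword data/train exp/rnnlm  # placeholder
fi
```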

Are you using a graphemic or phonemic lexicon? A graphemic lexicon might be a reasonable choice in Spanish, for simplification.

I am using the same rule-based Callhome Spanish lexicon, simplified to 36 phones by removing accented letters and digits from the non-silence phones list, so it is close to a graphemic lexicon.
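For illustration, entries in such a near-graphemic lexicon look roughly like this; the two entries are invented examples, not taken from the actual lexicon:

```bash
# Format: word followed by its phones; with accents and digits removed,
# most Spanish words map almost letter-for-letter to phones.
# Invented example entries:
cat <<'EOF' >> data/local/dict/lexicon.txt
casa k a s a
pero p e r o
EOF
```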

saikiranvalluri avatar Mar 24 '19 05:03 saikiranvalluri

I think I would like you to extend the existing recipe rather than starting a new one. Can you do that?

danpovey avatar Mar 24 '19 15:03 danpovey

I think I would like you to extend the existing recipe rather than starting a new one. Can you do that?

Sure, sir. Before that, let me run the script end-to-end and make sure we get a better WER using the Gigaword RNNLM than the original s5 recipe, even with an extended vocabulary. I will get back to you on this soon.

saikiranvalluri avatar Mar 25 '19 06:03 saikiranvalluri

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 19 '20 08:06 stale[bot]

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.

stale[bot] avatar Jul 19 '20 05:07 stale[bot]

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

stale[bot] avatar Sep 17 '20 08:09 stale[bot]

@saikiranvalluri, where are we on this?

kkm000 avatar Sep 22 '21 21:09 kkm000

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

stale[bot] avatar Nov 22 '21 16:11 stale[bot]