Enable multiple input txts

Open nmstoker opened this issue 3 years ago • 5 comments

Further to discussion here, this is my PR for enabling generate_lm.py to accept multiple input texts which are combined into a single lm.binary output for onward creation of a scorer.

Parameters remain unchanged so shouldn't impact current code, but if you wish to use multiple input texts, you simply pass in multiple --input_txt parameters, like so:

python generate_lm.py --input_txt input_text_src1.txt.gz --input_txt input_text_src2.txt.gz --output_dir . --top_k 10000 --top_k 20000 --kenlm_bins path/to/kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback

As per the above example, you can also have corresponding top_k parameters for each input_txt. If you provide fewer top_k parameters than input_txt parameters (eg just one), then the last top_k value will be re-used for each subsequent input_txt.
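For anyone who wants to see the idea without opening the diff, here is a minimal sketch of how repeated parameters and the top_k reuse rule could be handled with argparse. It is not the code from this PR: the helper name pair_inputs_with_top_k is made up for illustration, and rejecting more top_k values than inputs is an assumption rather than something stated above.

```python
import argparse


def parse_args(argv=None):
    # action="append" collects every occurrence of a repeated flag into a list.
    parser = argparse.ArgumentParser(description="Sketch: repeated --input_txt / --top_k")
    parser.add_argument("--input_txt", action="append", required=True)
    parser.add_argument("--top_k", action="append", type=int, required=True)
    return parser.parse_args(argv)


def pair_inputs_with_top_k(input_txts, top_ks):
    """Pair each input with a top_k, re-using the last top_k when there are fewer.

    Hypothetical helper, not taken from generate_lm.py.
    """
    if not top_ks:
        raise ValueError("at least one top_k value is required")
    if len(top_ks) > len(input_txts):
        # Assumption: surplus top_k values are treated as a user error.
        raise ValueError("more top_k values than input_txt values")
    padded = top_ks + [top_ks[-1]] * (len(input_txts) - len(top_ks))
    return list(zip(input_txts, padded))


# Example with an explicit argument list, so nothing is read from sys.argv.
args = parse_args([
    "--input_txt", "input_text_src1.txt.gz",
    "--input_txt", "input_text_src2.txt.gz",
    "--input_txt", "input_text_src3.txt.gz",
    "--top_k", "10000",
    "--top_k", "20000",
])
for path, top_k in pair_inputs_with_top_k(args.input_txt, args.top_k):
    print(f"{path}: top_k={top_k}")
```

In this sketch the third input has no top_k of its own, so it picks up the last value given.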

BTW: because I'm using repeating parameters, I didn't need the delimiter for inputs that I'd mentioned in the Discourse post, and this keeps it pretty simple overall.

As well as the above, I also added a simple parameter check step that runs first. This avoids the situation where a trivial error in the parameters only shows up after the first processing step on a large text file has completed, at which point you have to run it all again!
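To illustrate the kind of up-front check meant here, a rough sketch follows. The function name check_params and the exact set of checks are assumptions for illustration and may differ from what the PR does; lmplz and build_binary are the KenLM tools this sketch assumes are needed.

```python
import os


def check_params(input_txts, top_ks, kenlm_bins, output_dir):
    """Return a list of problems found; an empty list means the parameters look usable.

    Hypothetical helper: everything is validated before any expensive processing starts.
    """
    problems = []
    for path in input_txts:
        if not os.path.isfile(path):
            problems.append(f"input file not found: {path}")
    if any(k <= 0 for k in top_ks):
        problems.append("top_k values must be positive integers")
    if len(top_ks) > len(input_txts):
        # Assumption: surplus top_k values are treated as an error.
        problems.append("more top_k values than input_txt values")
    if not os.path.isdir(kenlm_bins):
        problems.append(f"kenlm_bins directory not found: {kenlm_bins}")
    else:
        # Assumed set of required KenLM binaries.
        for tool in ("lmplz", "build_binary"):
            if not os.path.isfile(os.path.join(kenlm_bins, tool)):
                problems.append(f"KenLM tool missing from kenlm_bins: {tool}")
    if not os.path.isdir(output_dir):
        problems.append(f"output directory not found: {output_dir}")
    return problems
```

A caller could print every problem and exit before touching the first large text file, so all mistakes are reported in one go rather than one per run.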

And I switched it to use f-strings over .format (hope that's okay?)
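For reference, the two styles produce the same string; the values and message below are hypothetical and only there to show the difference. Note that f-strings need Python 3.6 or later.

```python
# Hypothetical values, only to compare the two formatting styles.
top_k = 10000
input_path = "input_text_src1.txt.gz"

# Old style: positional placeholders filled in by str.format.
old_style = "Saving top {} words from {}".format(top_k, input_path)

# New style: an f-string with the expressions inline.
new_style = f"Saving top {top_k} words from {input_path}"

print(new_style)
```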

Documentation: if this is accepted and people agree, I plan to put a few extra basic details relating to this in the docs (eg here), so I'd get a corresponding PR ready for that shortly too. More generally, I had some ideas relating to this that I was going to share w/ @KathyReid for the Playbook.

nmstoker, Feb 25 '21 14:02

Oh yeah, we will also have to cover this use case in CI, so you might want to have a look at taskcluster/tc-scorer-tests.sh: https://github.com/mozilla/DeepSpeech/blob/385c8c769bc4aed5e6979a239591486c44f3471d/taskcluster/tc-scorer-tests.sh

lissyx, Feb 25 '21 19:02

Quick update: I got unexpectedly delayed with some things that came up, so I made less progress than I'd hoped but I'll continue it in the evenings this week/next weekend to get this sorted.

nmstoker, Feb 28 '21 23:02

Gentle ping?

lissyx, Mar 11 '21 18:03

Sorry, I haven't had a great deal of time so this isn't likely to proceed much for a bit. I'll try a bit more next weekend

nmstoker, Mar 15 '21 01:03

@nmstoker Gentle ping? It's okay if you don't have time ;)

lissyx, Apr 06 '21 15:04