DeepSpeech
Enable multiple input txts
Further to discussion here, this is my PR for enabling generate_lm.py to accept multiple input texts which are combined into a single lm.binary output for onward creation of a scorer.
Parameters remain unchanged so shouldn't impact current code, but if you wish to use multiple input texts, you simply pass in multiple --input_txt parameters, like so:
python generate_lm.py --input_txt input_text_src1.txt.gz --input_txt input_text_src2.txt.gz --output_dir . --top_k 10000 --top_k 20000 --kenlm_bins path/to/kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback
As per the above example, you can also have corresponding top_k parameters for each input_txt. If you provide fewer top_k parameters (e.g. just one) then the last one will be re-used for each subsequent input_txt.
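To illustrate the re-use rule described above, here is a minimal sketch of how repeated flags could be collected and paired. It is not the code from this PR: `pair_inputs_with_top_k` and `build_parser` are hypothetical names, and rejecting more top_k values than inputs is an assumption made for the example.

```python
import argparse


def pair_inputs_with_top_k(input_txts, top_ks):
    """Pair each input text with a top_k, re-using the last top_k when
    fewer top_k values than inputs are given (hypothetical helper)."""
    if not input_txts:
        raise ValueError("at least one --input_txt is required")
    if not top_ks:
        raise ValueError("at least one --top_k is required")
    if len(top_ks) > len(input_txts):
        # Assumption for this sketch: surplus top_k values are an error.
        raise ValueError("more --top_k values than --input_txt values")
    # Pad with the last top_k so every input has a corresponding value.
    padded = list(top_ks) + [top_ks[-1]] * (len(input_txts) - len(top_ks))
    return list(zip(input_txts, padded))


def build_parser():
    """Parser in which both flags may be repeated on the command line."""
    parser = argparse.ArgumentParser()
    # action="append" collects every occurrence of the flag into a list.
    parser.add_argument("--input_txt", action="append", required=True)
    parser.add_argument("--top_k", action="append", type=int, required=True)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    for path, top_k in pair_inputs_with_top_k(args.input_txt, args.top_k):
        print(f"{path}: top_k={top_k}")
```

With three --input_txt values and two --top_k values, the third input would take the second top_k.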
BTW: because the parameters can simply be repeated, I didn't need the delimiter for inputs that I'd mentioned in the Discourse post, and this keeps it pretty simple overall.
As well as the above, I also added a simple parameter check step that runs initially (simply to avoid the situation where you have a trivial error in the parameters and have to wait for the first processing step on a large text file to complete before you realise your mistake and have to run it all again!)
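For anyone curious what such an up-front check might look like, here is a rough sketch under stated assumptions. It is not the check implemented in this PR: `check_parameters` is a hypothetical name, the particular checks are guesses at what would be useful, and it assumes the KenLM tools `lmplz` and `build_binary` sit directly in the kenlm_bins directory without a file extension.

```python
import os


def check_parameters(input_txts, top_ks, kenlm_bins, output_dir):
    """Return a list of problems found before any heavy processing starts.

    Hypothetical helper; an empty list means nothing obvious is wrong.
    """
    errors = []
    for path in input_txts:
        if not os.path.isfile(path):
            errors.append(f"input file not found: {path}")
    for top_k in top_ks:
        if top_k <= 0:
            errors.append(f"top_k must be positive, got {top_k}")
    # Assumption: these KenLM binaries are needed and have no extension.
    for tool in ("lmplz", "build_binary"):
        if not os.path.isfile(os.path.join(kenlm_bins, tool)):
            errors.append(f"KenLM binary not found: {tool} in {kenlm_bins}")
    if not os.path.isdir(output_dir):
        errors.append(f"output directory not found: {output_dir}")
    return errors
```

Collecting every problem into a list, rather than stopping at the first one, lets all the mistakes be reported in a single run.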
And I switched it to use f-strings over .format (hope that's okay?)
Documentation: If this is accepted, provided people agree I plan to put a few extra basic details relating to this in the docs (eg here) so I'd get a corresponding PR ready for that shortly too. And more generally, I had some ideas I was going to share w/ @KathyReid for the Playbook relating to this.
Oh yeah, we will also have to cover this use case in CI; you might want to have a look at taskcluster/tc-scorer-tests.sh: https://github.com/mozilla/DeepSpeech/blob/385c8c769bc4aed5e6979a239591486c44f3471d/taskcluster/tc-scorer-tests.sh
Quick update: I got unexpectedly delayed with some things that came up, so I made less progress than I'd hoped but I'll continue it in the evenings this week/next weekend to get this sorted.
Gentle ping?
Sorry, I haven't had a great deal of time so this isn't likely to proceed much for a bit. I'll try a bit more next weekend
@nmstoker Gentle ping? It's okay if you don't have time ;)