wav2letter
Streaming convnets with lexicon-free decoder?
Question
Hi,
Is there a way to run the streaming convnets examples (Interactive/Streaming ASR examples) with a LexiconFreeDecoder? I do see a class for LexiconFreeDecoder in the source code. In the DecoderFactory (link), if a lexicon trie is not supplied, a LexiconFreeDecoder is instantiated instead of a LexiconDecoder.
I tried supplying an empty path for the lexicon dictionary, and all I got were empty transcriptions. I was running the example with the AM and LM models provided on the inference wiki. The AM uses 10k sentencepiece tokens, and I am not sure whether the LM is word-level or token-level.
What am I missing here? Is the issue that the LM is possibly word-level, and hence OOV words cannot be decoded? Even if that were the case, words that are in the vocabulary should have been transcribed (my sample audio has plenty of them).
Thanks!
As far as I know, the LM is word-based, so to properly use lexicon-free decoding you need a word-piece (wp) LM (in your setup you are applying a word LM to wp tokens, which will return only unk predictions). I would still expect some non-empty output, though, right @xuqiantong @vineelpratap?
to properly use lexicon-free decoding you need a wp LM
@tlikhomanenko: Is there an example of a recipe to create a wp LM from the LibriSpeech corpus? In the lexicon-free recipe, I see that word- and character-level LMs are created, but not a 10k-token wp LM.
I suppose I can preprocess "$DATA_DST/text/librispeech-lm-norm.txt.lower.shuffle" to contain tokens instead of words and then train a 3/4-gram wp LM, but I was wondering if you distribute one as part of some recipe.
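For illustration, the target format of such a preprocessed corpus can be sketched with a toy greedy longest-match tokenizer. This is not the actual sentencepiece algorithm, and the token set below is made up; it only shows how each word becomes a sequence of "_"-prefixed pieces before n-gram training.

```python
# Toy sketch of the word-piece corpus format (NOT real sentencepiece):
# greedily split each word into pieces from a hypothetical token set,
# marking the word-initial piece with '_' per sentencepiece convention.

def to_word_pieces(sentence, tokens):
    """Split each word into pieces from `tokens`; '_' marks word starts."""
    pieces = []
    for word in sentence.lower().split():
        marked = "_" + word
        i = 0
        while i < len(marked):
            # Greedily take the longest token matching at position i.
            for j in range(len(marked), i, -1):
                if marked[i:j] in tokens:
                    pieces.append(marked[i:j])
                    i = j
                    break
            else:
                # Fall back to a single character.
                pieces.append(marked[i])
                i += 1
    return " ".join(pieces)

# Hypothetical token inventory (a real one has ~10k entries).
tokens = {"_the", "_earn", "ing", "s", "_t", "a", "x"}
print(to_word_pieces("the earnings tax", tokens))
# -> _the _earn ing s _t a x
```

Each line of the resulting corpus would then be a space-separated piece sequence like the one printed above, which KenLM treats as its "words".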
Thanks!
Please see example here https://github.com/facebookresearch/wav2letter/tree/master/recipes/sota/2019/lm.
Thanks @tlikhomanenko.
I ran the prepare_wp_data.py script and got tokenized train, dev-clean, and dev-other files in the $W2LDIR/decoder directory:
w2luser@w2luser-MS-7A40:~$ ls $W2LDIR/decoder/ -1
lm_wp_10k.dev-clean
lm_wp_10k.dev-other
lm_wp_10k.train
I then fit a 6-gram LM on lm_wp_10k.train as follows:
KENLM=~/Projects/kenlm/build/bin
MODEL_DST=$W2LDIR
# --prune 0 1 removes all the 2-gram and higher order n-grams that occur only once in the corpus
"$KENLM/lmplz" trie -T /tmp --discount_fallback -o 6 --prune 0 1 < "$MODEL_DST/decoder/lm_wp_10k.train" > 6_gram_lm_wp_10k.train.arpa
# Quantize the LM model to reduce memory size
"$KENLM/build_binary" trie -a 22 -q 8 -b 8 6_gram_lm_wp_10k.train.arpa 6_gram_lm_wp_10k.train.bin.qt
I then used the 6_gram_lm_wp_10k.train.bin.qt model to run the interactive streaming ASR example (I hardcoded the lexicon file path to an empty string here):
./cmake-build-debug/inference/inference/examples/interactive_streaming_asr_example --input_files_base_path /data/podcaster/model/wav2letter --language_model_file 6_gram_lm_wp_10k.train.bin.qt
Started features model file loading ...
Completed features model file loading elapsed time=103 milliseconds
Started acoustic model file loading ...
Completed acoustic model file loading elapsed time=791 milliseconds
Started tokens file loading ...
Completed tokens file loading elapsed time=1810 microseconds
Tokens loaded - 9998 tokens
Started decoder options file loading ...
Completed decoder options file loading elapsed time=91 microseconds
Started create decoder ...
[Letters] 9998 tokens loaded.
Completed create decoder elapsed time=44237 microseconds
Entering interactive command line shell. enter '?' for help.
------------------------------------------------------------
$>input=/home/w2luser/audio/cnbc.wav
Transcribing file:/home/w2luser/audio/cnbc.wav to:stdout
Creating LexiconFreeDecoder instance.
#start (msec), end(msec), transcription
terminate called after throwing an instance of 'std::runtime_error'
what(): [KenLM] Invalid user token index: 5
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
What am I missing?
Thanks!
@tlikhomanenko: It looks to me like the lexicon-free decoder cannot be used for streaming convnets, because they use KenLM (link here), and KenLM currently won't work without a lexicon dict, since it requires a lexicon to score tokens (link here; usrToLmIdxMap_.size() is 0 in the absence of a lexicon).
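To make the failure mode concrete, here is a rough sketch (illustrative names, not the real wav2letter/KenLM API) of the user-index-to-LM-vocabulary map that the lexicon normally supplies. With no lexicon the map is empty, so any token index lookup fails, which is consistent with the "Invalid user token index" error above.

```python
# Illustrative sketch (not the real wav2letter/KenLM API) of why an
# empty lexicon breaks scoring: the decoder looks tokens up in a
# user-index -> LM-vocab-index map that is built from the lexicon.

def build_usr_to_lm_map(lexicon_words, lm_vocab):
    """Map decoder-side word indices to LM vocabulary indices."""
    return {i: lm_vocab.get(w, lm_vocab["<unk>"])
            for i, w in enumerate(lexicon_words)}

lm_vocab = {"<unk>": 0, "the": 1, "cat": 2}

with_lexicon = build_usr_to_lm_map(["the", "cat", "dog"], lm_vocab)
print(with_lexicon)          # {0: 1, 1: 2, 2: 0} -- "dog" falls back to <unk>

without_lexicon = build_usr_to_lm_map([], lm_vocab)
# Any lookup now fails, analogous to "Invalid user token index: 5":
print(5 in without_lexicon)  # False
```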
Do you have any suggestions as to how to use KenLM without a lexicon?
Thanks!
Could you try just putting, as the lexicon file, one with all tokens and the same spelling for each? For example, if your token set is {ab, cd, ef}, the lexicon file will be:
ab ab
cd cd
ef ef
In that case the wordMap for KenLM will be a mapping between AM tokens and their strings for KenLM; I think this should work (maybe I am still missing something). cc @xuqiantong
Better is to construct the wordMap here https://github.com/facebookresearch/wav2letter/blob/v0.2/inference/inference/decoder/Decoder.cpp#L61 using the token set. Or, if you provide the lexicon file as I proposed above, be sure to set the trie to none and use the lexicon-free decoder creation (the current ifs with a non-empty lexicon will force lexicon decoding).
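The dummy lexicon suggested above (every token spelled as itself) can be generated with a short script; the file names in the comment are illustrative:

```python
# Generate a "dummy" lexicon mapping every AM token to itself,
# as suggested above: each lexicon line is '<word> <spelling>'.

def make_dummy_lexicon(tokens):
    """Spelling == word for every token."""
    return "\n".join(f"{tok} {tok}" for tok in tokens)

tokens = ["ab", "cd", "ef"]  # stand-in for the real ~10k token set
print(make_dummy_lexicon(tokens))
# ab ab
# cd cd
# ef ef

# Over real files (names illustrative):
# with open("tokens.txt") as f, open("lexicon.txt.dummy", "w") as out:
#     for tok in (line.strip() for line in f):
#         if tok:
#             out.write(f"{tok} {tok}\n")
```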
@tlikhomanenko: I created a dummy lexicon file as follows:
w2luser@w2luser-ThinkPad-X1-Carbon-6th:/data/podcaster/model/wav2letter$ head lexicon-dummy.txt
_the _the
_and _and
_of _of
_to _to
_a _a
s s
_in _in
_i _i
_he _he
_that _that
and ran the example again:
Entering interactive command line shell. enter '?' for help.
------------------------------------------------------------
$>input=/home/w2luser/audio/cnbc.wav
Transcribing file:/home/w2luser/audio/cnbc.wav to:stdout
Creating LexiconDecoder instance.
#start (msec), end(msec), transcription
0,1000,
1000,2000,_let _me _just _ask
2000,3000,_you _about _tu a
3000,4000,_because _they _had _their _earn ing
4000,5000,ing s _last _night
5000,6000,_they _beat
6000,7000,_a _you _know
7000,8000,_that _the _bear
8000,9000,_would _say _they _beat
9000,10000,_because _they _sold _these
10000,11000,_tax _credit s _and _how
11000,12000,_how _long _are _these
12000,13000,_te _credit _so
13000,14000,
14000,15000,_er _going _to _be _you
15000,16000,_know _a
16000,17000,_business _model
The transcription with the original lexicon is as follows:
Entering interactive command line shell. enter '?' for help.
------------------------------------------------------------
$>input=/home/w2luser/audio/cnbc.wav
Transcribing file:/home/w2luser/audio/cnbc.wav to:stdout
Creating LexiconDecoder instance.
#start (msec), end(msec), transcription
0,1000,
1000,2000,let me just ask
2000,3000,you about tua
3000,4000,because they had the
4000,5000,earnings last night
5000,6000,they beat
6000,7000,you know
7000,8000,the bears
8000,9000,would say they beat
9000,10000,because they sold these
10000,11000,tax and how
11000,12000,how long are these
12000,13000,te credit
13000,14000,
14000,15000,going to be you
15000,16000,know a
16000,17000,business model
The problem is that if you provide a dummy lexicon file, it still uses the LexiconDecoder. How do you force it to use the LexiconFreeDecoder? I assume the LexiconFreeDecoder has logic built around identifying word boundaries and fusing tokens together into words.
Thanks.
You need to comment out this if block too https://github.com/facebookresearch/wav2letter/blob/v0.2/inference/inference/decoder/Decoder.cpp#L74, because right now it creates the trie, and here, if the trie is not none https://github.com/facebookresearch/wav2letter/blob/v0.2/inference/inference/decoder/Decoder.cpp#L97, it will run lexicon-based decoding.
@tlikhomanenko: I am deliberately hardcoding the lexicon file path in DecoderFactory to an empty string, which causes DecoderFactory to create an empty trie_, thus forcing DecoderFactory to return an instance of LexiconFreeDecoder.
The problem is that both LexiconDecoder and LexiconFreeDecoder use KenLM, and KenLM requires a lexicon file (it stores lexicon words and index mappings as a dictionary).
How do I get around this?
Thanks!
Earlier you said that you created the dummy lexicon. So first, please test with the hack where you provide this mapping to KenLM. I suggested two ways of doing this:
- use lexicon where you put only tokens and their mapping into themselves + force trie to be empty in the code
- hack code to construct kenlm dict with tokens
Is it still unclear what I mean?
cc @xuqiantong
@tlikhomanenko: Thanks.
- use lexicon where you put only tokens and their mapping into themselves + force trie to be empty in the code
To this, I created the following files:
/data/podcaster/model/wav2letter$ head -5 tokens.txt
_the
_and
_of
_to
_a
/data/podcaster/model/wav2letter$ head -5 lexicon.txt.dummy
_the _the
_and _and
_of _of
_to _to
_a _a
I simply replicated the column of tokens.txt twice to create this dummy lexicon.txt (except for the last line of tokens.txt, which contains the # token).
I had earlier created a word-piece LM by following your instructions here. I also commented out the if block here and made sure, by placing a breakpoint, that the LexiconFreeDecoder is being used.
For a sample file, I get the following transcription:
~$ /home/w2luser/Projects/wav2letter/cmake-build-debug/inference/inference/examples/interactive_streaming_asr_example --input_files_base_path /data/podcaster/model/wav2letter --language_model_file 6_gram_lm_wp_10k.train.bin.qt --lexicon_file lexicon.txt.dummy
Entering interactive command line shell. enter '?' for help.
------------------------------------------------------------
$>input=/home/w2luser/audio/cnbc.wav
Transcribing file:/home/w2luser/audio/cnbc.wav to:stdout
Creating LexiconFreeDecoder instance.
#start (msec), end(msec), transcription
0,1000,
1000,2000,_let_me_just_ask_you
2000,3000,_you_about_tu
3000,4000,_because_they_had_their_earning
4000,5000,ings_last_night
5000,6000,_they_beat
6000,7000,_you_know
7000,8000,_bears
8000,9000,_would_say_they_beat
9000,10000,_because_they_sold_these
10000,11000,_tax_and_help
11000,12000,_how_long_are_these
12000,13000,_te_credit
13000,14000,
14000,15000,_going_be_you_know
15000,16000,_know_business
16000,17000,_business_model
17000,18000,_what_do_you_say_you_bought
18000,19000,_bought_you_you_bought
19000,20000,_into_test_but_the
20000,21000,_dead_long_time_ago
21000,22000,
Process finished with exit code 1
The transcription with the if block added back and the proper lexicon file (downloaded from the wiki) is as follows:
~$ /home/w2luser/Projects/wav2letter/cmake-build-debug/inference/inference/examples/interactive_streaming_asr_example --input_files_base_path /data/podcaster/model/wav2letter
Entering interactive command line shell. enter '?' for help.
------------------------------------------------------------
$>input=/home/w2luser/audio/cnbc.wav
Transcribing file:/home/w2luser/audio/cnbc.wav to:stdout
Creating LexiconDecoder instance.
#start (msec), end(msec), transcription
0,1000,
1000,2000,let me just ask
2000,3000,you about tua
3000,4000,because they had the
4000,5000,earnings last night
5000,6000,they beat
6000,7000,a you know
7000,8000,that the bears
8000,9000,would say they beat
9000,10000,because they sold these
10000,11000,tax credits and how
11000,12000,how long are these
12000,13000,te credit so
13000,14000,
14000,15000,er going to be you
15000,16000,know a
16000,17000,business model
17000,18000,what do you say you
18000,19000,bought you you bought
19000,20000,into test but to the
20000,21000,dead a long time ago
21000,22000,
Process finished with exit code 15
What am I missing?
Thanks!
It seems the post-processing of the transcription is not fully correct with lexicon-free decoding (the output still has _ separators and unmerged tokens). cc @xuqiantong
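A sketch of the post-processing step that appears to be missing, assuming the sentencepiece convention that "_" marks a word start: replacing each "_" with a space and trimming recovers normal words from the lexicon-free output.

```python
# Sketch of the missing post-processing for lexicon-free output:
# with sentencepiece-style tokens, '_' marks a word boundary, so the
# concatenated hypothesis is turned back into words by replacing
# '_' with a space and stripping the leading one.

def pieces_to_words(hypothesis):
    """Convert e.g. '_let_me_just_ask' -> 'let me just ask'."""
    return hypothesis.replace("_", " ").strip()

print(pieces_to_words("_let_me_just_ask_you"))  # let me just ask you
```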