
Streaming convnets with lexicon free decoder?

Open abhinavkulkarni opened this issue 3 years ago • 13 comments

Question

Hi,

Is there a way to run the streaming convnets examples (Interactive/Streaming ASR examples) with a LexiconFreeDecoder? I do see a LexiconFreeDecoder class in the source code. In the DecoderFactory (link), if a lexicon trie is not supplied, a LexiconFreeDecoder is instantiated instead of a LexiconDecoder.

I tried supplying an empty path for the lexicon dictionary, and all I got were empty transcriptions. I was running the example with the AM and LM models provided on the inference wiki. The AM uses 10k SentencePiece tokens, and I am not sure whether the LM is word-level or token-level.

What am I missing here? Is the issue that the LM is word-level and hence OOV words cannot be decoded? Even if that is the case, words that are in the vocabulary should still have been transcribed (my sample audio has plenty of them).

Thanks!

abhinavkulkarni avatar Nov 14 '20 04:11 abhinavkulkarni

As far as I know, the LM is word-based, so to properly use lexicon-free decoding you need a word-piece (wp) LM; as it stands, you are applying a word-level LM to wp tokens, which will just return unk predictions. I would still expect some non-empty output though, right @xuqiantong @vineelpratap?

tlikhomanenko avatar Nov 14 '20 05:11 tlikhomanenko

to properly use lexicon-free decoding you need a word-piece (wp) LM

@tlikhomanenko: Is there an example of a recipe for creating a wp LM from the LibriSpeech corpus? In the lexicon-free recipe, I see that word- and char-level LMs are created, but not a 10k-token wp LM.

I suppose I can preprocess "$DATA_DST/text/librispeech-lm-norm.txt.lower.shuffle" to contain tokens instead of words and then train a 3- or 4-gram wp LM, but I was wondering if you distribute one as part of some recipe.
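Concretely, I am thinking of something like the following (just a sketch using the SentencePiece C++ API; the model and corpus file names are placeholders for the 10k sp model and the normalized LM corpus):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <sentencepiece_processor.h>

int main() {
  sentencepiece::SentencePieceProcessor sp;
  // Load the 10k SentencePiece model (placeholder file name).
  if (!sp.Load("librispeech_unigram_10000.model").ok()) {
    return 1;
  }
  std::ifstream in("librispeech-lm-norm.txt.lower.shuffle");
  std::string line;
  std::vector<std::string> pieces;
  while (std::getline(in, line)) {
    sp.Encode(line, &pieces); // words -> word-piece tokens
    for (size_t i = 0; i < pieces.size(); ++i) {
      std::cout << (i ? " " : "") << pieces[i];
    }
    std::cout << "\n";
  }
  return 0;
}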

Thanks!

abhinavkulkarni avatar Nov 17 '20 11:11 abhinavkulkarni

Please see the example here: https://github.com/facebookresearch/wav2letter/tree/master/recipes/sota/2019/lm.

tlikhomanenko avatar Nov 17 '20 22:11 tlikhomanenko

Thanks @tlikhomanenko.

I ran the prepare_wp_data.py script and I got tokenized train, dev-clean and dev-other files in the $W2LDIR/decoder directory:

w2luser@w2luser-MS-7A40:~$ ls $W2LDIR/decoder/ -1
lm_wp_10k.dev-clean
lm_wp_10k.dev-other
lm_wp_10k.train

I then fit a 6-gram LM on lm_wp_10k.train as follows:

KENLM=~/Projects/kenlm/build/bin
MODEL_DST=$W2LDIR

# --prune 0 1 removes all the 2-gram and higher order n-grams that occur only once in the corpus
"$KENLM/lmplz" trie -T /tmp --discount_fallback -o 6 --prune 0 1 < "$MODEL_DST/decoder/lm_wp_10k.train" > 6_gram_lm_wp_10k.train.arpa

# Quantize the LM model to reduce memory size
"$KENLM/build_binary" trie -a 22 -q 8 -b 8 6_gram_lm_wp_10k.train.arpa 6_gram_lm_wp_10k.train.bin.qt

I then used the 6_gram_lm_wp_10k.train.bin.qt model to run the interactive streaming ASR example (I hardcoded the lexicon file to an empty string here):

./cmake-build-debug/inference/inference/examples/interactive_streaming_asr_example --input_files_base_path /data/podcaster/model/wav2letter --language_model_file 6_gram_lm_wp_10k.train.bin.qt
Started features model file loading ... 
Completed features model file loading elapsed time=103 milliseconds

Started acoustic model file loading ... 
Completed acoustic model file loading elapsed time=791 milliseconds

Started tokens file loading ... 
Completed tokens file loading elapsed time=1810 microseconds

Tokens loaded - 9998 tokens
Started decoder options file loading ... 
Completed decoder options file loading elapsed time=91 microseconds

Started create decoder ... 
[Letters] 9998 tokens loaded.
Completed create decoder elapsed time=44237 microseconds

Entering interactive command line shell. enter '?' for help.
------------------------------------------------------------
$>input=/home/w2luser/audio/cnbc.wav
Transcribing file:/home/w2luser/audio/cnbc.wav to:stdout
Creating LexiconFreeDecoder instance.
#start (msec), end(msec), transcription
terminate called after throwing an instance of 'std::runtime_error'
  what():  [KenLM] Invalid user token index: 5

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

What am I missing?

Thanks!

abhinavkulkarni avatar Nov 18 '20 06:11 abhinavkulkarni

@tlikhomanenko: It looks to me like the lexicon-free decoder cannot be used with streaming convnets, because they use KenLM (link here) and KenLM currently won't work without a lexicon dict, since it requires a lexicon to score tokens (link here; usrToLmIdxMap_.size() is 0 in the absence of a lexicon).

Do you have any suggestions as to how to use KenLM without a lexicon?

Thanks!

abhinavkulkarni avatar Nov 18 '20 08:11 abhinavkulkarni

Could you try just putting, as the lexicon file, one that lists every token with itself as its spelling? For example, if your token set is {ab, cd, ef}, the lexicon file will be:

ab ab
cd cd
ef ef

In that case the wordMap for KenLM will be a mapping between the AM tokens and their strings for KenLM; I think this should work (maybe I am still missing something). cc @xuqiantong
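A quick way to generate such an identity lexicon from the tokens file (a sketch; it assumes tokens.txt has one token per line, and the file names are placeholders):

#include <fstream>
#include <string>

// Sketch: write an identity lexicon ("token<TAB>token") from the tokens file.
int main() {
  std::ifstream in("tokens.txt");          // one token per line
  std::ofstream out("lexicon-dummy.txt");  // token<TAB>token per line
  std::string token;
  while (std::getline(in, token)) {
    out << token << "\t" << token << "\n"; // the word and its "spelling"
  }
  return 0;
}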

tlikhomanenko avatar Nov 18 '20 10:11 tlikhomanenko

Better would be to construct the wordMap here, https://github.com/facebookresearch/wav2letter/blob/v0.2/inference/inference/decoder/Decoder.cpp#L61, using the token set. Or, if you provide the lexicon file as I proposed above, be sure to set the trie to none and use the lexicon-free decoder creation path (the current ifs force lexicon decoding whenever the lexicon is non-empty).
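Roughly what I mean by constructing the wordMap from the token set (a sketch only; I am assuming the w2l::Dictionary API and a tokens vector holding the word-piece strings already loaded for the AM, so check the exact names in your checkout):

// Sketch only, not the real code in Decoder.cpp: build the KenLM word map
// from the AM token set instead of the lexicon. Assumes w2l::Dictionary and
// a `tokens` vector of the 10k word-piece strings loaded for the AM.
w2l::Dictionary wordMap;
for (const auto& token : tokens) {
  wordMap.addEntry(token);
}
// Pass wordMap to the KenLM wrapper in place of the lexicon-derived map,
// so that usrToLmIdxMap_ has an entry for every AM token.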

tlikhomanenko avatar Nov 18 '20 11:11 tlikhomanenko

@tlikhomanenko: I created a dummy lexicon file as follows:

w2luser@w2luser-ThinkPad-X1-Carbon-6th:/data/podcaster/model/wav2letter$ head lexicon-dummy.txt 
_the	_the
_and	_and
_of	_of
_to	_to
_a	_a
s	s
_in	_in
_i	_i
_he	_he
_that	_that

and ran the example again:

Entering interactive command line shell. enter '?' for help.
------------------------------------------------------------
$>input=/home/w2luser/audio/cnbc.wav
Transcribing file:/home/w2luser/audio/cnbc.wav to:stdout
Creating LexiconDecoder instance.
#start (msec), end(msec), transcription
0,1000,
1000,2000,_let _me _just _ask 
2000,3000,_you _about _tu a 
3000,4000,_because _they _had _their _earn ing 
4000,5000,ing s _last _night 
5000,6000,_they _beat 
6000,7000,_a _you _know 
7000,8000,_that _the _bear 
8000,9000,_would _say _they _beat 
9000,10000,_because _they _sold _these 
10000,11000,_tax _credit s _and _how 
11000,12000,_how _long _are _these 
12000,13000,_te _credit _so 
13000,14000,
14000,15000,_er _going _to _be _you 
15000,16000,_know _a 
16000,17000,_business _model 

The transcription with the original lexicon is as follows:

Entering interactive command line shell. enter '?' for help.
------------------------------------------------------------
$>input=/home/w2luser/audio/cnbc.wav
Transcribing file:/home/w2luser/audio/cnbc.wav to:stdout
Creating LexiconDecoder instance.
#start (msec), end(msec), transcription
0,1000,
1000,2000,let me just ask 
2000,3000,you about tua 
3000,4000,because they had the 
4000,5000,earnings last night 
5000,6000,they beat 
6000,7000,you know 
7000,8000,the bears 
8000,9000,would say they beat 
9000,10000,because they sold these 
10000,11000,tax and how 
11000,12000,how long are these 
12000,13000,te credit 
13000,14000,
14000,15000,going to be you 
15000,16000,know a 
16000,17000,business model 

The problem is that if you provide a dummy lexicon file, it still uses the LexiconDecoder. How do you force it to use the LexiconFreeDecoder? I assume the LexiconFreeDecoder has logic built around identifying word boundaries and fusing tokens together into words.

Thanks.

abhinavkulkarni avatar Nov 18 '20 11:11 abhinavkulkarni

You need to comment out this if block too, https://github.com/facebookresearch/wav2letter/blob/v0.2/inference/inference/decoder/Decoder.cpp#L74, because right now it creates the trie, and here, https://github.com/facebookresearch/wav2letter/blob/v0.2/inference/inference/decoder/Decoder.cpp#L97, if the trie is non-none it will run lexicon-based decoding.
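To be concrete, the dispatch looks roughly like this (a paraphrased sketch with simplified stand-in names, not the real ones; the point is that any non-null trie selects lexicon decoding, so both guards have to go):

#include <memory>

struct Trie {};

// corresponds to the if block at Decoder.cpp#L74
std::unique_ptr<Trie> buildTrie(bool haveLexicon) {
  return haveLexicon ? std::make_unique<Trie>() : nullptr;
}

// corresponds to the check at Decoder.cpp#L97
const char* pickDecoder(const Trie* trie) {
  return trie ? "LexiconDecoder" : "LexiconFreeDecoder";
}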

tlikhomanenko avatar Nov 25 '20 03:11 tlikhomanenko

@tlikhomanenko: I am deliberately hardcoding the lexicon file path in DecoderFactory to an empty string, which causes DecoderFactory to create an empty trie_ and thus return an instance of LexiconFreeDecoder.

The problem is that both LexiconDecoder and LexiconFreeDecoder use KenLM, which requires a lexicon file (it stores the lexicon words and their index mappings as a dictionary).

How do I get around this?

Thanks!

abhinavkulkarni avatar Dec 02 '20 18:12 abhinavkulkarni

Earlier you said that you created the dummy lexicon. So first, please test with the hack where you provide this mapping to KenLM. I suggested two ways of doing this:

  1. use a lexicon where you put only the tokens, each mapped to itself, and force the trie to be empty in the code
  2. hack the code to construct the KenLM dict from the tokens

Is it still unclear what I mean?

cc @xuqiantong

tlikhomanenko avatar Dec 02 '20 18:12 tlikhomanenko

@tlikhomanenko: Thanks.

  1. use a lexicon where you put only the tokens, each mapped to itself, and force the trie to be empty in the code

For this, I created the following files:

/data/podcaster/model/wav2letter$ head -5 tokens.txt 
_the
_and
_of
_to
_a
/data/podcaster/model/wav2letter$ head -5 lexicon.txt.dummy
_the	_the
_and	_and
_of	_of
_to	_to
_a	_a

I simply duplicated the single column of tokens.txt to create this dummy lexicon.txt (except for the last line of tokens.txt, which contains the # token).

I had earlier created a WordPiece LM by following your instructions here. I also commented out the if block here and made sure, by placing a breakpoint, that the LexiconFreeDecoder is being used.

For a sample file, I get the following transcription:

~$ /home/w2luser/Projects/wav2letter/cmake-build-debug/inference/inference/examples/interactive_streaming_asr_example --input_files_base_path /data/podcaster/model/wav2letter --language_model_file 6_gram_lm_wp_10k.train.bin.qt --lexicon_file lexicon.txt.dummy

Entering interactive command line shell. enter '?' for help.
------------------------------------------------------------
$>input=/home/w2luser/audio/cnbc.wav
Transcribing file:/home/w2luser/audio/cnbc.wav to:stdout
Creating LexiconFreeDecoder instance.
#start (msec), end(msec), transcription
0,1000,
1000,2000,_let_me_just_ask_you 
2000,3000,_you_about_tu 
3000,4000,_because_they_had_their_earning 
4000,5000,ings_last_night 
5000,6000,_they_beat 
6000,7000,_you_know 
7000,8000,_bears 
8000,9000,_would_say_they_beat 
9000,10000,_because_they_sold_these 
10000,11000,_tax_and_help 
11000,12000,_how_long_are_these 
12000,13000,_te_credit 
13000,14000,
14000,15000,_going_be_you_know 
15000,16000,_know_business 
16000,17000,_business_model 
17000,18000,_what_do_you_say_you_bought 
18000,19000,_bought_you_you_bought 
19000,20000,_into_test_but_the 
20000,21000,_dead_long_time_ago 
21000,22000,
Process finished with exit code 1

The transcription with the if block restored and the proper lexicon file (downloaded from the wiki) is as follows:

~$ /home/w2luser/Projects/wav2letter/cmake-build-debug/inference/inference/examples/interactive_streaming_asr_example --input_files_base_path /data/podcaster/model/wav2letter

Entering interactive command line shell. enter '?' for help.
------------------------------------------------------------
$>input=/home/w2luser/audio/cnbc.wav
Transcribing file:/home/w2luser/audio/cnbc.wav to:stdout
Creating LexiconDecoder instance.
#start (msec), end(msec), transcription
0,1000,
1000,2000,let me just ask 
2000,3000,you about tua 
3000,4000,because they had the 
4000,5000,earnings last night 
5000,6000,they beat 
6000,7000,a you know 
7000,8000,that the bears 
8000,9000,would say they beat 
9000,10000,because they sold these 
10000,11000,tax credits and how 
11000,12000,how long are these 
12000,13000,te credit so 
13000,14000,
14000,15000,er going to be you 
15000,16000,know a 
16000,17000,business model 
17000,18000,what do you say you 
18000,19000,bought you you bought 
19000,20000,into test but to the 
20000,21000,dead a long time ago 
21000,22000,

Process finished with exit code 15

What am I missing?

Thanks!

abhinavkulkarni avatar Dec 03 '20 10:12 abhinavkulkarni

It seems the post-processing of the transcription is not fully correct with lexicon-free decoding (you still have _ and unmerged tokens). cc @xuqiantong
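Meanwhile, the kind of merging that is needed looks roughly like this (a sketch, assuming the _ prefix marks a word start, as in your output above):

#include <string>
#include <vector>

// Sketch: merge word-piece decoder output into words, assuming the "_"
// prefix marks the start of a new word (as in the transcriptions above).
std::string mergeWordPieces(const std::vector<std::string>& pieces) {
  std::string joined;
  for (const auto& p : pieces) {
    joined += p; // e.g. "_te" + "_credit" -> "_te_credit"
  }
  std::string merged;
  for (char c : joined) {
    merged += (c == '_') ? ' ' : c; // boundary marker -> space
  }
  if (!merged.empty() && merged.front() == ' ') {
    merged.erase(merged.begin()); // drop the space from the leading "_"
  }
  return merged;
}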

tlikhomanenko avatar Dec 04 '20 10:12 tlikhomanenko