
Any example code using the new pretrained models?

Open AndroYD84 opened this issue 5 years ago • 14 comments

I built wav2letter from source and could run the inference tutorial without any trouble, but that still uses an older model. What if I want to get a transcription using the newest pretrained models?

Could you show us some examples?

I couldn't get them to work by following the Beam Search Decoder tutorial. There are so many options that it's hard to narrow down all possible causes; however, if I'm shown an example that I know is supposed to work, it's much easier to find the problem.

AndroYD84 avatar Jan 20 '20 05:01 AndroYD84

Hi,

The inference tutorial instructions point you to an S3 bucket with a model. This model is new and shiny.

What are you trying to do? What errors do you get? Please be more specific. For specific debugging issues, please add a reproduction procedure and logs.

avidov avatar Jan 20 '20 23:01 avidov

@avidov which model is this exactly? From the naming of the files, I guess it is the same as the streaming_convnet folder?

tdeboissiere avatar Jan 21 '20 13:01 tdeboissiere

The inference tutorial instructions point you to an S3 bucket with a model. This model is new and shiny.

Is this model as accurate as your new SOTA 2019 model? I jumped to the conclusion that I was still using an older model because I got better results with ESPnet and their Librispeech pretrained model than with WAV2LETTER and the provided model (although yours is much faster); based on the paper results, yours is (theoretically) supposed to be more accurate.

What are you trying to do?

I'm trying to transcribe some audio files, and I want the generated text to be as close as possible to the ground truth. Here are some results with my custom dataset for comparison:

tspeak0001.wav
ESPNET= THE UNLIMITED BE IN A MAJOR WAR WITH I AM EAGER TO WORK WITH YOU ON LEDGE
GROUND TRUTH = THE UNLIMITED BE IN A MAJOR WAR WITH I AM EAGER TO WORK WITH YOU ON LEDGE
WAV2LETTER= on a major war with my eager to work with you

tspeak0002.wav
ESPNET= NAPHTHA OR WHETHER IT WORLD TRADE ORGANIZATION WHICH CREATE
GROUND TRUTH = BREACH NAPHTHA OR WHETHER IT WORLD TRADE ORGANIZATION WHICH CREATE
WAV2LETTER= it's not or whether it was trade organization would create

tspeak0003.wav
ESPNET= WHEN I HEARD ABOUT CHARLOTTE'S BILL WE'RE CLOSELY FOLLOWING THE STRONGEST WAY
GROUND TRUTH= WHEN I HEARD ABOUT CHARLOTTE'S BILL WE'RE CLOSELY FOLLOWING THE STRONGEST POSSIBLE
WAV2LETTER= when i heard about all for what we the strongest possible

tspeak0004.wav
ESPNET= AND THE BAD PEOPLE IN PEOPLE GIVING A PLATFORM TO THESE
GROUND TRUTH= COULDN'T SEE AND THE BAD PEOPLE ONLY PEOPLE GIVING A PLATFORM TO THESE
WAV2LETTER= a bad people a black to

tspeak0005.wav
ESPNET= COMMITTED TO PASS IN THE HISTORY OF OUR COUNTRY IF THEY DO REMEMBER THEY ARE
GROUND TRUTH= COMMITTED TO PASS BUT IN THE HISTORY OF OUR COUNTRY IF THEY DO REMEMBER THEY ARE
WAV2LETTER= committed the path in the history of our country if they do remember they are

tspeak0006.wav
ESPNET= HISTORIC CHARACTERS FOR ALL AMERICANS TO DEFEND AMERICA
GROUND TRUTH= HISTORIC CHARACTERS FOR ALL AMERICANS TO DEFEND AMERICA
WAV2LETTER= the news for all emerged to

tspeak0007.wav
ESPNET= IN WORKERS AND AMERICAN FAMILIES THEY'VE ALLOWED MILLIONS OF LOW WAGE WORKERS
GROUND TRUTH= IN WORKERS AND AMERICAN FAMILIES THEY'VE ALLOWED MILLIONS OF LOW WAGE WORKERS
WAV2LETTER= in war proof of american tram they have allowed me to blow wage workers

tspeak0008.wav
ESPNET= TOYOTA AND MARSA ARE OPENING UP A PLANT IN ALI
GROUND TRUTH= TOYOTA AND MAZDA ARE OPENING UP A PLANT IN ALI
WAV2LETTER= for a month all pretty girl a plan in

tspeak0009.wav
ESPNET= BRYAN SAID HE FELT GOD'S CALM WE ARE ALSO WE SAW OURS
GROUND TRUTH= BUT BRYAN SAID HE FELT GOD'S CALM WE ARE ALSO WE SAW OURS
WAV2LETTER= for god we are all restored

tspeak0010.wav
ESPNET= DURING THE HOUR OF THE NIGHT I CALL ON SAUNDERS TO EMPOWER THE ENDING
GROUND TRUTH= DURING HOUR TONIGHT I CALL ON CONGRESS TO EMPOWER THE ENDING GAIN
WAV2LETTER= our night at all on to to go

What errors do you get?

When I run this command:

luca@luca-ubnt:~/Projects/wav2letter2/build$ ./Decoder --flagsfile /home/luca/Projects/wav2letter2/recipes/models/sota/2019/librispeech/decode_tds_s2s_ngram_other.cfg --minloglevel=0 --logtostderr=1

I get this error:

I0121 17:32:55.934908  9615 Decode.cpp:57] Reading flags from file /home/luca/Projects/wav2letter2/recipes/models/sota/2019/librispeech/decode_tds_s2s_ngram_other.cfg
I0121 17:32:55.935243  9615 Decode.cpp:85] [Network] Reading acoustic model from /home/luca/Projects/wav2letter2/modelsota2019/am_tds_s2s_librispeech_dev_other.bin
I0121 17:32:58.848897  9615 Decode.cpp:89] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> output]
	(0): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
	(1): View (-1 80 1 0)
	(2): Conv2D (1->10, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
	(3): ReLU
	(4): Dropout (0.000000)
	(5): LayerNorm ( axis : { 0 1 2 } , size : -1)
	(6): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
	(7): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
	(8): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
	(9): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
	(10): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
	(11): Conv2D (10->14, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
	(12): ReLU
	(13): Dropout (0.000000)
	(14): LayerNorm ( axis : { 0 1 2 } , size : -1)
	(15): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
	(16): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
	(17): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
	(18): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
	(19): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
	(20): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
	(21): Conv2D (14->18, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
	(22): ReLU
	(23): Dropout (0.000000)
	(24): LayerNorm ( axis : { 0 1 2 } , size : -1)
	(25): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
	(26): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
	(27): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
	(28): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
	(29): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
	(30): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
	(31): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
	(32): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
	(33): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
	(34): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
	(35): View (0 1440 1 0)
	(36): Reorder (1,0,3,2)
	(37): Linear (1440->1024) (with bias)
I0121 17:32:58.848953  9615 Decode.cpp:92] [Criterion] Seq2SeqCriterion
I0121 17:32:58.848970  9615 Decode.cpp:94] [Network] Number of params: 190462588
I0121 17:32:58.848974  9615 Decode.cpp:100] [Network] Updating flags from config file: /home/luca/Projects/wav2letter2/modelsota2019/am_tds_s2s_librispeech_dev_other.bin
I0121 17:32:58.849952  9615 Decode.cpp:116] Gflags after parsing 
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/home/luca/Projects/wav2letter2/modelsota2019/am_tds_s2s_librispeech_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=tds_do0.15_l5.6.10_mid3.0.arch1; --archdir=/private/home/qiantong/push_numbers/200M/do0.15_l5.6.10_mid3.0_incDO; --attention=keyvalue; --attentionthreshold=30; --attnWindow=softPretrain; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=4; --beamsize=50; --beamsizetoken=10; --beamthreshold=10; --blobdata=false; --channels=1; --criterion=seq2seq; --critoptim=sgd; --datadir=/home/luca/Projects/wav2letter2/input; --dataorder=output_spiral; --decoderattnround=2; --decoderdropout=0.10000000000000001; --decoderrnnlayer=3; --decodertype=tkn; --devwin=0; --emission_dir=; --enable_distributed=true; --encoderdim=512; --eosscore=-3.8305165336383; --eostoken=true; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/home/luca/Projects/wav2letter2/recipes/models/sota/2019/librispeech/decode_tds_s2s_ngram_other.cfg; --framesizems=30; --framestridems=10; --gamma=0.5; --gumbeltemperature=1; --input=flac; --inputbinsize=25; --inputfeeding=false; --iter=600; --itersave=false; --labelsmooth=0.050000000000000003; --leftWindowSize=50; --lexicon=/home/luca/Projects/wav2letter2/modelsota2019/decoder-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/home/luca/Projects/wav2letter2/modelsota2019/lm_librispeech_kenlm_wp_10k_6gram_pruning_000012.bin; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=1.1583913669221; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.059999999999999998; --lrcosine=false; --lrcrit=0.059999999999999998; --maxdecoderoutputlen=120; --maxgradnorm=15; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=8338608; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0.5; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=1; --numattnhead=8; --onorm=none; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=99; --pcttraineval=1; --pow=false; --pretrainWindow=3; --replabel=0; --reportiters=0; --rightWindowSize=50; --rndv_filepath=/checkpoint/qiantong/ls_200M/do0.15_l5.6.10_mid3.0_incDO/4_rndv; --rundir=/checkpoint/qiantong/ls_200M; --runname=tds_do0.15_l5.6.10_mid3.0_incDO; --samplerate=16000; --sampletarget=0.01; --samplingstrategy=rand; --sclite=/home/luca/Projects/wav2letter2/output; --seed=2; --show=true; --showletters=true; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=4; --sqnorm=false; --stepsize=150; --surround=; --tag=; --target=ltr; --test=audiolist.txt; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/home/luca/Projects/wav2letter2/modelsota2019; --train=train-clean-100.lst,train-clean-360.lst,train-other-500.lst; --trainWithWindow=true; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=false; --usewordpiece=true; --valid=librispeech/dev-other:dev-other.lst,librispeech/dev-clean:dev-clean.lst; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; 
--world_size=32; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=; 
I0121 17:32:58.853590  9615 Decode.cpp:137] Number of classes (network): 9998
I0121 17:33:00.242255  9615 Decode.cpp:144] Number of words: 200001
F0121 17:33:00.649824  9615 W2lListFilesDataset.cpp:116] Cannot parse /home/luca/Projects/wav2letter2/input/tspeak0001.wav
*** Check failure stack trace: ***
    @     0x7f12fd1970cd  google::LogMessage::Fail()
    @     0x7f12fd198f33  google::LogMessage::SendToLog()
    @     0x7f12fd196c28  google::LogMessage::Flush()
    @     0x7f12fd199999  google::LogMessageFatal::~LogMessageFatal()
    @     0x5629b97175fb  w2l::W2lListFilesDataset::loadListFile()
    @     0x5629b97183b9  w2l::W2lListFilesDataset::W2lListFilesDataset()
    @     0x5629b97371ee  w2l::createDataset()
    @     0x5629b953c55b  main
    @     0x7f12fc1eeb97  __libc_start_main
    @     0x5629b95985da  _start
Aborted (core dumped)

This is the content of my "decode_tds_s2s_ngram_other.cfg" file:

# Replace `[...]`, `[DATA_DST]`, `[MODEL_DST]` with appropriate paths
# for test-other (best params for dev-other)
--am=/home/luca/Projects/wav2letter2/modelsota2019/am_tds_s2s_librispeech_dev_other.bin
--tokensdir=/home/luca/Projects/wav2letter2/modelsota2019
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/home/luca/Projects/wav2letter2/modelsota2019/decoder-unigram-10000-nbest10.lexicon
--lm=/home/luca/Projects/wav2letter2/modelsota2019/lm_librispeech_kenlm_wp_10k_6gram_pruning_000012.bin
--datadir=/home/luca/Projects/wav2letter2/input
--test=audiolist.txt
--uselexicon=false
--sclite=/home/luca/Projects/wav2letter2/output
--decodertype=tkn
--lmtype=kenlm
--beamsize=50
--beamsizetoken=10
--beamthreshold=10
--attentionthreshold=30
--smoothingtemperature=1
--nthread_decoder=1
--show
--showletters
--lmweight=1.1583913669221
--eosscore=-3.8305165336383

I didn't follow the suggested folder structure, but I guess that shouldn't be a problem as long as the file paths point to the correct files.

AndroYD84 avatar Jan 21 '20 14:01 AndroYD84

@tdeboissiere

@avidov which model is this exactly? From the naming of the files, I guess it is the same as the streaming_convnet folder?

The acoustic and language models are in the S3 bucket s3://dl.fbaipublicfiles.com/wav2letter/inference/examples/model/, named acoustic_model.bin and language_model.bin.
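If it helps, and assuming the bucket is publicly readable through the dl.fbaipublicfiles.com HTTPS endpoint (the hostname in the S3 path above), the files can be fetched directly, for example:

wget https://dl.fbaipublicfiles.com/wav2letter/inference/examples/model/acoustic_model.bin
wget https://dl.fbaipublicfiles.com/wav2letter/inference/examples/model/language_model.bin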

@AndroYD84 Thank you for the detailed post. I'll look more closely into it and come back to you later.

Out of interest, which ESPnet module and tutorial did you use?

avidov avatar Jan 21 '20 16:01 avidov

@avidov Sorry, my question was perhaps unclear.

Like @AndroYD84, I was wondering from which paper (SOTA 2019, Streaming convnet) and with what architecture (Transformer, TDS, ResNet) this model was obtained?

tdeboissiere avatar Jan 21 '20 17:01 tdeboissiere

Hi @AndroYD84,

Is this model as accurate as your new SOTA 2019 model? I jumped to the conclusion that I was still using an older model because I got better results with ESPnet and their Librispeech pretrained model than with WAV2LETTER and the provided model (although yours is much faster); based on the paper results, yours is (theoretically) supposed to be more accurate.

First of all, if we are talking about the best model from SOTA 2019 trained only on Librispeech, it is the Transformer Seq2Seq model, which should be decoded with an ngram/GCNN LM and then rescored with a transformer language model. The best models we have with additional use of unsupervised data are the Transformer Seq2Seq and TDS Seq2Seq models (trained on Librispeech + Librivox) with ngram LM decoding (which is the current SOTA on Librispeech). You should use these models from the links here.

About inference: the best model from the inference paper is the same as in the inference tutorial. This model is TDS CTC (which is not the best from the SOTA paper). You can compare it with the TDS CTC model from the SOTA paper; the latter is a bit better than the inference model because decoding uses a 4-gram LM instead of the 3-gram used in the inference paper.

When I run this command:

luca@luca-ubnt:~/Projects/wav2letter2/build$ ./Decoder --flagsfile /home/luca/Projects/wav2letter2/recipes/models/sota/2019/librispeech/decode_tds_s2s_ngram_other.cfg --minloglevel=0 --logtostderr=1

I get this error:

Here the file format for audiolist.txt should be

[utterance id] [audio file (full path)] [audio length] [word transcripts]

where the word transcription can be empty if you don't have a gold transcription for an audio file.
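For example, a list file like the following would be valid (the paths and durations here are only illustrative; the audio length is typically given in milliseconds), including the last line, which has no transcription:

tspeak0001 /home/luca/Projects/wav2letter2/input/tspeak0001.wav 5200.0 the unlimited be in a major war with i am eager to work with you on ledge
tspeak0002 /home/luca/Projects/wav2letter2/input/tspeak0002.wav 4800.0 breach naphtha or whether it world trade organization which create
tspeak0003 /home/luca/Projects/wav2letter2/input/tspeak0003.wav 4500.0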

Could you send details on how you got these transcriptions for your audio: which w2l model you used, how you ran inference (with beam-search decoding or not, what lexicon you used), and what exactly you used from ESPnet? Could you also send one of your audio files so we can debug a bit why it is so much worse than the ESPnet model?

tlikhomanenko avatar Jan 21 '20 22:01 tlikhomanenko

@avidov

Out of interest, which ESPnet module and tutorial did you use?

After building ESPnet, I ran the ASR Demo example on my WAV audio files. I used their pretrained "librispeech.transformer.v1" model (joint CTC-attention Transformer trained on Librispeech) as shown in the demo.

It works pretty much like this. First I go to this folder:

cd espnet/egs/librispeech/asr1

Then run:

../../../utils/recog_wav.sh --models librispeech.transformer.v1 /path/to/audio.wav

The model should be downloaded automatically into the "espnet/egs/librispeech/asr1/decode/download/librispeech.transformer.v1" folder after running that command (if my memory is correct).

@tlikhomanenko

Could you send details on how you got these transcriptions for your audio: which w2l model you used, how you ran inference (with beam-search decoding or not, what lexicon you used), and what exactly you used from ESPnet?

To obtain the ESPnet transcriptions, I did as in my reply to avidov above. To obtain the WAV2LETTER transcriptions, I did all the steps exactly as shown in the Inference Framework tutorial using the AWS S3 trained model, then got my transcriptions by running the Simple Streaming ASR Example on my own audio files (i.e. inference/inference/examples/simple_streaming_asr_example --input_files_base_path /path/to/aws/s3/model --input_audio_file /path/to/my/audio.wav). To obtain the ground truth transcriptions, I first ran the audio files through the Google Cloud Speech-to-Text service, then manually cleaned the results of all errors by listening to them one by one.

Could you also send one of your audio files so we can debug a bit why it is so much worse than the ESPnet model?

I have shared the files I used for my test here: https://github.com/AndroYD84/Files/blob/master/tspeech_test.zip If needed, I can share the full dataset privately, which is 4750 audio files (about 2 hours and 58 minutes in total) with their transcriptions.

AndroYD84 avatar Jan 22 '20 06:01 AndroYD84

@AndroYD84,

To compare with ESPnet (if you don't care about speed for now), you should use the Transformer S2S model trained on Librivox, either without any beam-search decoder or with the ngram LM (convlm will be released a bit later).

The model from the inference tutorial is TDS CTC (which is much faster than the other models). We are working on improving its quality to be closer to the Transformer S2S models.

I have shared the files I used for my test here: https://github.com/AndroYD84/Files/blob/master/tspeech_test.zip If needed, I can share the full dataset privately, which is 4750 audio files (about 2 hours and 58 minutes in total) with their transcriptions.

Thanks for sharing! For now this is sufficient; let us do some checks on the inference model. However, could you also test the transformer S2S model I pointed to above to compare with ESPnet?

tlikhomanenko avatar Jan 22 '20 23:01 tlikhomanenko

@tlikhomanenko

However, could you also test the transformer S2S model I pointed to above to compare with ESPnet?

Here are the results using "decode_transformer_s2s_ngram_other.cfg":

tspeak0001.wav
ESPNET= THE UNLIMITED BE IN A MAJOR WAR WITH I AM EAGER TO WORK WITH YOU ON LEDGE
GROUND TRUTH = THE UNLIMITED BE IN A MAJOR WAR WITH I AM EAGER TO WORK WITH YOU ON LEDGE
WAV2LETTER= the unlimited me in a major war with which i am eager to work with you on a legend

tspeak0002.wav
ESPNET= NAPHTHA OR WHETHER IT WORLD TRADE ORGANIZATION WHICH CREATE
GROUND TRUTH = BREACH NAPHTHA OR WHETHER IT WORLD TRADE ORGANIZATION WHICH CREATE
WAV2LETTER= its laughter or whether it world trade organization would create

tspeak0003.wav
ESPNET= WHEN I HEARD ABOUT CHARLOTTE'S BILL WE'RE CLOSELY FOLLOWING THE STRONGEST WAY
GROUND TRUTH= WHEN I HEARD ABOUT CHARLOTTE'S BILL WE'RE CLOSELY FOLLOWING THE STRONGEST POSSIBLE
WAV2LETTER= when i heard about charlottesville were closely following the strongest possible

tspeak0004.wav
ESPNET= AND THE BAD PEOPLE IN PEOPLE GIVING A PLATFORM TO THESE
GROUND TRUTH= COULDN'T SEE AND THE BAD PEOPLE ONLY PEOPLE GIVING A PLATFORM TO THESE
WAV2LETTER= and the sick and the bad people giving a bad form to the

tspeak0005.wav
ESPNET= COMMITTED TO PASS IN THE HISTORY OF OUR COUNTRY IF THEY DO REMEMBER THEY ARE
GROUND TRUTH= COMMITTED TO PASS BUT IN THE HISTORY OF OUR COUNTRY IF THEY DO REMEMBER THEY ARE
WAV2LETTER= are committed to part in the history of our country if they do remember they are

tspeak0006.wav
ESPNET= HISTORIC CHARACTERS FOR ALL AMERICANS TO DEFEND AMERICA
GROUND TRUTH= HISTORIC CHARACTERS FOR ALL AMERICANS TO DEFEND AMERICA
WAV2LETTER= the four great things for all a man to defend a mark

tspeak0007.wav
ESPNET= IN WORKERS AND AMERICAN FAMILIES THEY'VE ALLOWED MILLIONS OF LOW WAGE WORKERS
GROUND TRUTH= IN WORKERS AND AMERICAN FAMILIES THEY'VE ALLOWED MILLIONS OF LOW WAGE WORKERS
WAV2LETTER= and workers and american families they have allowed millions of low wages

tspeak0008.wav
ESPNET= TOYOTA AND MARSA ARE OPENING UP A PLANT IN ALI
GROUND TRUTH= TOYOTA AND MAZDA ARE OPENING UP A PLANT IN ALI
WAV2LETTER= boor and mozart are opening up a plant in our

tspeak0009.wav
ESPNET= BRYAN SAID HE FELT GOD'S CALM WE ARE ALSO WE SAW OURS
GROUND TRUTH= BUT BRYAN SAID HE FELT GOD'S CALM WE ARE ALSO WE SAW OURS
WAV2LETTER= bryan said he felt god gone we are also restoring ours

tspeak0010.wav
ESPNET= DURING THE HOUR OF THE NIGHT I CALL ON SAUNDERS TO EMPOWER THE ENDING
GROUND TRUTH= DURING HOUR TONIGHT I CALL ON CONGRESS TO EMPOWER THE ENDING GAIN
WAV2LETTER= the hour of the night i call on thousands to a power ending game

By the way, I forgot to save the results, but I think they were likely better with "decode_tds_s2s_ngram_other.cfg". I have to go in 5 minutes so I can't retest now, sorry.

I also confirm that I have solved my previous problem: it was only the "audiolist.txt" file that had incorrect formatting. I corrected it as you suggested and it works now: [utterance id] [audio file (full path)] [audio length] [word transcripts]

AndroYD84 avatar Jan 23 '20 09:01 AndroYD84

@AndroYD84, could you send the final WER on your dataset for the ESPnet, transformer and tds-inference models for comparison?
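(For reference, WER here is the standard word error rate: (substitutions + deletions + insertions) / number of reference words. For example, a 10-word reference with 2 substitutions, 1 deletion and no insertions gives WER = (2 + 1 + 0) / 10 = 30%. Any standard edit-distance scorer over the hypothesis/reference pairs, for instance on the files written out when --sclite is set, can be used to compute it.)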

Also, I noticed that some words are missing from our lexicon while ESPnet infers them correctly, so I wonder what lexicon and decoding setup they are using?

tlikhomanenko avatar Jan 23 '20 20:01 tlikhomanenko

@tlikhomanenko Sorry, but I'm on holiday now. I can access my PC remotely but can't do much, as it's currently using 100% of my GPU and CPU on another task; however, I can send the dataset privately if needed.

AndroYD84 avatar Jan 27 '20 16:01 AndroYD84

Is it also possible to use the pre-trained models from the recipes with simple_streaming_asr_example, instead of the model in the S3 bucket mentioned here?

realbaker1967 avatar Apr 06 '20 15:04 realbaker1967

Is it also possible to use the pre-trained models from the recipes with simple_streaming_asr_example, instead of the model in the S3 bucket mentioned here?

@avidov @vineelpratap?

tlikhomanenko avatar Apr 10 '20 20:04 tlikhomanenko

Is it also possible to use the pre-trained models from the recipes with simple_streaming_asr_example, instead of the model in the S3 bucket mentioned here?

Yes, looking for this as well.

@AndroYD84: Did you figure out a way to use the streaming/interactive decoder with the SOTA 2019 models? If so, can you please share the config file?

Thanks.

abhinavkulkarni avatar May 08 '20 20:05 abhinavkulkarni