wav2letter
Any example code using the new pretrained models?
I built wav2letter from source and could run the inference tutorial without any trouble, but that still uses an older model. What if I want to get a transcription using the newest pretrained models?
Could you show us some examples?
I couldn't get them to work by following the Beam Search Decoder tutorial. There are so many options that it's hard to narrow down all the possible causes; however, if I'm shown an example that I know is supposed to work, it's much easier to find the problem.
Hi,
The inference tutorial instructions point you to an S3 bucket with a model. This model is new and shiny.
What are you trying to do? What errors do you get? Please be more specific. For specific debugging issues, please add a reproduction procedure and logs.
@avidov which model is this exactly? From the naming of the files, I guess it is the same as the streaming_convnet folder?
The inference tutorial instructions point you to an S3 bucket with a model. This model is new and shiny.
Is this model as accurate as your new SOTA 2019 model? I jumped to the conclusion that I was still using an older model because I got better results with ESPnet and their Librispeech pretrained model than with WAV2LETTER and the provided model (although yours is much faster); based on the paper results, yours is (theoretically) supposed to be more accurate.
What are you trying to do?
I'm trying to transcribe some audio files, and I want the generated text to be as close as possible to the ground truth. Here are some results with my custom dataset for comparison:
tspeak0001.wav
ESPNET= THE UNLIMITED BE IN A MAJOR WAR WITH I AM EAGER TO WORK WITH YOU ON LEDGE
GROUND TRUTH= THE UNLIMITED BE IN A MAJOR WAR WITH I AM EAGER TO WORK WITH YOU ON LEDGE
WAV2LETTER= on a major war with my eager to work with you
tspeak0002.wav
ESPNET= NAPHTHA OR WHETHER IT WORLD TRADE ORGANIZATION WHICH CREATE
GROUND TRUTH= BREACH NAPHTHA OR WHETHER IT WORLD TRADE ORGANIZATION WHICH CREATE
WAV2LETTER= it's not or whether it was trade organization would create
tspeak0003.wav
ESPNET= WHEN I HEARD ABOUT CHARLOTTE'S BILL WE'RE CLOSELY FOLLOWING THE STRONGEST WAY
GROUND TRUTH= WHEN I HEARD ABOUT CHARLOTTE'S BILL WE'RE CLOSELY FOLLOWING THE STRONGEST POSSIBLE
WAV2LETTER= when i heard about all for what we the strongest possible
tspeak0004.wav
ESPNET= AND THE BAD PEOPLE IN PEOPLE GIVING A PLATFORM TO THESE
GROUND TRUTH= COULDN'T SEE AND THE BAD PEOPLE ONLY PEOPLE GIVING A PLATFORM TO THESE
WAV2LETTER= a bad people a black to
tspeak0005.wav
ESPNET= COMMITTED TO PASS IN THE HISTORY OF OUR COUNTRY IF THEY DO REMEMBER THEY ARE
GROUND TRUTH= COMMITTED TO PASS BUT IN THE HISTORY OF OUR COUNTRY IF THEY DO REMEMBER THEY ARE
WAV2LETTER= committed the path in the history of our country if they do remember they are
tspeak0006.wav
ESPNET= HISTORIC CHARACTERS FOR ALL AMERICANS TO DEFEND AMERICA
GROUND TRUTH= HISTORIC CHARACTERS FOR ALL AMERICANS TO DEFEND AMERICA
WAV2LETTER= the news for all emerged to
tspeak0007.wav
ESPNET= IN WORKERS AND AMERICAN FAMILIES THEY'VE ALLOWED MILLIONS OF LOW WAGE WORKERS
GROUND TRUTH= IN WORKERS AND AMERICAN FAMILIES THEY'VE ALLOWED MILLIONS OF LOW WAGE WORKERS
WAV2LETTER= in war proof of american tram they have allowed me to blow wage workers
tspeak0008.wav
ESPNET= TOYOTA AND MARSA ARE OPENING UP A PLANT IN ALI
GROUND TRUTH= TOYOTA AND MAZDA ARE OPENING UP A PLANT IN ALI
WAV2LETTER= for a month all pretty girl a plan in
tspeak0009.wav
ESPNET= BRYAN SAID HE FELT GOD'S CALM WE ARE ALSO WE SAW OURS
GROUND TRUTH= BUT BRYAN SAID HE FELT GOD'S CALM WE ARE ALSO WE SAW OURS
WAV2LETTER= for god we are all restored
tspeak0010.wav
ESPNET= DURING THE HOUR OF THE NIGHT I CALL ON SAUNDERS TO EMPOWER THE ENDING
GROUND TRUTH= DURING HOUR TONIGHT I CALL ON CONGRESS TO EMPOWER THE ENDING GAIN
WAV2LETTER= our night at all on to to go
What errors do you get?
When I run this command:
luca@luca-ubnt:~/Projects/wav2letter2/build$ ./Decoder --flagsfile /home/luca/Projects/wav2letter2/recipes/models/sota/2019/librispeech/decode_tds_s2s_ngram_other.cfg --minloglevel=0 --logtostderr=1
I get this error:
I0121 17:32:55.934908 9615 Decode.cpp:57] Reading flags from file /home/luca/Projects/wav2letter2/recipes/models/sota/2019/librispeech/decode_tds_s2s_ngram_other.cfg
I0121 17:32:55.935243 9615 Decode.cpp:85] [Network] Reading acoustic model from /home/luca/Projects/wav2letter2/modelsota2019/am_tds_s2s_librispeech_dev_other.bin
I0121 17:32:58.848897 9615 Decode.cpp:89] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> output]
(0): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
(1): View (-1 80 1 0)
(2): Conv2D (1->10, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(3): ReLU
(4): Dropout (0.000000)
(5): LayerNorm ( axis : { 0 1 2 } , size : -1)
(6): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(7): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(8): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(9): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(10): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(11): Conv2D (10->14, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(12): ReLU
(13): Dropout (0.000000)
(14): LayerNorm ( axis : { 0 1 2 } , size : -1)
(15): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(16): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(17): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(18): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(19): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(20): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(21): Conv2D (14->18, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(22): ReLU
(23): Dropout (0.000000)
(24): LayerNorm ( axis : { 0 1 2 } , size : -1)
(25): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(26): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(27): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(28): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(29): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(30): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(31): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(32): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(33): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(34): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(35): View (0 1440 1 0)
(36): Reorder (1,0,3,2)
(37): Linear (1440->1024) (with bias)
I0121 17:32:58.848953 9615 Decode.cpp:92] [Criterion] Seq2SeqCriterion
I0121 17:32:58.848970 9615 Decode.cpp:94] [Network] Number of params: 190462588
I0121 17:32:58.848974 9615 Decode.cpp:100] [Network] Updating flags from config file: /home/luca/Projects/wav2letter2/modelsota2019/am_tds_s2s_librispeech_dev_other.bin
I0121 17:32:58.849952 9615 Decode.cpp:116] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/home/luca/Projects/wav2letter2/modelsota2019/am_tds_s2s_librispeech_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=tds_do0.15_l5.6.10_mid3.0.arch1; --archdir=/private/home/qiantong/push_numbers/200M/do0.15_l5.6.10_mid3.0_incDO; --attention=keyvalue; --attentionthreshold=30; --attnWindow=softPretrain; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=4; --beamsize=50; --beamsizetoken=10; --beamthreshold=10; --blobdata=false; --channels=1; --criterion=seq2seq; --critoptim=sgd; --datadir=/home/luca/Projects/wav2letter2/input; --dataorder=output_spiral; --decoderattnround=2; --decoderdropout=0.10000000000000001; --decoderrnnlayer=3; --decodertype=tkn; --devwin=0; --emission_dir=; --enable_distributed=true; --encoderdim=512; --eosscore=-3.8305165336383; --eostoken=true; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/home/luca/Projects/wav2letter2/recipes/models/sota/2019/librispeech/decode_tds_s2s_ngram_other.cfg; --framesizems=30; --framestridems=10; --gamma=0.5; --gumbeltemperature=1; --input=flac; --inputbinsize=25; --inputfeeding=false; --iter=600; --itersave=false; --labelsmooth=0.050000000000000003; --leftWindowSize=50; --lexicon=/home/luca/Projects/wav2letter2/modelsota2019/decoder-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/home/luca/Projects/wav2letter2/modelsota2019/lm_librispeech_kenlm_wp_10k_6gram_pruning_000012.bin; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=1.1583913669221; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.059999999999999998; --lrcosine=false; --lrcrit=0.059999999999999998; --maxdecoderoutputlen=120; --maxgradnorm=15; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=8338608; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0.5; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=1; --numattnhead=8; --onorm=none; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=99; --pcttraineval=1; --pow=false; --pretrainWindow=3; --replabel=0; --reportiters=0; --rightWindowSize=50; --rndv_filepath=/checkpoint/qiantong/ls_200M/do0.15_l5.6.10_mid3.0_incDO/4_rndv; --rundir=/checkpoint/qiantong/ls_200M; --runname=tds_do0.15_l5.6.10_mid3.0_incDO; --samplerate=16000; --sampletarget=0.01; --samplingstrategy=rand; --sclite=/home/luca/Projects/wav2letter2/output; --seed=2; --show=true; --showletters=true; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=4; --sqnorm=false; --stepsize=150; --surround=; --tag=; --target=ltr; --test=audiolist.txt; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/home/luca/Projects/wav2letter2/modelsota2019; --train=train-clean-100.lst,train-clean-360.lst,train-other-500.lst; --trainWithWindow=true; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=false; --usewordpiece=true; --valid=librispeech/dev-other:dev-other.lst,librispeech/dev-clean:dev-clean.lst; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; 
--world_size=32; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0121 17:32:58.853590 9615 Decode.cpp:137] Number of classes (network): 9998
I0121 17:33:00.242255 9615 Decode.cpp:144] Number of words: 200001
F0121 17:33:00.649824 9615 W2lListFilesDataset.cpp:116] Cannot parse /home/luca/Projects/wav2letter2/input/tspeak0001.wav
*** Check failure stack trace: ***
@ 0x7f12fd1970cd google::LogMessage::Fail()
@ 0x7f12fd198f33 google::LogMessage::SendToLog()
@ 0x7f12fd196c28 google::LogMessage::Flush()
@ 0x7f12fd199999 google::LogMessageFatal::~LogMessageFatal()
@ 0x5629b97175fb w2l::W2lListFilesDataset::loadListFile()
@ 0x5629b97183b9 w2l::W2lListFilesDataset::W2lListFilesDataset()
@ 0x5629b97371ee w2l::createDataset()
@ 0x5629b953c55b main
@ 0x7f12fc1eeb97 __libc_start_main
@ 0x5629b95985da _start
Aborted (core dumped)
This is the content from my "decode_tds_s2s_ngram_other.cfg" file:
# Replace `[...]`, `[DATA_DST]`, `[MODEL_DST]` with appropriate paths
# for test-other (best params for dev-other)
--am=/home/luca/Projects/wav2letter2/modelsota2019/am_tds_s2s_librispeech_dev_other.bin
--tokensdir=/home/luca/Projects/wav2letter2/modelsota2019
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/home/luca/Projects/wav2letter2/modelsota2019/decoder-unigram-10000-nbest10.lexicon
--lm=/home/luca/Projects/wav2letter2/modelsota2019/lm_librispeech_kenlm_wp_10k_6gram_pruning_000012.bin
--datadir=/home/luca/Projects/wav2letter2/input
--test=audiolist.txt
--uselexicon=false
--sclite=/home/luca/Projects/wav2letter2/output
--decodertype=tkn
--lmtype=kenlm
--beamsize=50
--beamsizetoken=10
--beamthreshold=10
--attentionthreshold=30
--smoothingtemperature=1
--nthread_decoder=1
--show
--showletters
--lmweight=1.1583913669221
--eosscore=-3.8305165336383
I didn't follow the suggested folder structure, but I guess that shouldn't be a problem as long as the file paths are pointing to the correct files.
@tdeboissiere
@avidov which model is this exactly? From the naming of the files, I guess it is the same as the streaming_convnet folder?
The acoustic and language models are in the S3 bucket s3://dl.fbaipublicfiles.com/wav2letter/inference/examples/model/ and are named acoustic_model.bin and language_model.bin.
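If it helps, they can be fetched directly with the AWS CLI; this is just a sketch, assuming the bucket allows anonymous reads (otherwise download them however the inference tutorial describes):
aws s3 cp --no-sign-request s3://dl.fbaipublicfiles.com/wav2letter/inference/examples/model/acoustic_model.bin .
aws s3 cp --no-sign-request s3://dl.fbaipublicfiles.com/wav2letter/inference/examples/model/language_model.bin .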
@AndroYD84 Thank you for the detailed post. I'll look more closely into it and come back to you later.
Out of interest, which ESPnet module and tutorial did you use?
@avidov Sorry, my question was perhaps unclear.
Like @AndroYD84, I was wondering from which paper (SOTA 2019, Streaming convnet), and with what architecture (Transformer, TDS, ResNet), this model was obtained?
Hi @AndroYD84,
Is this model as accurate as your new SOTA 2019 model? I jumped to the conclusion that I was still using an older model because I got better results with ESPnet and their Librispeech pretrained model than with WAV2LETTER and the provided model (although yours is much faster); based on the paper results, yours is (theoretically) supposed to be more accurate.
First of all, if we are talking about the best model from SOTA 2019 trained only on Librispeech, it is the Transformer Seq2Seq model, which should be decoded with an ngram/GCNN LM and then rescored with a Transformer language model. The best models we have with additional use of unsupervised data are the Transformer Seq2Seq and TDS Seq2Seq models (trained on Librispeech + Librivox) with ngram LM decoding (which are the current SOTA on Librispeech). You should use these models from the links here.
About inference: the best model from the inference paper is the same as in the inference tutorial. This model is TDS CTC (which is not the best model from the SOTA paper). You can compare it with the TDS CTC model from the SOTA paper; the latter is a bit better than the inference model because decoding uses a 4-gram LM instead of the 3-gram used for the inference paper.
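For example, decoding the Transformer S2S model with the ngram LM can be run the same way as the TDS command you quoted below, just pointing --flagsfile at the transformer config. This is only a sketch: the config name comes from the sota/2019 librispeech recipe, and the paths inside it (am, lm, lexicon, data) need to be edited for your setup first:
./Decoder --flagsfile /home/luca/Projects/wav2letter2/recipes/models/sota/2019/librispeech/decode_transformer_s2s_ngram_other.cfg --minloglevel=0 --logtostderr=1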
When I run this command:
luca@luca-ubnt:~/Projects/wav2letter2/build$ ./Decoder --flagsfile /home/luca/Projects/wav2letter2/recipes/models/sota/2019/librispeech/decode_tds_s2s_ngram_other.cfg --minloglevel=0 --logtostderr=1
I get this error:
Here, the file format for audiolist.txt should be
[utterance id] [audio file (full path)] [audio length] [word transcripts]
where the word transcript can be empty if you don't have a gold transcription for an audio file.
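For example, a line for your failing file could look roughly like this (the id and the length value here are just placeholders since I don't know your audio durations, and the transcript column is optional):
tspeak0001 /home/luca/Projects/wav2letter2/input/tspeak0001.wav 10000 the unlimited be in a major war with i am eager to work with you on ledge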
Could you send details on how you got these transcriptions for your audio: which w2l model you used, how you ran inference (with beam-search decoding or not, what lexicon you used), and what exactly you used from ESPnet? Can you send one of your audio files? We will debug a bit why it is so much worse than the ESPnet model.
@avidov
Out of interest, which ESPnet module and tutorial did you use?
After building ESPnet, I ran the ASR Demo example on my WAV audio files. I used their pretrained "librispeech.transformer.v1" model (joint CTC-attention Transformer trained on Librispeech), as shown in the demo.
It works pretty much like this: first I go to this folder:
cd espnet/egs/librispeech/asr1
Then run:
../../../utils/recog_wav.sh --models librispeech.transformer.v1 /path/to/audio.wav
The model should be downloaded automatically into the "espnet/egs/librispeech/asr1/decode/download/librispeech.transformer.v1" folder after running that command (if my memory is correct).
@tlikhomanenko
Could you send details on how you got these transcriptions for your audio: which w2l model you used, how you ran inference (with beam-search decoding or not, what lexicon you used), and what exactly you used from ESPnet?
To obtain the ESPnet transcriptions, I did as described in my reply to avidov above.
To obtain the WAV2LETTER transcriptions, I did all the steps exactly as shown in the Inference Framework tutorial using the AWS S3 trained model, then I got my transcriptions by running the Simple Streaming ASR Example on my own audio files (i.e. inference/inference/examples/simple_streaming_asr_example --input_files_base_path /path/to/aws/s3/model --input_audio_file /path/to/my/audio.wav).
To obtain the ground truth transcriptions, I first ran the audio files through the Google Cloud Speech-to-Text service, then manually cleaned the results of all errors by listening to them one by one.
Can you send one of your audio files? We will debug a bit why it is so much worse than the ESPnet model.
I have shared the files I used for my test here: https://github.com/AndroYD84/Files/blob/master/tspeech_test.zip If needed, I can share the full dataset privately, which is 4750 audio files (about 2 hours and 58 minutes in total) with their transcriptions.
@AndroYD84,
To compare with ESPnet (if you don't care about speed for now), you should use the Transformer S2S model trained on Librivox, either without any beam-search decoder or with an ngram LM (ConvLM will be released a bit later).
The model from the inference tutorial is TDS CTC (which is much faster than the other models). We are working on improving its quality to be closer to the Transformer S2S models.
I have shared the files I used for my test here: https://github.com/AndroYD84/Files/blob/master/tspeech_test.zip If needed, I can share the full dataset privately, which is 4750 audio files (about 2 hours and 58 minutes in total) with their transcriptions.
Thanks for sharing! For now it is sufficient; let us do some checks on the inference model. However, could you also test our Transformer S2S model that I pointed to above, to compare with ESPnet?
@tlikhomanenko
However, could you also test our Transformer S2S model that I pointed to above, to compare with ESPnet?
Here are the results using "decode_transformer_s2s_ngram_other.cfg":
tspeak0001.wav
ESPNET= THE UNLIMITED BE IN A MAJOR WAR WITH I AM EAGER TO WORK WITH YOU ON LEDGE
GROUND TRUTH= THE UNLIMITED BE IN A MAJOR WAR WITH I AM EAGER TO WORK WITH YOU ON LEDGE
WAV2LETTER= the unlimited me in a major war with which i am eager to work with you on a legend
tspeak0002.wav
ESPNET= NAPHTHA OR WHETHER IT WORLD TRADE ORGANIZATION WHICH CREATE
GROUND TRUTH= BREACH NAPHTHA OR WHETHER IT WORLD TRADE ORGANIZATION WHICH CREATE
WAV2LETTER= its laughter or whether it world trade organization would create
tspeak0003.wav
ESPNET= WHEN I HEARD ABOUT CHARLOTTE'S BILL WE'RE CLOSELY FOLLOWING THE STRONGEST WAY
GROUND TRUTH= WHEN I HEARD ABOUT CHARLOTTE'S BILL WE'RE CLOSELY FOLLOWING THE STRONGEST POSSIBLE
WAV2LETTER= when i heard about charlottesville were closely following the strongest possible
tspeak0004.wav
ESPNET= AND THE BAD PEOPLE IN PEOPLE GIVING A PLATFORM TO THESE
GROUND TRUTH= COULDN'T SEE AND THE BAD PEOPLE ONLY PEOPLE GIVING A PLATFORM TO THESE
WAV2LETTER= and the sick and the bad people giving a bad form to the
tspeak0005.wav
ESPNET= COMMITTED TO PASS IN THE HISTORY OF OUR COUNTRY IF THEY DO REMEMBER THEY ARE
GROUND TRUTH= COMMITTED TO PASS BUT IN THE HISTORY OF OUR COUNTRY IF THEY DO REMEMBER THEY ARE
WAV2LETTER= are committed to part in the history of our country if they do remember they are
tspeak0006.wav
ESPNET= HISTORIC CHARACTERS FOR ALL AMERICANS TO DEFEND AMERICA
GROUND TRUTH= HISTORIC CHARACTERS FOR ALL AMERICANS TO DEFEND AMERICA
WAV2LETTER= the four great things for all a man to defend a mark
tspeak0007.wav
ESPNET= IN WORKERS AND AMERICAN FAMILIES THEY'VE ALLOWED MILLIONS OF LOW WAGE WORKERS
GROUND TRUTH= IN WORKERS AND AMERICAN FAMILIES THEY'VE ALLOWED MILLIONS OF LOW WAGE WORKERS
WAV2LETTER= and workers and american families they have allowed millions of low wages
tspeak0008.wav
ESPNET= TOYOTA AND MARSA ARE OPENING UP A PLANT IN ALI
GROUND TRUTH= TOYOTA AND MAZDA ARE OPENING UP A PLANT IN ALI
WAV2LETTER= boor and mozart are opening up a plant in our
tspeak0009.wav
ESPNET= BRYAN SAID HE FELT GOD'S CALM WE ARE ALSO WE SAW OURS
GROUND TRUTH= BUT BRYAN SAID HE FELT GOD'S CALM WE ARE ALSO WE SAW OURS
WAV2LETTER= bryan said he felt god gone we are also restoring ours
tspeak0010.wav
ESPNET= DURING THE HOUR OF THE NIGHT I CALL ON SAUNDERS TO EMPOWER THE ENDING
GROUND TRUTH= DURING HOUR TONIGHT I CALL ON CONGRESS TO EMPOWER THE ENDING GAIN
WAV2LETTER= the hour of the night i call on thousands to a power ending game
By the way, I forgot to save the results, but I think the results were likely better with "decode_tds_s2s_ngram_other.cfg". I have to go in 5 minutes, so I can't retest now, sorry.
I also confirm that I have solved my previous problem; it was only the "audiolist.txt" file that had incorrect formatting. I corrected it as you suggested and it works now:
[utterance id] [audio file (full path)] [audio length] [word transcripts]
@AndroYD84, could you send the final WER on your dataset for the ESPnet, Transformer, and TDS-inference models, for comparison?
Also, I noticed that some words are missing from our lexicon while ESPnet infers them correctly, so I wonder what lexicon and decoding they are using?
@tlikhomanenko Sorry, but I'm on holiday now. I can access my PC remotely but can't do much, as it's currently using 100% of my GPU and CPU on another task; however, I can send the dataset privately if needed.
Is it also possible to use the pre-trained models in recipes with simple_streaming_asr_example instead of the model in the S3 bucket mentioned here?
Is it also possible to use the pre-trained models in recipes with simple_streaming_asr_example instead of the model in the S3 bucket mentioned here?
@avidov @vineelpratap?
Is it also possible to use the pre-trained models in recipes with simple_streaming_asr_example instead of the model in the S3 bucket mentioned here?
Yes, looking for this as well.
@AndroYD84: Did you figure out a way to use streaming/interactive decoder with SOTA 2019 models? If so, can you please share the config file?
Thanks.