
Transcribe audio in WAV format

Open Adportas opened this issue 5 years ago • 10 comments

A description of what we have done:

  1. We trained a Spanish model using this typical example architecture, with good results.
  2. The idea was then to use the model to transcribe some audio in WAV format, both to verify the results and to speed up the filtering of new data in a more automated way, giving us more training material.
  3. When we tried to use the trained model to transcribe audio from WAV files, both examples (simple_streaming_asr_example and multithreaded_streaming_asr_example) turned out to be incompatible: they require converting the model to a "streaming format".
  4. To convert the model there is the streaming_tds_model_converter program, which is difficult to compile; I only managed it in the Docker CPU virtual machine, with some help.
  5. When converting the model, we ran into something "obvious" that we were completely unaware of: as the program name streaming_tds_model_converter suggests, the model has to be of the TDS type, since that is the only compatible one, and the default architecture we used is not TDS.
  6. We then found a couple of TDS architectures, but they do not train. We assume the configuration needs changes, so we searched for the configurations of both architectures, but I still can't get them to train, while the original non-TDS model still trains fine.

After all this experience, the questions:

  1. Can we use the original non-TDS model to transcribe audio in WAV format? Can someone suggest a path?
  2. How can we convert the architecture and configuration files to TDS format without errors at training time? Is that possible?
  3. Why are there three different Docker images (CPU, CUDA, Inference)? Can we have all the features on one physical machine?

Adportas avatar Nov 25 '20 20:11 Adportas

@Adportas: Do you require streaming inference? The simple_streaming_asr_example binary does online (streaming) inference. For that, you need to convert the TDS+CTC flashlight model to an FBGEMM one.

Otherwise, you can use the decode binary to do offline inference with any architecture, not just TDS+CTC.

abhinavkulkarni avatar Nov 26 '20 06:11 abhinavkulkarni

Hi @abhinavkulkarni Thank you very much for clarifying my ideas. Apparently I lost my way while trying to use simple_streaming_asr_example. I ran the Decoder without parameters and it suggested this URL: Beam Search Decoder. Following the instructions available there, I created these files:

-> decode_es.cfg

--am=/home/empresa/wav2letter/modelo_nuevo_w2l/librispeech_clean_trainlogs/001_model_last.bin
--test=/home/jsanchez/audio_procesado/Transforma_las_heridas_de_tu_infancia_Seba/capitulo1/mini_wavs/chunk85.wav
--maxload=10 
--nthread_decoder=2 
--show 
--showletters
--lexicon=/home/jsanchez/audio_procesado/Daniel/vocabulario_libros.txt
--uselexicon=true
--lm=''
--lmtype=kenlm
--decodertype=wrd
--beamsize=100 
--beamsizetoken=100 
--beamthreshold=20 

-> comando_decode_es.sh

#!/bin/bash
/home/empresa/wav2letter/build/Decoder \
  --flagsfile /home/empresa/Daniel_Ingles/decode_es.cfg \
  --lmweight 1 \
  --wordscore 0 \
  --eosscore 0 \
  --silscore 0 \
  --unkscore 0 \
  --smearing max

Which returns these errors:

ERROR: illegal value '100 ' specified for int32 flag 'beamsize'
ERROR: illegal value '100 ' specified for int32 flag 'beamsizetoken'
ERROR: illegal value '20 ' specified for double flag 'beamthreshold'
ERROR: illegal value '10 ' specified for int32 flag 'maxload'
ERROR: illegal value '2 ' specified for int32 flag 'nthread_decoder'

I didn't modify the suggested configuration values, so why are they illegal?
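For what it's worth, each rejected value in the error output is quoted with a trailing space ('100 ', '20 ', '2 '), and the gflags numeric parsers refuse values that carry trailing whitespace, so the flags file itself likely has spaces at the ends of those lines. A minimal sketch (the file name is the one from the config above) that rewrites a flags file with trailing whitespace stripped:

```python
def clean_flags_file(path: str) -> None:
    """Rewrite a gflags --flagsfile in place, stripping trailing whitespace.

    '--beamsize=100 ' (note the trailing space) fails int32 parsing;
    '--beamsize=100' does not.
    """
    with open(path) as f:
        lines = [line.rstrip() for line in f]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

# clean_flags_file("decode_es.cfg")
```

After cleaning, re-running comando_decode_es.sh should at least get past the flag-parsing stage.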

Adportas avatar Nov 26 '20 16:11 Adportas

cc @vineelpratap @avidov @xuqiantong

tlikhomanenko avatar Nov 28 '20 01:11 tlikhomanenko

Hey @Adportas, I am not sure you can specify a single wav file for --test option. I think it expects a .lst file.

abhinavkulkarni avatar Dec 01 '20 05:12 abhinavkulkarni

Hi @abhinavkulkarni Thanks for the info. The same thing happens when I use a file listing the WAVs, and also when testing on the Docker CPU machine.

-> listado.lst

/root/wav2letter/Daniel/chunk73.wav
/root/wav2letter/Daniel/chunk70.wav
/root/wav2letter/Daniel/chunk71.wav
/root/wav2letter/Daniel/chunk75.wav
/root/wav2letter/Daniel/chunk74.wav
/root/wav2letter/Daniel/chunk72.wav

root@14ba36ae07df:~/wav2letter/Daniel# ./comando_decode_es.sh

ERROR: illegal value '100 ' specified for int32 flag 'beamsize'
ERROR: illegal value '100 ' specified for int32 flag 'beamsizetoken'
ERROR: illegal value '20 ' specified for double flag 'beamthreshold'
ERROR: illegal value '10 ' specified for int32 flag 'maxload'
ERROR: illegal value '2 ' specified for int32 flag 'nthread_decoder'

-> Reviewing another thread, I saw a .lst file with more columns, including the transcription and the duration (as in training); I imagine it is a kind of test with comparison against a reference. My intention is to transcribe some audios for which I don't have the texts. Does the Decoder have other modes of use, or is it not suitable for what I need? In that case, which one should I use?
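As a point of reference, the training/test .lst format is one line per sample with four space-separated columns: sample id, audio path, duration in milliseconds, and transcription. If the goal is only to get predictions for audio with no reference text, one workaround is to write a placeholder transcription so the loader still parses the file (any reported WER is then meaningless, but the predicted transcriptions still print with --show). A sketch, assuming plain PCM WAV input; make_lst and the "desconocido" placeholder are invented names:

```python
import wave
from pathlib import Path

def make_lst(wav_dir: str, out_path: str, placeholder: str = "desconocido") -> None:
    """Write a wav2letter-style .lst: 'id path duration_ms transcription' per WAV."""
    with open(out_path, "w") as out:
        for wav in sorted(Path(wav_dir).glob("*.wav")):
            # Duration in milliseconds, read from the WAV header.
            with wave.open(str(wav), "rb") as w:
                dur_ms = w.getnframes() * 1000.0 / w.getframerate()
            out.write(f"{wav.stem} {wav} {dur_ms:.2f} {placeholder}\n")
```

For example, make_lst("/root/wav2letter/Daniel", "listado.lst") would emit lines like "chunk73 /root/wav2letter/Daniel/chunk73.wav 4120.00 desconocido" (duration illustrative).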

Adportas avatar Dec 01 '20 22:12 Adportas

Hi @abhinavkulkarni The "illegal value" errors remain no matter which .lst file I use, even one with the columns suggested in this thread by @tlikhomanenko (1 /home/../1.wav 1234.34 hello world):

decode.lst

12 /home/empresa/Daniel_Ingles/Como_agua_para_chocolate_Seba_capitulo12_chunk11.flac 68542 dieron salida a la pasión por tantos años contenida

So I removed all the parameters that triggered the "illegal value" error, just to test what happened, and this other error appeared:

terminate called after throwing an instance of 'util::ErrnoException'
  what():  /home/empresa/kenlm/util/file.cc:76 in int util::OpenReadOrThrow(const char*) threw ErrnoException because `-1 == (ret = open(name, 00))'. No such file or directory while opening ''
*** Aborted at 1606941109 (unix time) try "date -d @1606941109" if you are using GNU date ***
PC: @ 0x7fd65d6f4f47 gsignal
*** SIGABRT (@0x3e800003c51) received by PID 15441 (TID 0x7fd6842ca380) from PID 15441; stack trace: ***
    @ 0x7fd65ec648a0 (unknown)
    @ 0x7fd65d6f4f47 gsignal
    @ 0x7fd65d6f68b1 abort
    @ 0x7fd65e318957 (unknown)
    @ 0x7fd65e31eae6 (unknown)
    @ 0x7fd65e31eb21 std::terminate()
    @ 0x7fd65e31ed54 __cxa_throw
    @ 0x55e173d6c313 util::OpenReadOrThrow()
    @ 0x55e173d6af6e lm::ngram::RecognizeBinary()
    @ 0x55e173d2bbbc lm::ngram::LoadVirtual()
    @ 0x55e173c50db9 w2l::KenLM::KenLM()
    @ 0x55e173b08700 main
    @ 0x7fd65d6d7b97 __libc_start_main
    @ 0x55e173b6474a _start
./comando_decode_es.sh: line 9: 15441 Aborted (core dumped) /home/adportas/wav2letter/build/Decoder --flagsfile /home/adportas/Daniel_Ingles/decode_es.cfg --lmweight 1 --wordscore 0 --eosscore 0 --silscore 0 --unkscore 0 --smearing max

-> It looks as if it cannot find a file, or rather the file name it receives is blank (No such file or directory while opening ''). Could that be because of the parameters I removed, or is it another known problem? Thanks in advance, Daniel
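Reading the stack trace bottom-up, the abort happens inside w2l::KenLM::KenLM() when util::OpenReadOrThrow() is handed the empty string, which matches the --lm='' line in decode_es.cfg: with --lmtype=kenlm the Decoder tries to open that (empty) path as a language model file. A hypothetical pre-flight check, with the file-valued flag names taken from the config above, that catches this kind of problem before launching the binary:

```python
import os

# Flags from decode_es.cfg whose values must name existing files.
FILE_FLAGS = {"am", "lm", "lexicon", "tokens", "test"}

def missing_file_flags(flags_path: str) -> list:
    """Return (flag, value) pairs from a --flagsfile whose value is not an existing file."""
    missing = []
    with open(flags_path) as f:
        for raw in f:
            line = raw.strip()
            if not line.startswith("--") or "=" not in line:
                continue
            name, value = line[2:].split("=", 1)
            value = value.strip().strip("'\"")  # --lm='' becomes the empty string
            if name in FILE_FLAGS and not os.path.isfile(value):
                missing.append((name, value))
    return missing
```

On the decode_es.cfg above it would report ('lm', ''), the same empty path KenLM aborts on.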

Adportas avatar Dec 02 '20 20:12 Adportas

Hey @Adportas,

I am not sure if this helps you, but here is an output I obtained by running Test binary to generate transcriptions:

$ /home/w2luser/Projects/wav2letter/cmake-build-debug-remote/Test \
    --am /data/podcaster/model/wav2letter/am_500ms_future_context_dev_other/am_500ms_future_context_dev_other.bin \
    --test /home/w2luser/w2l/lists/dev-clean.lst \
    --maxload 10 \
    --tokens /data/podcaster/model/wav2letter/am_500ms_future_context_dev_other/librispeech-train-all-unigram-10000.tokens \
    --lexicon /data/podcaster/model/wav2letter/am_500ms_future_context_dev_other/decoder-unigram-10000-nbest10.lexicon \
    --show --showletters \
    --emission_dir /tmp/emissions \
    --sclite /tmp/sclite
I20201203 07:52:30.939355 24702 Test.cpp:44] Parsing command line flags
I20201203 07:52:30.939424 24702 Test.cpp:56] [Network] Reading acoustic model from /data/podcaster/model/wav2letter/am_500ms_future_context_dev_other/am_500ms_future_context_dev_other.bin
I20201203 07:52:31.734445 24702 Test.cpp:62] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> output]
	(0): View (-1 80 1 0)
	(1): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
	(2): Padding (0, { (5, 3), (0, 0), (0, 0), (0, 0), })
	(3): Conv2D (1->15, 10x1, 2,1, 0,0, 1, 1) (with bias)
	(4): ReLU
	(5): Dropout (0.100000)
	(6): LayerNorm ( axis : { 1 2 } , size : -1)
	(7): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
	(8): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
	(9): Padding (0, { (7, 1), (0, 0), (0, 0), (0, 0), })
	(10): Conv2D (15->19, 10x1, 2,1, 0,0, 1, 1) (with bias)
	(11): ReLU
	(12): Dropout (0.100000)
	(13): LayerNorm ( axis : { 1 2 } , size : -1)
	(14): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
	(15): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
	(16): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
	(17): Padding (0, { (9, 1), (0, 0), (0, 0), (0, 0), })
	(18): Conv2D (19->23, 12x1, 2,1, 0,0, 1, 1) (with bias)
	(19): ReLU
	(20): Dropout (0.100000)
	(21): LayerNorm ( axis : { 1 2 } , size : -1)
	(22): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
	(23): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
	(24): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
	(25): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
	(26): Padding (0, { (10, 0), (0, 0), (0, 0), (0, 0), })
	(27): Conv2D (23->27, 11x1, 1,1, 0,0, 1, 1) (with bias)
	(28): ReLU
	(29): Dropout (0.100000)
	(30): LayerNorm ( axis : { 1 2 } , size : -1)
	(31): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
	(32): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
	(33): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
	(34): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
	(35): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
	(36): Reorder (2,1,0,3)
	(37): View (2160 -1 1 0)
	(38): Linear (2160->9998) (with bias)
	(39): View (9998 0 -1 1)
I20201203 07:52:31.734565 24702 Test.cpp:63] [Criterion] ConnectionistTemporalClassificationCriterion
I20201203 07:52:31.734570 24702 Test.cpp:64] [Network] Number of params: 115111823
I20201203 07:52:31.734588 24702 Test.cpp:70] [Network] Updating flags from config file: /data/podcaster/model/wav2letter/am_500ms_future_context_dev_other/am_500ms_future_context_dev_other.bin
I20201203 07:52:31.734803 24702 Test.cpp:83] Gflags after parsing 
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/data/podcaster/model/wav2letter/am_500ms_future_context_dev_other/am_500ms_future_context_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=arch.txt; --archdir=; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=8; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=/tmp/emissions; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1000000; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/data/podcaster/model/wav2letter/am_500ms_future_context_dev_other/decoder-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.40000000000000002; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxisz=33000; --maxload=10; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=200; 
--minrate=3; --minsil=0; --mintsz=2; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=1; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=2500; --rightWindowSize=50; --rndv_filepath=/checkpoint/vineelkpratap/experiments/speech/inference_tds//inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8/rndvz.21621542; --rundir=/checkpoint/vineelkpratap/experiments/speech/inference_tds/; --runname=inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=/tmp/sclite; --seed=0; --show=true; --showletters=true; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=tkn; --test=/home/w2luser/w2l/lists/dev-clean.lst; --tokens=/data/podcaster/model/wav2letter/am_500ms_future_context_dev_other/librispeech-train-all-unigram-10000.tokens; --tokensdir=; --train=/checkpoint/antares/datasets/librispeech/lists/train-clean-100.lst,/checkpoint/antares/datasets/librispeech/lists/train-clean-360.lst,/checkpoint/antares/datasets/librispeech/lists/train-other-500.lst,/checkpoint/vineelkpratap/experiments/speech/librivox.cut.sub36s.datasets.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/checkpoint/antares/datasets/librispeech/lists/dev-clean.lst,/checkpoint/antares/datasets/librispeech/lists/dev-other.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; --world_size=32; 
I20201203 07:52:31.753192 24702 Test.cpp:104] Number of classes (network): 9998
I20201203 07:52:38.735880 24702 Test.cpp:111] Number of words: 200001
Falling back to using letters as targets for the unknown word: quilter's
Falling back to using letters as targets for the unknown word: shampooer
Falling back to using letters as targets for the unknown word: ruggedo's
Falling back to using letters as targets for the unknown word: brion's
Falling back to using letters as targets for the unknown word: buzzer's
Falling back to using letters as targets for the unknown word: brandd
Falling back to using letters as targets for the unknown word: irolg
Falling back to using letters as targets for the unknown word: irolg
Falling back to using letters as targets for the unknown word: mainhall
Falling back to using letters as targets for the unknown word: burgoynes
Falling back to using letters as targets for the unknown word: mainhall
Falling back to using letters as targets for the unknown word: mainhall
Falling back to using letters as targets for the unknown word: westmere
Falling back to using letters as targets for the unknown word: dowle
Falling back to using letters as targets for the unknown word: mainhall
Falling back to using letters as targets for the unknown word: docetes
Falling back to using letters as targets for the unknown word: novatians
Falling back to using letters as targets for the unknown word: recuperations
Falling back to using letters as targets for the unknown word: chaba
Falling back to using letters as targets for the unknown word: zingiber
Falling back to using letters as targets for the unknown word: officinale
Falling back to using letters as targets for the unknown word: shimerda
Falling back to using letters as targets for the unknown word: shimerda
Falling back to using letters as targets for the unknown word: corncakes
Falling back to using letters as targets for the unknown word: ambrosch
Falling back to using letters as targets for the unknown word: ambrosch
Falling back to using letters as targets for the unknown word: yulka
Falling back to using letters as targets for the unknown word: yulka
Falling back to using letters as targets for the unknown word: shimerda
Falling back to using letters as targets for the unknown word: shimerda
Falling back to using letters as targets for the unknown word: ambrosch
Falling back to using letters as targets for the unknown word: shimerdas
Falling back to using letters as targets for the unknown word: ambrosch
Falling back to using letters as targets for the unknown word: shimerda
Falling back to using letters as targets for the unknown word: shimerda
Falling back to using letters as targets for the unknown word: shimerda
Falling back to using letters as targets for the unknown word: congal
Falling back to using letters as targets for the unknown word: congal's
Falling back to using letters as targets for the unknown word: finnacta
Falling back to using letters as targets for the unknown word: moling
Falling back to using letters as targets for the unknown word: dhourra
Falling back to using letters as targets for the unknown word: daguerreotypist
Falling back to using letters as targets for the unknown word: daguerreotypist
Falling back to using letters as targets for the unknown word: drouet's
Falling back to using letters as targets for the unknown word: hurstwood
Falling back to using letters as targets for the unknown word: hurstwood
Falling back to using letters as targets for the unknown word: hurstwood
Falling back to using letters as targets for the unknown word: hurstwood
Falling back to using letters as targets for the unknown word: hurstwood
Falling back to using letters as targets for the unknown word: rangitata
Falling back to using letters as targets for the unknown word: shoplets
Falling back to using letters as targets for the unknown word: bush'
Falling back to using letters as targets for the unknown word: vendhya
Falling back to using letters as targets for the unknown word: khosala
Falling back to using letters as targets for the unknown word: bhunda
Falling back to using letters as targets for the unknown word: doma
Falling back to using letters as targets for the unknown word: herbivore
Falling back to using letters as targets for the unknown word: darfhulva
Falling back to using letters as targets for the unknown word: telemetering
Falling back to using letters as targets for the unknown word: wahiti
Falling back to using letters as targets for the unknown word: glenarvan's
Falling back to using letters as targets for the unknown word: olbinett
Falling back to using letters as targets for the unknown word: mulrady
Falling back to using letters as targets for the unknown word: theosophies
Falling back to using letters as targets for the unknown word: satisfier
Falling back to using letters as targets for the unknown word: homoousios
Falling back to using letters as targets for the unknown word: homoiousios
Falling back to using letters as targets for the unknown word: synesius's
Falling back to using letters as targets for the unknown word: riverlike
Falling back to using letters as targets for the unknown word: heuchera
Falling back to using letters as targets for the unknown word: hennerberg
Falling back to using letters as targets for the unknown word: parrishes
Falling back to using letters as targets for the unknown word: magazzino
Falling back to using letters as targets for the unknown word: magazzino
Falling back to using letters as targets for the unknown word: razetta
Falling back to using letters as targets for the unknown word: gingle
Falling back to using letters as targets for the unknown word: bozzle
Falling back to using letters as targets for the unknown word: bozzle
Falling back to using letters as targets for the unknown word: bozzle
Falling back to using letters as targets for the unknown word: bozzle's
Falling back to using letters as targets for the unknown word: bozzle
Falling back to using letters as targets for the unknown word: bozzle
Falling back to using letters as targets for the unknown word: bozzle
Falling back to using letters as targets for the unknown word: bozzle
Falling back to using letters as targets for the unknown word: bozzle
Falling back to using letters as targets for the unknown word: bozzle
Falling back to using letters as targets for the unknown word: bozzle
Falling back to using letters as targets for the unknown word: bozzle
Falling back to using letters as targets for the unknown word: skint
Falling back to using letters as targets for the unknown word: bozzle
Falling back to using letters as targets for the unknown word: bozzle
Falling back to using letters as targets for the unknown word: bozzle
Falling back to using letters as targets for the unknown word: untrussing
Falling back to using letters as targets for the unknown word: lacquey's
Falling back to using letters as targets for the unknown word: balvastro
Falling back to using letters as targets for the unknown word: troke
Falling back to using letters as targets for the unknown word: troke
Falling back to using letters as targets for the unknown word: macklewain
Falling back to using letters as targets for the unknown word: macklewain
Falling back to using letters as targets for the unknown word: macklewain
Falling back to using letters as targets for the unknown word: macklewain
Falling back to using letters as targets for the unknown word: macklewain
Falling back to using letters as targets for the unknown word: macklewain
Falling back to using letters as targets for the unknown word: macklewain
Falling back to using letters as targets for the unknown word: macklewain
Falling back to using letters as targets for the unknown word: macklewain's
Falling back to using letters as targets for the unknown word: troke
Falling back to using letters as targets for the unknown word: derivatively
Falling back to using letters as targets for the unknown word: pinkies
Falling back to using letters as targets for the unknown word: boolooroo
Falling back to using letters as targets for the unknown word: pinkies
Falling back to using letters as targets for the unknown word: pinkies
Falling back to using letters as targets for the unknown word: pinkies
Falling back to using letters as targets for the unknown word: bambeday
Falling back to using letters as targets for the unknown word: bambeday
Falling back to using letters as targets for the unknown word: frierson's
Falling back to using letters as targets for the unknown word: ganny
Falling back to using letters as targets for the unknown word: gwynplaine's
Falling back to using letters as targets for the unknown word: gwynplaine's
Falling back to using letters as targets for the unknown word: gwynplaine's
Falling back to using letters as targets for the unknown word: fibi
Falling back to using letters as targets for the unknown word: vinos
Falling back to using letters as targets for the unknown word: fibi
Falling back to using letters as targets for the unknown word: vinos
Falling back to using letters as targets for the unknown word: tarrinzeau
Falling back to using letters as targets for the unknown word: nicless
Falling back to using letters as targets for the unknown word: hardwigg
Falling back to using letters as targets for the unknown word: myrdals
Falling back to using letters as targets for the unknown word: yokul
Falling back to using letters as targets for the unknown word: trampe
Falling back to using letters as targets for the unknown word: saknussemm
Falling back to using letters as targets for the unknown word: sudvestr
Falling back to using letters as targets for the unknown word: fjordungr
Falling back to using letters as targets for the unknown word: sneffels
Falling back to using letters as targets for the unknown word: gardar
Falling back to using letters as targets for the unknown word: sprucewood
Falling back to using letters as targets for the unknown word: delaunay's
Falling back to using letters as targets for the unknown word: testbridge
Falling back to using letters as targets for the unknown word: tumble's
Falling back to using letters as targets for the unknown word: canyou
Falling back to using letters as targets for the unknown word: beenie
Falling back to using letters as targets for the unknown word: beenie
Falling back to using letters as targets for the unknown word: bergez
Falling back to using letters as targets for the unknown word: rathskellers
Falling back to using letters as targets for the unknown word: weiser
Falling back to using letters as targets for the unknown word: scheiler
Falling back to using letters as targets for the unknown word: collander
Falling back to using letters as targets for the unknown word: brau
Falling back to using letters as targets for the unknown word: brau
Falling back to using letters as targets for the unknown word: abalone's
Falling back to using letters as targets for the unknown word: abalone's
Falling back to using letters as targets for the unknown word: abalone's
Falling back to using letters as targets for the unknown word: brau
Falling back to using letters as targets for the unknown word: ossipon
Falling back to using letters as targets for the unknown word: ossipon
Falling back to using letters as targets for the unknown word: verloc
Falling back to using letters as targets for the unknown word: yundt
Falling back to using letters as targets for the unknown word: ossipon's
Falling back to using letters as targets for the unknown word: ossipon's
Falling back to using letters as targets for the unknown word: verloc
Falling back to using letters as targets for the unknown word: verloc
Falling back to using letters as targets for the unknown word: verloc's
Falling back to using letters as targets for the unknown word: verloc
Falling back to using letters as targets for the unknown word: yundt
Falling back to using letters as targets for the unknown word: verloc
Falling back to using letters as targets for the unknown word: birdikins
Falling back to using letters as targets for the unknown word: tishimingo
Falling back to using letters as targets for the unknown word: breadhouse
Falling back to using letters as targets for the unknown word: bennydeck
Falling back to using letters as targets for the unknown word: bennydeck
Falling back to using letters as targets for the unknown word: presty
Falling back to using letters as targets for the unknown word: presty
Falling back to using letters as targets for the unknown word: sandyseal
Falling back to using letters as targets for the unknown word: presty
Falling back to using letters as targets for the unknown word: bennydeck
Falling back to using letters as targets for the unknown word: bennydeck
Falling back to using letters as targets for the unknown word: presty
Falling back to using letters as targets for the unknown word: bennydeck's
Falling back to using letters as targets for the unknown word: d'avrigny
Falling back to using letters as targets for the unknown word: d'avrigny
Falling back to using letters as targets for the unknown word: noirtier
Falling back to using letters as targets for the unknown word: noirtier
Falling back to using letters as targets for the unknown word: noirtier
Falling back to using letters as targets for the unknown word: noirtier
Falling back to using letters as targets for the unknown word: d'avrigny
Falling back to using letters as targets for the unknown word: delectasti
Falling back to using letters as targets for the unknown word: libano
Falling back to using letters as targets for the unknown word: quinci
Falling back to using letters as targets for the unknown word: impara
Falling back to using letters as targets for the unknown word: stupirti
I20201203 07:52:39.894254 24702 W2lListFilesDataset.cpp:141] 2703 files found. 
I20201203 07:52:39.894377 24702 Utils.cpp:102] Filtered 2/2703 samples
I20201203 07:52:39.896100 24702 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 2701
I20201203 07:52:39.896993 24702 Test.cpp:131] [Dataset] Dataset loaded.
Falling back to using letters as targets for the unknown word: rathskellers
Falling back to using letters as targets for the unknown word: weiser
|T|: o n e _ c a n _ a l m o s t _ i m a g i n e _ h i m s e l f _ i n _ o n e _ o f _ t h e _ f a m o u s _ r a t h s k e l l e r s _ o f _ o l d _ h e i d e l b e r g _ n o t _ a t _ t h e _ s c h l o s s _ o f _ c o u r s e _ f o r _ h e r e _ y o u _ c a n n o t _ l o o k _ d o w n _ o n _ t h e _ w e i s e r _ a s _ i t _ f l o w s _ b e n e a t h _ t h e _ w i n d o w s _ o f _ t h e _ g r e a t _ w i n e _ s t u b e _ o n _ t h e _ h i l l
|P|: o n e _ c a n _ a l m o s t _ i m a g i n e _ h i m s e l f _ i n _ o n e _ o f _ t h e _ f a m o u s r o i l l e r s _ o f _ o l d _ h e i d e l b e r g _ n o t _ a t _ t h e _ s c c h l o s _ o f _ c o u r s e _ f o r _ h e r e _ y o u _ c a n n o t _ l o o k _ d o w n _ o n _ t h e _ v i s o r _ a s _ i t _ f l o w s _ b e n e a t h _ t h e _ w i n d o w s _ o f _ t h e _ g r e a t _ w i n e _ s t o o d _ o n _ t h e _ h i l l
[sample: dev-clean-652-130726-0021, WER: 11.6279%, LER: 6.72646%, total WER: 11.6279%, total LER: 6.72646%, progress (thread 0): 10%]
|T|: s a n c h o _ r o s e _ a n d _ r e m o v e d _ s o m e _ d i s t a n c e _ f r o m _ t h e _ s p o t _ b u t _ a s _ h e _ w a s _ a b o u t _ t o _ p l a c e _ h i m s e l f _ l e a n i n g _ a g a i n s t _ a n o t h e r _ t r e e _ h e _ f e l t _ s o m e t h i n g _ t o u c h _ h i s _ h e a d _ a n d _ p u t t i n g _ u p _ h i s _ h a n d s _ e n c o u n t e r e d _ s o m e b o d y ' s _ t w o _ f e e t _ w i t h _ s h o e s _ a n d _ s t o c k i n g s _ o n _ t h e m
|P|: s a n c h o _ r o s e _ a n d _ r e m o v e d _ s o m e _ d i s t a n c e _ f r o m _ t h e _ s p o t _ b u t _ a s _ h e _ w a s _ a b o u t _ t o _ p l a c e _ h i m s e l f _ l e a n i n g _ a g a i n s t _ a n o t h e r _ t r e e _ h e _ f e l t _ s o m e t h i n g _ t o u c h _ h i s _ h e a d _ a n d _ p u t t i n g _ u p _ h i s _ h a n d s _ e n c o u n t e r e d _ s o m e b o d y ' s _ t w o _ f e e t _ w i t h _ s h o e s _ a n d _ s t o c k i n g s _ o n _ t h e m
[sample: dev-clean-3576-138058-0010, WER: 0%, LER: 0%, total WER: 5.88235%, total LER: 3.23974%, progress (thread 0): 20%]
|T|: s h e _ b i t _ h e r _ l i p _ a n d _ l o o k e d _ d o w n _ a t _ h e r _ h a n d s _ w h i c h _ w e r e _ c l a s p e d _ t i g h t l y _ i n _ f r o n t _ o f _ h e r
|P|: s h e _ b i t _ h e r _ l i p _ a n d _ l o o k e d _ d o w n _ a t _ h e r _ h a n d s _ w h i c h _ w e r e _ c l a s p e d _ t i g h t l y _ i n _ f r o n t _ o f _ h e r
[sample: dev-clean-1462-170142-0020, WER: 0%, LER: 0%, total WER: 4.85437%, total LER: 2.72727%, progress (thread 0): 30%]
Falling back to using letters as targets for the unknown word: bergez
|T|: w h e r e _ i s _ t h a t
|P|: w h e r e _ i s _ t h a t
[sample: dev-clean-1272-135031-0014, WER: 0%, LER: 0%, total WER: 4.71698%, total LER: 2.6643%, progress (thread 0): 40%]
|T|: h e _ s e e m e d _ t o _ b e _ t h i n k i n g _ o f _ s o m e t h i n g _ e l s e
|P|: h e _ s e e m e d _ t o _ b e _ t h i n k i n g _ o f _ s o m e t h i n g _ e l s e
[sample: dev-clean-2277-149874-0011, WER: 0%, LER: 0%, total WER: 4.38596%, total LER: 2.47934%, progress (thread 0): 50%]
|T|: i _ o n l y _ w i s h _ t o _ b e _ a l o n e _ y o u _ w i l l _ e x c u s e _ m e _ w i l l _ y o u _ n o t
|P|: i _ o n l y _ w i s h _ t o _ b e _ a l o n e _ y o u _ w i l l _ e x c u s e _ m e _ w i l l _ y o u _ n o t
[sample: dev-clean-84-121123-0027, WER: 0%, LER: 0%, total WER: 3.93701%, total LER: 2.27273%, progress (thread 0): 60%]
|T|: s h e _ w a s _ o n e _ o f _ a _ l a r g e _ c o m p a n y _ a t _ a _ h o u s e _ w h e r e _ s h e _ h a d _ n e v e r _ b e e n _ b e f o r e _ a _ b e a u t i f u l _ h o u s e _ w i t h _ a _ l a r g e _ g a r d e n _ b e h i n d
|P|: s h e _ w a s _ o n e _ o f _ a _ l a r g e _ c o m p a n y _ a t _ a _ h o u s e _ w h e r e _ s h e _ h a d _ n e v e r _ b e e n _ b e f o r e _ a _ b e a u t i f u l _ h o u s e _ w i t h _ a _ l a r g e _ g a r d e n _ b e h i n d
[sample: dev-clean-6345-64257-0005, WER: 0%, LER: 0%, total WER: 3.31126%, total LER: 1.92802%, progress (thread 0): 70%]
|T|: h e _ s e e m e d _ t o _ b e _ c u r s i n g _ p e o p l e _ w h o _ h a d _ w r o n g e d _ h i m
|P|: h e _ s e e m e d _ t o _ b e _ c u r s i n g _ p e o p l e _ w h o _ h a d _ w r o n g e d _ h i m
[sample: dev-clean-2035-147961-0010, WER: 0%, LER: 0%, total WER: 3.10559%, total LER: 1.81159%, progress (thread 0): 80%]
|T|: c l a u d i a _ w o u l d _ n o t _ o n _ a n y _ a c c o u n t _ a l l o w _ h i m _ t o _ a c c o m p a n y _ h e r _ a n d _ t h a n k i n g _ h i m _ f o r _ h i s _ o f f e r s _ a s _ w e l l _ a s _ s h e _ c o u l d _ t o o k _ l e a v e _ o f _ h i m _ i n _ t e a r s
|P|: c l a u d i a _ w o u l d _ n o t _ o n _ a n y _ a c c o u n t _ a l l o w _ h i m _ t o _ a c c o m p a n y _ h e r _ a n d _ t h a n k i n g _ h i m _ f o r _ h i s _ o f f e r s _ a s _ w e l l _ a s _ s h e _ c o u l d _ t o o k _ l e a v e _ o f _ h i m _ i n _ t e a r s
[sample: dev-clean-3576-138058-0029, WER: 0%, LER: 0%, total WER: 2.6455%, total LER: 1.55119%, progress (thread 0): 90%]
|T|: t h e _ r e s t a u r a n t s _ o f _ t h e _ p r e s e n t _ d a y _ t h a t _ a p p r o a c h _ n e a r e s t _ t h e _ o l d _ b o h e m i a n _ r e s t a u r a n t s _ o f _ p r e _ f i r e _ d a y s _ o f _ t h e _ f r e n c h _ c l a s s _ a r e _ j a c k ' s _ i n _ s a c r a m e n t o _ s t r e e t _ b e t w e e n _ m o n t g o m e r y _ a n d _ k e a r n y _ f e l i x _ i n _ m o n t g o m e r y _ s t r e e t _ b e t w e e n _ c l a y _ a n d _ w a s h i n g t o n _ a n d _ t h e _ p o o d l e _ d o g _ b e r g e z _ f r a n k s _ i n _ b u s h _ s t r e e t _ b e t w e e n _ k e a r n y _ a n d _ g r a n t _ a v e n u e
|P|: t h e _ r e s t a u r a n t s _ o f _ t h e _ p r e s e n t _ d a y _ t h a t _ a p p r o a c h _ n e a r e s t _ t h e _ o l d _ b o h e m i a n _ r e s t a u r a n t s _ o f _ p r e _ f i r e _ d a y s _ o f _ t h e _ f r e n c h _ c l a s s _ a r e _ j a c k s _ a n d _ s a c r a m e n t o _ s t r e e t _ b e t w e e n _ m o n t g o m e r y _ a n d _ k i n y _ f e l i x _ i n _ m o n t g o m e r y _ s t r e e t _ b e t w e e n _ c l a y _ a n d _ w a s h i n g t o n _ a n d _ t h e _ p o o d l e _ d o g _ b e s _ f r a n k _ i n _ b u s h _ s t r e e t _ b e t w e e n _ k n e y _ a n d _ g r a n t _ a v e n u e
[sample: dev-clean-652-130726-0011, WER: 11.5385%, LER: 4.70219%, total WER: 4.56432%, total LER: 2.33281%, progress (thread 0): 100%]
I20201203 07:52:41.808136 24702 Test.cpp:317] ------
I20201203 07:52:41.808157 24702 Test.cpp:318] [Test /home/w2luser/w2l/lists/dev-clean.lst (10 samples) in 1.91103s (actual decoding time 0.191s/sample) -- WER: 4.56432, LER: 2.33281]

Process finished with exit code 0
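
For reference, the WER and LER figures in the logs above are edit-distance ratios: the word-level (respectively character-level) Levenshtein distance between the reference `|T|` and the hypothesis `|P|`, divided by the reference length. A minimal sketch of that computation (this is the standard definition, not code lifted from wav2letter itself):

```python
def levenshtein(ref, hyp):
    # Classic dynamic-programming edit distance over token sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    # Word error rate in percent, tokens split on whitespace.
    r, h = ref.split(), hyp.split()
    return 100.0 * levenshtein(r, h) / len(r)

def ler(ref, hyp):
    # Letter (character) error rate in percent.
    return 100.0 * levenshtein(list(ref), list(hyp)) / len(ref)
```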

As you can see, I specified the sclite parameter, which is the name of the directory in which the stderr and stdout logs will be generated. In it, you can find a file that contains the top hypotheses from the beam-search decoder on your examples.

This is the flashlight model from the streaming convnets recipe.

And yes, the .lst file has the reference transcriptions from which the WER is calculated and shown in the output. You can probably fill that column with some dummy text.
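
On filling the .lst file with dummy references: in the wav2letter recipes each line is, as far as I understand, `<sample id> <audio path> <duration in ms> <transcription>` — treat the exact field layout as an assumption to verify against your recipe. A small sketch that writes and re-parses such a file (the sample ids and audio paths below are made up for illustration and need not exist on disk):

```python
import os
import tempfile

# Hypothetical entries; only the line layout matters here.
rows = [
    ("dev-0001", "/data/audio/dev-0001.wav", "2500", "hello world"),
    ("dev-0002", "/data/audio/dev-0002.wav", "1800", "dummy reference text"),
]

path = os.path.join(tempfile.mkdtemp(), "dev.lst")
with open(path, "w") as f:
    for sid, wav, dur_ms, text in rows:
        f.write(f"{sid} {wav} {dur_ms} {text}\n")

# Parse it back: the first three whitespace-separated fields are fixed;
# the remainder of the line is the (possibly multi-word) transcription.
parsed = []
with open(path) as f:
    for line in f:
        sid, wav, dur_ms, text = line.rstrip("\n").split(" ", 3)
        parsed.append((sid, wav, dur_ms, text))
```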

Let me know if this helps.

abhinavkulkarni avatar Dec 03 '20 15:12 abhinavkulkarni

@Adportas: From your original post:

After all this experience the questions:

  1. Can we use the original non-TDS model to transcribe audio in WAV format? Can someone suggest a path?
  2. How can we pass the architecture and configuration files to TDS format without errors at train? is it possible?
  3. Why are there 3 different Docker machines: CPU, CUDA, Inference? Can we have all features in one physical machine?

Have you tried training with the streaming convnets recipe? If you have a strong preference for other architectures and/or are targeting higher accuracy (at the cost of a more complex model), you should certainly look at seq2seq or transformer architectures.

You should search the forums for other people's experiences training the streaming convnets recipe; I trained it for a few iterations and the loss certainly seemed to be going down. Be aware, though, of the compute power and the number of days/weeks needed to train a model from scratch.

The CPU and CUDA Docker containers are there to run the Train/Test/Decode binaries on the CPU and CUDA backends. The inference Docker container is for running the online streaming examples.

abhinavkulkarni avatar Dec 03 '20 16:12 abhinavkulkarni

Hi @abhinavkulkarni, thanks for sharing your knowledge with me. I was finally able to track down the cause of the illegal values: they were simply blank spaces at the end of each line with numeric values in the example page I copied from. Here you can see how they appear when selected for copying:

Captura de pantalla de 2020-12-07 18-51-05

The error that came later occurred because I had left lm as '', as suggested by the documentation, but this is not accepted; by removing that parameter I was able to run the program:

Captura de pantalla de 2020-12-07 19-22-52
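
For anyone hitting the same empty-lm issue: rather than passing an empty `--lm=''`, drop the flag entirely. As a shape reference only, a decode flags file looks roughly like the fragment below — treat every flag name and value as an assumption to check against your wav2letter version's `--help`, and note that all paths are placeholders:

```
--am=/path/to/model.bin
--tokens=/path/to/tokens.txt
--lexicon=/path/to/lexicon.txt
--test=/path/to/decode.lst
--lm=/path/to/lm.bin
--lmweight=1.0
--wordscore=1.0
--beamsize=100
--sclite=/path/to/logs
```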

The results are not good at all, but at least it works; now I will have to improve on that:

|T|: dieron salida a la pasión por tantos años contenida
|P|: a n da a n por os s a
[sample: 12, WER: 88.8889%, LER: 13.7255%, slice WER: 88.8889%, slice LER: 13.7255%, decoded samples (thread 1): 1]
I1207 22:07:50.258610 66 Decode.cpp:721] ------
[Decode /home/empresa/Daniel_Ingles/decode.lst (1 samples) in 1.35016s (actual decoding time 0.0396s/sample) -- WER: 88.8889, LER: 13.7255]

Thanks, Daniel

Adportas avatar Dec 07 '20 22:12 Adportas

You can simply run Test.cpp to see the argmax output and its WER; this is an additional check on the correct usage of Decode.cpp. Decode.cpp should give you a better result with proper tuning of the decoder parameters. The result from Test.cpp is your upper bound on the best result.

tlikhomanenko avatar Dec 10 '20 08:12 tlikhomanenko
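
To make the distinction concrete: the Test binary's argmax output corresponds to greedy CTC decoding — take the most likely token per frame, merge consecutive repeats, and drop blanks — with no lexicon or language model involved, which is why a beam-search decoder with a tuned LM should only improve on it. A minimal sketch of the collapse rule (the blank symbol below is a hypothetical placeholder for this sketch, not wav2letter's actual token):

```python
BLANK = "_"  # hypothetical blank symbol used only in this sketch

def ctc_greedy_collapse(frame_tokens):
    """Collapse a per-frame argmax sequence into a transcript:
    merge consecutive duplicate tokens, then drop blanks."""
    out = []
    prev = None
    for tok in frame_tokens:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return "".join(out)

# e.g. per-frame tokens "hh_e_ll_llo" collapse to "hello"
```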