wav2letter
Decoder hangs up with pre-trained model
Hi, could you please advise? I've tried to run the decoder with the streaming_convnets pre-trained model, and it hangs at this step:
I0911 09:35:51.761121 5618 W2lListFilesDataset.cpp:141] 1 files found.
I0911 09:35:51.761857 5618 Utils.cpp:102] Filtered 1/1 samples
I0911 09:35:51.762053 5618 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
I0911 09:35:51.762890 5628 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 2
I0911 09:35:51.763069 5629 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
I0911 09:35:51.763093 5630 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 1
Reproduction Steps:
docker run -v ~/ML/recipes/streaming_convnets:/root/host --rm -itd --ipc=host --name w2l_streaming_convnets wav2letter/wav2letter:cpu-latest
All other commands were run inside the docker container:
export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64:$LD_LIBRARY_PATH
cd /root/wav2letter/build
./Decoder \
--flagsfile /root/host/model/decode_500ms_right_future_ngram_other.cfg \
--lm=/root/host/model/3-gram.pruned.3e-7.bin.qt \
--lmweight=0.5515838301157 \
--wordscore=0.52526055643809 \
--minloglevel=0 \
--logtostderr=1 \
--nthread_decoder=3
My file structure in /root/host
root@9c696da2978d:~/host# tree
.
|-- 5.wav
|-- model
| |-- 3-gram.pruned.3e-7.bin.qt
| |-- am
| | `-- librispeech-train-all-unigram-10000.tokens
| |-- am_500ms_future_context.arch
| |-- am_500ms_future_context_dev_other.bin
| |-- decode_500ms_right_future_ngram_other.cfg
| `-- decoder
| `-- decoder-unigram-10000-nbest10.lexicon
|-- test.lst
|-- test.lst.hyp
|-- test.lst.log
`-- test.lst.ref
test.lst.hyp, test.lst.log and test.lst.ref are empty and not updated.
Config: what I have in decode_500ms_right_future_ngram_other.cfg:
# Decoding config for Librispeech
# Replace `[...]`, `[DATA_DST]`, `[MODEL_DST]` with appropriate paths
# for test-other (best params for dev-other)
--am=/root/host/model/am_500ms_future_context_dev_other.bin
--tokensdir=/root/host/model/am
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/root/host/model/decoder/decoder-unigram-10000-nbest10.lexicon
--datadir=/root/host
--test=test.lst
--uselexicon=true
--sclite=/root/host
--decodertype=wrd
--lmtype=kenlm
--silscore=0
--beamsize=500
--beamsizetoken=100
--beamthreshold=100
--nthread_decoder=8
--smearing=max
--show
--showletters
List: what I have in test.lst (no ground truth, because the wiki says it is not mandatory):
0: /root/host/5.wav 10000
Audio: 5.wav is 10 seconds long, 16 kHz, 1 channel (mono), 16-bit.
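For reference, a minimal sketch of producing and checking such a file with sox (the tool used later in this thread); input.m4a is a placeholder input name:
# Convert any input to 16 kHz, mono, 16-bit PCM wav
sox input.m4a -r 16000 -c 1 -b 16 /root/host/5.wav
# Print the duration in seconds (handy for the list file)
soxi -D /root/host/5.wav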
Used: MacBook 15". CPU docker image with 8 GB RAM and 3 cores (no GPU).
Also:
I've tried to rebuild flashlight and wav2letter from the v0.2 branches inside wav2letter:cpu-latest, as described in the dependencies section of the model README.
The result is the same: the decoder hangs.
Full Log:
root@9c696da2978d:~/wav2letter/build# ./Decoder \
> --flagsfile /root/host/model/decode_500ms_right_future_ngram_other.cfg \
> --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt \
> --lmweight=0.5515838301157 \
> --wordscore=0.52526055643809 \
> --minloglevel=0 \
> --logtostderr=1 \
> --nthread_decoder=3
I0911 11:03:28.951413 5675 Decode.cpp:58] Reading flags from file /root/host/model/decode_500ms_right_future_ngram_other.cfg
I0911 11:03:28.955456 5675 Decode.cpp:75] [Network] Reading acoustic model from /root/host/model/am_500ms_future_context_dev_other.bin
I0911 11:03:30.589258 5675 Decode.cpp:79] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> output]
(0): View (-1 80 1 0)
(1): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
(2): Padding (0, { (5, 3), (0, 0), (0, 0), (0, 0), })
(3): Conv2D (1->15, 10x1, 2,1, 0,0, 1, 1) (with bias)
(4): ReLU
(5): Dropout (0.100000)
(6): LayerNorm ( axis : { 1 2 } , size : -1)
(7): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
(8): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
(9): Padding (0, { (7, 1), (0, 0), (0, 0), (0, 0), })
(10): Conv2D (15->19, 10x1, 2,1, 0,0, 1, 1) (with bias)
(11): ReLU
(12): Dropout (0.100000)
(13): LayerNorm ( axis : { 1 2 } , size : -1)
(14): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(15): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(16): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(17): Padding (0, { (9, 1), (0, 0), (0, 0), (0, 0), })
(18): Conv2D (19->23, 12x1, 2,1, 0,0, 1, 1) (with bias)
(19): ReLU
(20): Dropout (0.100000)
(21): LayerNorm ( axis : { 1 2 } , size : -1)
(22): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(23): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(24): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(25): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(26): Padding (0, { (10, 0), (0, 0), (0, 0), (0, 0), })
(27): Conv2D (23->27, 11x1, 1,1, 0,0, 1, 1) (with bias)
(28): ReLU
(29): Dropout (0.100000)
(30): LayerNorm ( axis : { 1 2 } , size : -1)
(31): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(32): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(33): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(34): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(35): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(36): Reorder (2,1,0,3)
(37): View (2160 -1 1 0)
(38): Linear (2160->9998) (with bias)
(39): View (9998 0 -1 1)
I0911 11:03:30.589411 5675 Decode.cpp:82] [Criterion] ConnectionistTemporalClassificationCriterion
I0911 11:03:30.589418 5675 Decode.cpp:84] [Network] Number of params: 115111823
I0911 11:03:30.589581 5675 Decode.cpp:90] [Network] Updating flags from config file: /root/host/model/am_500ms_future_context_dev_other.bin
I0911 11:03:30.593014 5675 Decode.cpp:106] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/root/host/model/am_500ms_future_context_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=arch.txt; --archdir=; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=8; --beamsize=500; --beamsizetoken=100; --beamthreshold=100; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/root/host; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/root/host/model/decode_500ms_right_future_ngram_other.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1000000; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/root/host/model/decoder/decoder-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0.5515838301157; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.40000000000000002; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxisz=33000; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=200; --minrate=3; --minsil=0; --mintsz=2; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=3; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=1; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=2500; --rightWindowSize=50; --rndv_filepath=/checkpoint/vineelkpratap/experiments/speech/inference_tds//inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8/rndvz.21621542; --rundir=/checkpoint/vineelkpratap/experiments/speech/inference_tds/; --runname=inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=/root/host; --seed=0; --show=true; --showletters=true; --silscore=0; --smearing=max; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=tkn; --test=test.lst; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/root/host/model/am; 
--train=/checkpoint/antares/datasets/librispeech/lists/train-clean-100.lst,/checkpoint/antares/datasets/librispeech/lists/train-clean-360.lst,/checkpoint/antares/datasets/librispeech/lists/train-other-500.lst,/checkpoint/vineelkpratap/experiments/speech/librivox.cut.sub36s.datasets.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/checkpoint/antares/datasets/librispeech/lists/dev-clean.lst,/checkpoint/antares/datasets/librispeech/lists/dev-other.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0.52526055643809; --wordseparator=_; --world_rank=0; --world_size=32; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0911 11:03:30.603224 5675 Decode.cpp:127] Number of classes (network): 9998
I0911 11:03:32.673702 5675 Decode.cpp:134] Number of words: 200001
I0911 11:03:32.904979 5675 Decode.cpp:247] [Decoder] LM constructed.
I0911 11:03:36.306375 5675 Decode.cpp:274] [Decoder] Trie planted.
I0911 11:03:36.761252 5675 Decode.cpp:286] [Decoder] Trie smeared.
I0911 11:03:37.455034 5675 W2lListFilesDataset.cpp:141] 1 files found.
I0911 11:03:37.456050 5675 Utils.cpp:102] Filtered 1/1 samples
I0911 11:03:37.456212 5675 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
I0911 11:03:37.456913 5685 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
I0911 11:03:37.457432 5687 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 1
I0911 11:03:37.457517 5686 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 2
Last lines of strace output (in case it is useful):
brk(0x5621214aa000) = 0x5621214aa000
brk(0x5621214cb000) = 0x5621214cb000
brk(0x5621214ec000) = 0x5621214ec000
brk(0x56212150d000) = 0x56212150d000
openat(AT_FDCWD, "/root/host/test.lst", O_RDONLY) = 8
read(8, "0: /root/host/5.wav 10000\n", 8191) = 26
read(8, "", 8191) = 0
gettid() = 5692
write(2, "I0911 11:30:35.061928 5692 W2lL"..., 73I0911 11:30:35.061928 5692 W2lListFilesDataset.cpp:141] 1 files found.
) = 73
close(8) = 0
gettid() = 5692
write(2, "I0911 11:30:35.064453 5692 Util"..., 64I0911 11:30:35.064453 5692 Utils.cpp:102] Filtered 1/1 samples
) = 64
gettid() = 5692
write(2, "I0911 11:30:35.066020 5692 W2lL"..., 86I0911 11:30:35.066020 5692 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
) = 86
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fc9205ec000
mprotect(0x7fc9205ed000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7fc920debef0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fc920dec9d0, tls=0x7fc920dec700, child_tidptr=0x7fc920dec9d0) = 5702
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fc91fdeb000
mprotect(0x7fc91fdec000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7fc9205eaef0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fc9205eb9d0, tls=0x7fc9205eb700, child_tidptr=0x7fc9205eb9d0) = 5703
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fc91f5ea000
mprotect(0x7fc91f5eb000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7fc91fde9ef0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fc91fdea9d0, tls=0x7fc91fdea700, child_tidptr=0x7fc91fdea9d0) = 5704
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fc91ede9000
mprotect(0x7fc91edea000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7fc91f5e8ef0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fc91f5e99d0, tls=0x7fc91f5e9700, child_tidptr=0x7fc91f5e99d0) = 5705
futex(0x7fff63119ff8, FUTEX_WAKE_PRIVATE, 1) = 1
I0911 11:30:35.073495 5702 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
futex(0x7fff63119ff8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fff63119ff8, FUTEX_WAKE_PRIVATE, 1) = 1
I0911 11:30:35.074316 5704 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 1
futex(0x7fff63119ff8, FUTEX_WAKE_PRIVATE, 1) = 1
I0911 11:30:35.074676 5705 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 2
futex(0x7fff63119ffc, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x7fc920dec9d0, FUTEX_WAIT, 5702, NULL
There is no CPU usage after this point.
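In case it is useful: per-thread backtraces of the hung process can be captured with gdb (a generic sketch, assuming gdb is installed in the container):
# Attach to the hung Decoder and dump a backtrace for every thread
gdb --batch -p "$(pgrep -f Decoder | head -n1)" -ex 'thread apply all bt'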

Git log:
root@9c696da2978d:~/wav2letter/build# git log
commit 55e9ebc233a001a21c4033aa2a8a60dbf1fe62ec (grafted, HEAD -> master, origin/master)
Author: Tatiana Likhomanenko <[email protected]>
Date: Wed Aug 12 13:08:08 2020 -0700
fix lexicon-free https://github.com/facebookresearch/wav2letter/issues/777; add mosesdecoder version for sota/2019
Summary: title
Reviewed By: vineelpratap
Differential Revision: D23063624
fbshipit-source-id: ffb59e483c5ccbb0c0d8145d7c8afc610c15287a
Hey! In your log you can see I0911 11:03:37.456050 5675 Utils.cpp:102] Filtered 1/1 samples, which means that all samples are filtered out. Could you run with --nthread_decoder=1 --maxisz=1000000 so that your sample is not filtered? Also, could you say what the target transcription length is?
I started the decoder with the --nthread_decoder=1 --maxisz=1000000 options and got the same result:
I0914 08:24:05.279361 112 W2lListFilesDataset.cpp:141] 1 files found.
I0914 08:24:05.279836 112 Utils.cpp:102] Filtered 1/1 samples
I0914 08:24:05.280282 112 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
I0914 08:24:05.281785 122 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
Also could you say what is the target transcription length? I have no ground-truth transcription for this file.
But I also tried a sample from LibriSpeech:
dev-clean-1272-128104-0000 ./w2l/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac 5.855 mister quilter is the apostle of the middle classes and we are glad to welcome his gospel
both as flac and as wav converted with sox. These are also filtered.
Could you please advise?
Could you try with --maxisz=1000000 --maxtsz=1000000 --minisz=0 --mintsz=0?
Yes, thank you, --minisz=0 helped and it worked 👍
But I noticed that if I use audio without ground truth, the file is also filtered. For example, this row from the test.lst file works:
flac /root/host/flac.wav 5.855 mister quilter is the apostle of the middle classes and we are glad to welcome his gospe
while this one gets filtered:
flac /root/host/flac.wav 5.855
How can I avoid this?
Could you try --minisz=-1? Can you confirm that you see I0914 08:24:05.279836 112 Utils.cpp:102] Filtered 1/1 samples again in this case?
Yes, Filtered 1/1 samples.
I've tried --minisz=-1 and my lst file contains:
flac /root/host/flac.wav 5.855 (it is audio from the LibriSpeech dataset converted to wav)
Full log:
root@5ad90d8a5ec1:~/wav2letter/build# export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64:$LD_IBRARY_PATH
root@5ad90d8a5ec1:~/wav2letter/build# ./Decoder \
> --flagsfile /root/host/model/decode_500ms_right_future_ngram_other.cfg \
> --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt \
> --lmweight=0.5515838301157 \
> --wordscore=0.52526055643809 \
> --minloglevel=0 \
> --logtostderr=1 \
> --nthread_decoder=1 \
> --maxisz=1000000 \
> --minisz=1
I0921 07:25:27.469930 23 Decode.cpp:58] Reading flags from file /root/host/model/decode_500ms_right_future_ngram_other.cfg
I0921 07:25:27.473843 23 Decode.cpp:75] [Network] Reading acoustic model from /root/host/model/am_500ms_future_context_dev_other.bin
I0921 07:25:30.220180 23 Decode.cpp:79] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> output]
(0): View (-1 80 1 0)
(1): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
(2): Padding (0, { (5, 3), (0, 0), (0, 0), (0, 0), })
(3): Conv2D (1->15, 10x1, 2,1, 0,0, 1, 1) (with bias)
(4): ReLU
(5): Dropout (0.100000)
(6): LayerNorm ( axis : { 1 2 } , size : -1)
(7): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
(8): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
(9): Padding (0, { (7, 1), (0, 0), (0, 0), (0, 0), })
(10): Conv2D (15->19, 10x1, 2,1, 0,0, 1, 1) (with bias)
(11): ReLU
(12): Dropout (0.100000)
(13): LayerNorm ( axis : { 1 2 } , size : -1)
(14): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(15): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(16): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(17): Padding (0, { (9, 1), (0, 0), (0, 0), (0, 0), })
(18): Conv2D (19->23, 12x1, 2,1, 0,0, 1, 1) (with bias)
(19): ReLU
(20): Dropout (0.100000)
(21): LayerNorm ( axis : { 1 2 } , size : -1)
(22): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(23): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(24): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(25): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(26): Padding (0, { (10, 0), (0, 0), (0, 0), (0, 0), })
(27): Conv2D (23->27, 11x1, 1,1, 0,0, 1, 1) (with bias)
(28): ReLU
(29): Dropout (0.100000)
(30): LayerNorm ( axis : { 1 2 } , size : -1)
(31): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(32): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(33): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(34): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(35): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(36): Reorder (2,1,0,3)
(37): View (2160 -1 1 0)
(38): Linear (2160->9998) (with bias)
(39): View (9998 0 -1 1)
I0921 07:25:30.220360 23 Decode.cpp:82] [Criterion] ConnectionistTemporalClassificationCriterion
I0921 07:25:30.220366 23 Decode.cpp:84] [Network] Number of params: 115111823
I0921 07:25:30.220383 23 Decode.cpp:90] [Network] Updating flags from config file: /root/host/model/am_500ms_future_context_dev_other.bin
I0921 07:25:30.225986 23 Decode.cpp:106] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/root/host/model/am_500ms_future_context_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=arch.txt; --archdir=; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=8; --beamsize=500; --beamsizetoken=100; --beamthreshold=100; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/root/host/lists; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/root/host/model/decode_500ms_right_future_ngram_other.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1000000; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/root/host/model/decoder/decoder-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0.5515838301157; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.40000000000000002; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxisz=1000000; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=1; --minrate=3; --minsil=0; --mintsz=2; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=1; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=2500; --rightWindowSize=50; --rndv_filepath=/checkpoint/vineelkpratap/experiments/speech/inference_tds//inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8/rndvz.21621542; --rundir=/checkpoint/vineelkpratap/experiments/speech/inference_tds/; --runname=inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=/root/host/sclitedir; --seed=0; --show=true; --showletters=true; --silscore=0; --smearing=max; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=tkn; --test=3test.lst; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/root/host/model/am; 
--train=/checkpoint/antares/datasets/librispeech/lists/train-clean-100.lst,/checkpoint/antares/datasets/librispeech/lists/train-clean-360.lst,/checkpoint/antares/datasets/librispeech/lists/train-other-500.lst,/checkpoint/vineelkpratap/experiments/speech/librivox.cut.sub36s.datasets.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/checkpoint/antares/datasets/librispeech/lists/dev-clean.lst,/checkpoint/antares/datasets/librispeech/lists/dev-other.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0.52526055643809; --wordseparator=_; --world_rank=0; --world_size=32; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0921 07:25:30.236332 23 Decode.cpp:127] Number of classes (network): 9998
I0921 07:25:32.854022 23 Decode.cpp:134] Number of words: 200001
I0921 07:25:33.202980 23 Decode.cpp:247] [Decoder] LM constructed.
I0921 07:25:37.651134 23 Decode.cpp:274] [Decoder] Trie planted.
I0921 07:25:38.087067 23 Decode.cpp:286] [Decoder] Trie smeared.
I0921 07:25:38.992707 23 W2lListFilesDataset.cpp:141] 1 files found.
I0921 07:25:38.993213 23 Utils.cpp:102] Filtered 1/1 samples
I0921 07:25:38.993324 23 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
I0921 07:25:38.995147 33 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
cc @vineelpratap: is there any option to accept empty transcriptions?
Could you use --mintsz=-1 and see if it helps?
Yes, still Filtered 1/1 samples.
I've tried --minisz=-1 and my lst file contains flac /root/host/flac.wav 5.855 (it is audio from the LibriSpeech dataset converted to wav). The full log is identical to the one above.
I see in the log that mintsz=2, not -1. Could you fix that and try again?
@tlikhomanenko Thank you for your comment. Yes, you're right.
I set --mintsz=-1 and the file is no longer filtered, but the decoder still hangs.
root@6ca99de58c9c:~/wav2letter/build# ./Decoder \
> --flagsfile /root/host/model/decode_500ms_right_future_ngram_other.cfg \
> --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt \
> --lmweight=0.5515838301157 \
> --wordscore=0.52526055643809 \
> --minloglevel=0 \
> --logtostderr=1 \
> --nthread_decoder=1 \
> --maxisz=100000 \
> --minisz=1 \
> --mintsz=-1
I0929 10:45:00.953776 133 Decode.cpp:58] Reading flags from file /root/host/model/decode_500ms_right_future_ngram_other.cfg
I0929 10:45:00.957911 133 Decode.cpp:75] [Network] Reading acoustic model from /root/host/model/am_500ms_future_context_dev_other.bin
I0929 10:45:02.714236 133 Decode.cpp:79] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> output]
(0): View (-1 80 1 0)
(1): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
(2): Padding (0, { (5, 3), (0, 0), (0, 0), (0, 0), })
(3): Conv2D (1->15, 10x1, 2,1, 0,0, 1, 1) (with bias)
(4): ReLU
(5): Dropout (0.100000)
(6): LayerNorm ( axis : { 1 2 } , size : -1)
(7): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
(8): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
(9): Padding (0, { (7, 1), (0, 0), (0, 0), (0, 0), })
(10): Conv2D (15->19, 10x1, 2,1, 0,0, 1, 1) (with bias)
(11): ReLU
(12): Dropout (0.100000)
(13): LayerNorm ( axis : { 1 2 } , size : -1)
(14): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(15): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(16): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(17): Padding (0, { (9, 1), (0, 0), (0, 0), (0, 0), })
(18): Conv2D (19->23, 12x1, 2,1, 0,0, 1, 1) (with bias)
(19): ReLU
(20): Dropout (0.100000)
(21): LayerNorm ( axis : { 1 2 } , size : -1)
(22): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(23): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(24): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(25): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(26): Padding (0, { (10, 0), (0, 0), (0, 0), (0, 0), })
(27): Conv2D (23->27, 11x1, 1,1, 0,0, 1, 1) (with bias)
(28): ReLU
(29): Dropout (0.100000)
(30): LayerNorm ( axis : { 1 2 } , size : -1)
(31): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(32): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(33): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(34): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(35): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(36): Reorder (2,1,0,3)
(37): View (2160 -1 1 0)
(38): Linear (2160->9998) (with bias)
(39): View (9998 0 -1 1)
I0929 10:45:02.714473 133 Decode.cpp:82] [Criterion] ConnectionistTemporalClassificationCriterion
I0929 10:45:02.714479 133 Decode.cpp:84] [Network] Number of params: 115111823
I0929 10:45:02.714498 133 Decode.cpp:90] [Network] Updating flags from config file: /root/host/model/am_500ms_future_context_dev_other.bin
I0929 10:45:02.718485 133 Decode.cpp:106] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/root/host/model/am_500ms_future_context_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=arch.txt; --archdir=; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=8; --beamsize=500; --beamsizetoken=100; --beamthreshold=100; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/root/host/lists; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/root/host/model/decode_500ms_right_future_ngram_other.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1000000; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/root/host/model/decoder/decoder-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0.5515838301157; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.40000000000000002; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxisz=100000; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=1; --minrate=3; --minsil=0; --mintsz=-1; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=1; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=2500; --rightWindowSize=50; --rndv_filepath=/checkpoint/vineelkpratap/experiments/speech/inference_tds//inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8/rndvz.21621542; --rundir=/checkpoint/vineelkpratap/experiments/speech/inference_tds/; --runname=inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=/root/host/sclitedir; --seed=0; --show=true; --showletters=true; --silscore=0; --smearing=max; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=tkn; --test=1test.lst; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/root/host/model/am; 
--train=/checkpoint/antares/datasets/librispeech/lists/train-clean-100.lst,/checkpoint/antares/datasets/librispeech/lists/train-clean-360.lst,/checkpoint/antares/datasets/librispeech/lists/train-other-500.lst,/checkpoint/vineelkpratap/experiments/speech/librivox.cut.sub36s.datasets.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/checkpoint/antares/datasets/librispeech/lists/dev-clean.lst,/checkpoint/antares/datasets/librispeech/lists/dev-other.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0.52526055643809; --wordseparator=_; --world_rank=0; --world_size=32; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0929 10:45:02.730916 133 Decode.cpp:127] Number of classes (network): 9998
I0929 10:45:04.877595 133 Decode.cpp:134] Number of words: 200001
I0929 10:45:05.131882 133 Decode.cpp:247] [Decoder] LM constructed.
I0929 10:45:08.631345 133 Decode.cpp:274] [Decoder] Trie planted.
I0929 10:45:09.117630 133 Decode.cpp:286] [Decoder] Trie smeared.
I0929 10:45:10.147207 133 W2lListFilesDataset.cpp:141] 1 files found.
I0929 10:45:10.148475 133 Utils.cpp:102] Filtered 0/1 samples
I0929 10:45:10.149003 133 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 1
I0929 10:45:10.154240 143 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
@mironnn Can you please help me use wav2letter for running inference on a sample audio file? I would like to know where you downloaded the pre-trained models and the other files, such as the lexicon and token files. Basically, I need information on the files below.
|-- model
| |-- 3-gram.pruned.3e-7.bin.qt
| |-- am
| | `-- librispeech-train-all-unigram-10000.tokens
| |-- am_500ms_future_context.arch
| |-- am_500ms_future_context_dev_other.bin
| |-- decode_500ms_right_future_ngram_other.cfg
| `-- decoder
| `-- decoder-unigram-10000-nbest10.lexicon
Please help me regarding this. Thank you
Regards, Manoj
You can find everything here: https://github.com/facebookresearch/wav2letter/tree/master/recipes/streaming_convnets/librispeech
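In case it helps, a sketch of fetching the files into the layout shown above; the *_URL variables are placeholders, and the real download links are listed in that recipe README:
# Set each *_URL from the links in the recipe README (placeholders here)
mkdir -p model/am model/decoder
wget -O model/am/librispeech-train-all-unigram-10000.tokens "$TOKENS_URL"
wget -O model/am_500ms_future_context_dev_other.bin "$AM_URL"
wget -O model/3-gram.pruned.3e-7.bin.qt "$LM_URL"
wget -O model/decoder/decoder-unigram-10000-nbest10.lexicon "$LEXICON_URL"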
@mironnn Could you confirm that the decoder works for you when the transcription is not empty? If so, we will fix the issue with empty transcriptions in the future (the decoder hangs because an exception is not propagated correctly to terminate the program; we are already fixing this).
Please use a non-empty transcription for now, sorry for the inconvenience!
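In the meantime, a minimal sketch of a workaround, assuming the three-field id path duration list layout shown earlier; the dummy word should only affect the written .ref file, not the decoded hypothesis:
# Append a placeholder transcription to every 3-field line of the list
awk 'NF == 3 { print $0, "placeholder"; next } { print }' test.lst > test.fixed.lst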
Hi,
Please make sure you are using the --show flag.
Another possible scenario where this can happen is an error occurring while decoding inside the thread pool. You can try making this change in Decode.cpp and Test.cpp to make sure the error is shown in the logs:
// before: an exception thrown inside the pool is silently swallowed
threadPool.enqueue(...);
// after: keep the returned future and call get() on it, which blocks
// until the task finishes and rethrows any exception from the worker
auto fut = threadPool.enqueue(...);
fut.get();
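With this change, fut.get() should surface the underlying error (for example, the empty-transcription failure above) in the logs instead of the process hanging silently; this assumes the thread pool's enqueue returns a std::future, as in common packaged-task based pools.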