wav2letter
Difference between sota/2019/am_tds_ctc and streaming_convnets/librispeech/am_500ms_future_context models?
Question
Hi,
Other than the architecture, what is the difference between the sota/2019/am_tds_ctc and streaming_convnets/librispeech/am_500ms_future_context models?
I am able to convert the latter to an FBGEMM streaming convnet using the conversion tool; however, I got the following error when I tried converting the former:
I1115 13:35:06.643517 7721 StreamingTDSModelConverter.cpp:152] [Network] Reading acoustic model from /home/w2luser/w2l/am/am_tds_ctc_librispeech_dev_other.bin
I1115 13:35:07.701886 7721 StreamingTDSModelConverter.cpp:157] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> output]
(0): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
(1): View (-1 80 1 0)
(2): Conv2D (1->10, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(3): ReLU
(4): Dropout (0.000000)
(5): LayerNorm ( axis : { 0 1 2 } , size : -1)
(6): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(7): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(8): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(9): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(10): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(11): Conv2D (10->14, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(12): ReLU
(13): Dropout (0.000000)
(14): LayerNorm ( axis : { 0 1 2 } , size : -1)
(15): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(16): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(17): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(18): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(19): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(20): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(21): Conv2D (14->18, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(22): ReLU
(23): Dropout (0.000000)
(24): LayerNorm ( axis : { 0 1 2 } , size : -1)
(25): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(26): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(27): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(28): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(29): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(30): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(31): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(32): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(33): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(34): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(35): View (0 1440 1 0)
(36): Reorder (1,0,3,2)
(37): Linear (1440->9998) (with bias)
I1115 13:35:07.702139 7721 StreamingTDSModelConverter.cpp:158] [Criterion] ConnectionistTemporalClassificationCriterion
I1115 13:35:07.702153 7721 StreamingTDSModelConverter.cpp:159] [Network] Number of params: 203394122
I1115 13:35:07.702214 7721 StreamingTDSModelConverter.cpp:165] [Network] Updating flags from config file: /home/w2luser/w2l/am/am_tds_ctc_librispeech_dev_other.bin
I1115 13:35:07.702975 7721 StreamingTDSModelConverter.cpp:174] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/home/w2luser/w2l/am/am_tds_ctc_librispeech_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=am_arch/am_tds_ctc.arch; --archdir=/home/w2luser/Projects/wav2letter/recipes/models/sota/2019; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=4; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/home/w2luser/Projects/wav2letter/recipes/models/sota/2019/librispeech/train_am_tds_ctc.cfg; --framesizems=30; --framestridems=10; --gamma=0.5; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1500; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/home/w2luser/w2l/am/librispeech-train+dev-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.29999999999999999; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=1; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; 
--memstepsize=8338608; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0.5; --netoptim=sgd; --noresample=false; --nthread=10; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=100; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=0; --rightWindowSize=50; --rndv_filepath=/checkpoint/qiantong/ls_200M/do0.15_l5.6.10_mid3.0_incDO/100_rndv; --rundir=[...]; --runname=am_tds_ctc_librispeech; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=; --seed=2; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=200; --surround=; --tag=; --target=ltr; --test=; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/home/w2luser/w2l/am; --train=[DATA_DST]/lists/train-clean-100.lst,[DATA_DST]/lists/train-clean-360.lst,[DATA_DST]/lists/train-other-500.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=dev-clean:[DATA_DST]/lists/dev-clean.lst,dev-other:[DATA_DST]/lists/dev-other.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; --world_size=64; --outdir=/home/w2luser/models; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I1115 13:35:07.736187 7721 StreamingTDSModelConverter.cpp:192] Number of classes (network): 9998
Skipping SpecAugment module: SAUG 80 27 2 100 1.0 2
Skipping View module: V -1 NFEAT 1 0
Skipping Dropout module: DO 0.0
F1115 13:35:07.754354 7721 StreamingTDSModelConverter.cpp:246] Unsupported LayerNorm axis: must be {1, 2} for streaming
*** Check failure stack trace: ***
@ 0x7fe42662e1c3 google::LogMessage::Fail()
@ 0x7fe42663325b google::LogMessage::SendToLog()
@ 0x7fe42662debf google::LogMessage::Flush()
@ 0x7fe42662e6ef google::LogMessageFatal::~LogMessageFatal()
@ 0x561167388a79 main
@ 0x7fe425fd6cb2 __libc_start_main
@ 0x561167386ade _start
I was under the impression that any TDS CTC model could be converted to FBGEMM streaming convnets.
Thanks!
Hi,
To make the architecture streamable, you would have to make changes to the TDS+CTC architecture. Using the plain TDS+CTC architecture won't work for the streaming use case...
Here are the main changes ...
- Remove normalization over time in the LN and TDS modules.
- See the changed architecture file in the streaming_convnets recipe.
- Use --localnrmlleftctx=300.
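Concretely, the LayerNorm part of this amounts to dropping axis 0 (time) from the LN lines in the arch file. A minimal sketch (the surrounding lines of am_tds_ctc.arch are not reproduced here, and the exact TDS-module changes are separate):

```
# before: normalizes over axes {0, 1, 2} — includes time, not streamable
LN 0 1 2
# after: normalizes over axes {1, 2} only — streamable
LN 1 2
```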
Thanks, @vineelpratap.
- For LN (LayerNorm), can I simply remove the time dimension and reuse the parameters of the rest of the layers as is?
- It seems providing --localnrmlleftctx=300 is moot for the TDS+CTC architecture, since LocalNorm (not to be confused with LayerNorm) isn't used anywhere in the model. Is my understanding correct?
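For what it's worth, here is a quick numpy sketch (toy tensor shapes, affine parameters omitted) of why that parameter reuse is at least shape-compatible, even though the activations themselves change:

```python
import numpy as np

def layer_norm(x, axes, eps=1e-5):
    # Plain LayerNorm over the given axes (affine gain/bias omitted:
    # they are applied elementwise, so they fit either variant).
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Toy activation tensor: (time, width, channels) -- axis 0 is time.
x = np.random.default_rng(0).normal(size=(50, 80, 10))

full = layer_norm(x, axes=(0, 1, 2))  # LN 0 1 2 (normalizes over time too)
no_t = layer_norm(x, axes=(1, 2))     # LN 1 2 (per-time-step, streamable)

print(full.shape == no_t.shape)  # True: downstream layers see the same shape
print(np.allclose(full, no_t))   # False: the activations differ
```

So nothing breaks structurally, but the outputs fed to subsequent layers are no longer the ones the pretrained weights were trained against.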
I did the above two (converted LN 0 1 2 to LN 1 2 in the arch file and provided --localnrmlleftctx=300 in the config file) and ran the streaming TDS model conversion script. I was able to obtain an acoustic_model.bin; however, I get the following error. It looks like the outputs from the Flashlight and FBGEMM models don't match.
What additional changes need to be made?
Thanks!
/home/w2luser/Projects/wav2letter/cmake-build-debug-fbgemm/tools/streaming_tds_model_converter --am /data/podcaster/model/wav2letter/am_tds_ctc_librispeech_dev_other/am_tds_ctc_librispeech_dev_other.bin --outdir /home/w2luser/models --flagsfile /home/w2luser/Projects/wav2letter/recipes/models/sota/2019/librispeech/train_am_tds_ctc.cfg --logtostderr=1
I1117 14:52:10.108525 53902 StreamingTDSModelConverter.cpp:152] [Network] Reading acoustic model from /data/podcaster/model/wav2letter/am_tds_ctc_librispeech_dev_other/am_tds_ctc_librispeech_dev_other.bin
I1117 14:52:10.856041 53902 StreamingTDSModelConverter.cpp:157] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> output]
(0): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
(1): View (-1 80 1 0)
(2): Conv2D (1->10, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(3): ReLU
(4): Dropout (0.000000)
(5): LayerNorm ( axis : { 0 1 2 } , size : -1)
(6): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(7): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(8): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(9): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(10): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(11): Conv2D (10->14, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(12): ReLU
(13): Dropout (0.000000)
(14): LayerNorm ( axis : { 0 1 2 } , size : -1)
(15): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(16): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(17): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(18): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(19): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(20): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(21): Conv2D (14->18, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(22): ReLU
(23): Dropout (0.000000)
(24): LayerNorm ( axis : { 0 1 2 } , size : -1)
(25): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(26): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(27): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(28): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(29): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(30): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(31): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(32): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(33): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(34): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(35): View (0 1440 1 0)
(36): Reorder (1,0,3,2)
(37): Linear (1440->9998) (with bias)
I1117 14:52:10.856158 53902 StreamingTDSModelConverter.cpp:158] [Criterion] ConnectionistTemporalClassificationCriterion
I1117 14:52:10.856165 53902 StreamingTDSModelConverter.cpp:159] [Network] Number of params: 203394122
I1117 14:52:10.856205 53902 StreamingTDSModelConverter.cpp:165] [Network] Updating flags from config file: /data/podcaster/model/wav2letter/am_tds_ctc_librispeech_dev_other/am_tds_ctc_librispeech_dev_other.bin
I1117 14:52:10.856637 53902 StreamingTDSModelConverter.cpp:174] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/data/podcaster/model/wav2letter/am_tds_ctc_librispeech_dev_other/am_tds_ctc_librispeech_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=am_arch/am_tds_ctc.arch; --archdir=/home/w2luser/Projects/wav2letter/recipes/models/sota/2019; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=4; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/home/w2luser/Projects/wav2letter/recipes/models/sota/2019/librispeech/train_am_tds_ctc.cfg; --framesizems=30; --framestridems=10; --gamma=0.5; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1500; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/home/w2luser/w2l/am/librispeech-train+dev-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.29999999999999999; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=1; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; 
--maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=8338608; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0.5; --netoptim=sgd; --noresample=false; --nthread=10; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=100; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=0; --rightWindowSize=50; --rndv_filepath=/checkpoint/qiantong/ls_200M/do0.15_l5.6.10_mid3.0_incDO/100_rndv; --rundir=[...]; --runname=am_tds_ctc_librispeech; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=; --seed=2; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=200; --surround=; --tag=; --target=ltr; --test=; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/home/w2luser/w2l/am; --train=[DATA_DST]/lists/train-clean-100.lst,[DATA_DST]/lists/train-clean-360.lst,[DATA_DST]/lists/train-other-500.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=dev-clean:[DATA_DST]/lists/dev-clean.lst,dev-other:[DATA_DST]/lists/dev-other.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; --world_size=64; --outdir=/home/w2luser/models; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; 
--stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I1117 14:52:10.876313 53902 StreamingTDSModelConverter.cpp:192] Number of classes (network): 9998
Skipping SpecAugment module: SAUG 80 27 2 100 1.0 2
Skipping View module: V -1 NFEAT 1 0
Skipping Dropout module: DO 0.0
Skipping Dropout module: DO 0.0
Skipping Dropout module: DO 0.0
Skipping View module: V 0 1440 1 0
Skipping Reorder module: RO 1 0 3 2
I1117 14:52:26.342659 53902 StreamingTDSModelConverter.cpp:289] Serializing acoustic model to '/home/w2luser/models/acoustic_model.bin'
I1117 14:52:36.974776 53902 StreamingTDSModelConverter.cpp:301] Writing tokens file to '/home/w2luser/models/tokens.txt'
I1117 14:52:36.977149 53902 StreamingTDSModelConverter.cpp:328] Serializing feature extraction model to '/home/w2luser/models/feature_extractor.bin'
I1117 14:52:36.980671 53902 StreamingTDSModelConverter.cpp:344] verifying serialization ...
F1117 14:52:37.219713 53902 StreamingTDSModelConverter.cpp:368] [Serialization Error] Mismatched output w2l:2.72653 vs streaming:12.5302
*** Check failure stack trace: ***
@ 0x7f4f9d8441c3 google::LogMessage::Fail()
@ 0x7f4f9d84925b google::LogMessage::SendToLog()
@ 0x7f4f9d843ebf google::LogMessage::Flush()
@ 0x7f4f9d8446ef google::LogMessageFatal::~LogMessageFatal()
@ 0x55f014b84301 main
@ 0x7f4f9d1eccb2 __libc_start_main
@ 0x55f014b80ade _start
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
Hey @vineelpratap,
Do you have any advice regarding the above?
Thanks!
hi @abhinavkulkarni
- Using LayerNorm without the time axis will change the model's output (the pretrained model was trained with LayerNorm over time), but you can still run it that way, since it doesn't change the output shape of the layer. The error you hit comes from wav2letter verifying, after serialization, that the streaming model's output matches the original model's; with this change the outputs are guaranteed to differ, so the check aborts. If you still want to use LayerNorm without the time axis, you could edit the code to skip this check, but that is only a workaround and the results will still be bad.
- The --localnrmlleftctx=300 option is used during feature extraction (normalizing the MFSC features), see: https://github.com/facebookresearch/wav2letter/blob/v0.2/src/data/Featurize.cpp#L106. As with the first point, setting it changes the features relative to what the model was trained on, so the results will differ as well. In conclusion, if you still want to use am_tds_ctc, it will not work well for streaming right now.
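To illustrate the feature-normalization point, here is a rough numpy sketch of left-context local normalization. This is an assumption about the behavior, not the exact Featurize.cpp logic (which may, e.g., compute statistics per coefficient); it only shows that a causal left-context window yields different features than whole-utterance normalization:

```python
import numpy as np

def local_norm_left_ctx(feats, left_ctx, eps=1e-5):
    # Normalize each frame with mean/std computed over itself plus up to
    # `left_ctx` preceding frames (a causal, streaming-friendly window).
    out = np.empty_like(feats)
    for t in range(len(feats)):
        window = feats[max(0, t - left_ctx): t + 1]
        out[t] = (feats[t] - window.mean()) / (window.std() + eps)
    return out

# Toy features: (frames, filterbanks), e.g. 80 MFSC coefficients per frame.
feats = np.random.default_rng(1).normal(size=(500, 80))

local = local_norm_left_ctx(feats, left_ctx=300)      # like --localnrmlleftctx=300
global_norm = (feats - feats.mean()) / (feats.std() + 1e-5)  # whole-utterance

print(local.shape == feats.shape)      # True: same feature layout
print(np.allclose(local, global_norm)) # False: the model sees different inputs
```

A model trained on whole-utterance-normalized features therefore receives out-of-distribution inputs when fed the left-context-normalized features, which matches the mismatch reported by the converter's verification step.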