wav2letter
Model not converging - Resnet architecture
I have a 200-hour training set and a 20-hour validation set. The loss is still infinity even after 44 epochs. I tried different learning rates, but it made no difference. Here is the arch:
SAUG 80 27 2 100 1.0 2
V -1 1 NFEAT 0
C NFEAT 512 3 2 -1 1
R
DO 0.15
LN 0 1 2
# block1 kernel 5
RES 9 1 1
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
SKIP 0 10 0.70711
R
DO 0.15
LN 0 1 2
# block1 kernel 5
RES 9 1 1
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
SKIP 0 10 0.70711
R
DO 0.15
LN 0 1 2
M 2 1 2 1
# block1 kernel 5
RES 9 1 1
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
SKIP 0 10 0.70711
R
DO 0.15
LN 0 1 2
# block1 kernel 5
RES 9 1 1
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
SKIP 0 10 0.70711
R
DO 0.15
LN 0 1 2
# block1 kernel 5
RES 9 1 1
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
SKIP 0 10 0.70711
R
DO 0.15
LN 0 1 2
# block1 kernel 5
RES 9 1 1
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
SKIP 0 10 0.70711
R
DO 0.15
LN 0 1 2
M 2 1 2 1
# block1 kernel 5
RES 9 1 1
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
SKIP 0 10 0.70711
R
DO 0.15
LN 0 1 2
C 512 512 3 1 -1 1
R
DO 0.15
LN 0 1 2
# block1 kernel 5
RES 9 1 1
C 512 512 3 1 -1 1
R
DO 0.20
LN 0 1 2
C 512 512 3 1 -1 1
R
DO 0.20
LN 0 1 2
C 512 512 3 1 -1 1
SKIP 0 10 0.70711
R
DO 0.20
LN 0 1 2
# block1 kernel 5
RES 9 1 1
C 512 512 3 1 -1 1
R
DO 0.20
LN 0 1 2
C 512 512 3 1 -1 1
R
DO 0.20
LN 0 1 2
C 512 512 3 1 -1 1
SKIP 0 10 0.70711
R
DO 0.20
LN 0 1 2
C 512 1024 3 1 -1 1
R
DO 0.20
LN 0 1 2
M 2 1 2 1
# block1 kernel 5
RES 9 1 1
C 1024 1024 3 1 -1 1
R
DO 0.25
LN 0 1 2
C 1024 1024 3 1 -1 1
R
DO 0.25
LN 0 1 2
C 1024 1024 3 1 -1 1
SKIP 0 10 0.70711
R
DO 0.25
LN 0 1 2
C 1024 1024 3 1 -1 1
R
LN 0 1 2
DO 0.25
C 1024 NLABEL 1 1 -1 1
RO 2 0 3 1
Here is the log file: logfile.txt
The train.cfg file:
--runname=resnet_v2_1
--rundir=/data/ahnaf/wav2letter/dataset_prep/all_models/
--datadir=/data/ahnaf/wav2letter/dataset_prep/
--tokensdir=/data/ahnaf/wav2letter/dataset_prep/
--train=train_updated.lst
--valid=validation.lst
--lexicon=/data/ahnaf/wav2letter/dataset_prep/lexicon.txt
--input=wav
--tokens=tokens_normal.txt
--archdir=/data/ahnaf/wav2letter/dataset_prep/
--arch=network_backup.arch
--criterion=ctc
--mfsc
--lr=0.4
--lrcrit=0.006
--lrcosine
--onorm=target
--sqnorm
--momentum=0.6
--maxgradnorm=1
--nthread=7
--batchsize=4
--filterbanks=80
--iter=800000
--reportiters=0
--logtostderr
--enable_distributed
--warmup=0
Then I used the same arch to train on only 20 hours of data to get the model to overfit; the validation set was 20 minutes. Up to epoch 40 the training loss is infinity and the TER and WER are the same as before. Instead of lrcosine, I also tried reducing the learning rate slowly, but to no avail (I ran 25 epochs, though). With lrcosine, once the learning rate becomes zero it never increases again.
cc @tlikhomanenko @jacobkahn
@samin9796
Can you first try running without specaug and with --lrcosine=false, training with a constant lr? Please also attach a training log where the parameters are printed so we can check all your settings (your lr decay is definitely wrong; some parameter is probably affecting it).
About your loss: once it is inf, it cannot recover. Can you set --reportiters=100 for debugging to see what is happening at the beginning?
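A minimal sketch of the change being suggested, relative to the train.cfg above (the constant lr value is only illustrative; note that specaug is the SAUG line at the top of the .arch file, so disabling it means deleting that line rather than changing a flag):

# debugging run: constant lr, no cosine schedule, frequent reports
--lrcosine=false
--lr=0.1
--reportiters=100
# also remove the "SAUG 80 27 2 100 1.0 2" line from the .arch file to disable specaug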
@tlikhomanenko I set the lr to 0.1 and trained without specaug and lrcosine. However, the model is still not converging. Here is the log: 001_log.txt This is the config file: config.txt
Are you training a letter-based acoustic model? Could you show me the head of your tokens set and lexicon file?
@tlikhomanenko Yes, I am training a character-based acoustic model. This is the lexicon file:
বাগান ব া গ া ন |
লেখাই ল ে খ া ই |
জন্মায়নি জ ন ্ ম া য ় ন ি |
রুনুর র ু ন ু র |
ফ্র্যাঞ্চাইজি ফ ্ র ্ য া ঞ ্ চ া ই জ ি |
রাসবিহারী র া স ব ি হ া র ী |
লাটু ল া ট ু |
বুকিটে ব ু ক ি ট ে |
পেশা প ে শ া |
ঔষধের ঔ ষ ধ ে র |
Token set:
|
'
_
অ
আ
ই
ঈ
উ
ঊ
ঋ
এ
ঐ
Could you run with --warmup=1 --reportiters=1 --surround=| ?
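If these are added to the flagfile rather than passed on the command line, the addition would just be the same flags, one per line:

--warmup=1
--reportiters=1
--surround=|

With --reportiters=1 the trainer logs every iteration, which should help pinpoint the exact batch where the loss first becomes inf.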
@tlikhomanenko I ran it as you said. Here is the log file: 001_log.txt I got an inf loss value at some iterations and eventually a loss-is-NaN error. I tried different lr values from 0.0001 to 0.4 but am still getting this error. However, I am training another model on a different dataset and this time I am not getting inf loss values. The first dataset has audio of 10-15 seconds in duration; the latter has 2-5 second audio files.
Possibly you have problems with the data itself. You could try filtering it with minisz, maxisz, mintsz, maxtsz. Durations of 10-15 s should be fine, as should 2-5 s; we trained our models on utterances up to 36 s for Librivox, for example. One thing to check is that all the audio has the same format. You can also try running a simpler model (fewer layers) to see if the error still persists.
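A sketch of that filtering as flagfile entries; the threshold values below are placeholders only, and the units (input length in milliseconds for minisz/maxisz, target length in tokens for mintsz/maxtsz, as far as I recall) should be verified against the flag help of your wav2letter build:

# drop utterances with suspiciously short/long audio or transcripts (placeholder values)
--minisz=1000
--maxisz=20000
--mintsz=1
--maxtsz=400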