Help to solve training error

Open Hui-Ling opened this issue 5 years ago • 9 comments

Hello,

I was training a new model, but after around 100,000 iterations (probably still in the first epoch) I got the following error message.

*** Aborted at 1590963336 (unix time) try "date -d @1590963336" if you are using GNU date ***
PC: @ 0x7f4bfeb7707d cuda::evalNodes<>()
*** SIGFPE (@0x7f4bfeb7707d) received by PID 24362 (TID 0x7f4c29fed380) from PID 18446744073688019069; stack trace: ***
    @ 0x7f4c223ba390 (unknown)
    @ 0x7f4bfeb7707d cuda::evalNodes<>()
    @ 0x7f4bfeb77cbf cuda::evalNodes<>()
    @ 0x7f4bfe6bcaea cuda::Array<>::eval()
    @ 0x7f4bfd02f851 _ZN4cuda10reduce_allIL7af_op_t5EccEET1_RKNS_5ArrayIT0_EEbd
    @ 0x7f4bff585783 af_any_true_all
    @ 0x7f4bff75d584 af::anyTrue<>()
    @ 0x493640 _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddblE3_clES2_S5_S7_S9_S9_ddbl.constprop.12679
    @ 0x41c3b7 main
    @ 0x7f4bdf447830 __libc_start_main
    @ 0x48e5d9 _start
    @ 0x0 (unknown)
Floating point exception (core dumped)

The train logs only showed the evaluation results:

epoch: 1 | nupdates: 20000 | lr: 0.000100 | lrcriterion: 0.000100 | runtime: 02:00:07 | bch(ms): 360.38 | smp(ms): 197.40 | fwd(ms): 110.20 | crit-fwd(ms): 99.57 | bwd(ms): 27.02 | optim(ms): 2.31 | loss: 292.35750 | train-TER: 155.29 | train-WER: 100.00 | lists/dev_step2.csv-loss: 71.22854 | lists/dev_step2.csv-TER: 100.00 | lists/dev_step2.csv-WER: 100.00 | avg-isz: 1600 | avg-tsz: 129 | max-tsz: 545 | hrs: 355.60 | thrpt(sec/sec): 177.61
epoch: 1 | nupdates: 40000 | lr: 0.000100 | lrcriterion: 0.000100 | runtime: 02:00:14 | bch(ms): 360.74 | smp(ms): 201.06 | fwd(ms): 107.46 | crit-fwd(ms): 97.10 | bwd(ms): 27.18 | optim(ms): 2.29 | loss: 67.33223 | train-TER: 100.00 | train-WER: 100.00 | lists/dev_step2.csv-loss: 66.89943 | lists/dev_step2.csv-TER: 100.00 | lists/dev_step2.csv-WER: 100.00 | avg-isz: 1568 | avg-tsz: 126 | max-tsz: 627 | hrs: 348.47 | thrpt(sec/sec): 173.88
epoch: 1 | nupdates: 60000 | lr: 0.000100 | lrcriterion: 0.000100 | runtime: 02:00:18 | bch(ms): 360.93 | smp(ms): 202.52 | fwd(ms): 106.80 | crit-fwd(ms): 96.50 | bwd(ms): 26.63 | optim(ms): 2.29 | loss: 63.93924 | train-TER: 100.00 | train-WER: 100.00 | lists/dev_step2.csv-loss: 64.13627 | lists/dev_step2.csv-TER: 100.00 | lists/dev_step2.csv-WER: 100.00 | avg-isz: 1556 | avg-tsz: 125 | max-tsz: 393 | hrs: 345.85 | thrpt(sec/sec): 172.48
epoch: 1 | nupdates: 80000 | lr: 0.000100 | lrcriterion: 0.000100 | runtime: 02:00:09 | bch(ms): 360.50 | smp(ms): 197.16 | fwd(ms): 110.90 | crit-fwd(ms): 100.26 | bwd(ms): 27.35 | optim(ms): 2.29 | loss: 62.70409 | train-TER: 100.00 | train-WER: 100.00 | lists/dev_step2.csv-loss: 61.74130 | lists/dev_step2.csv-TER: 100.00 | lists/dev_step2.csv-WER: 100.00 | avg-isz: 1597 | avg-tsz: 129 | max-tsz: 440 | hrs: 355.06 | thrpt(sec/sec): 177.29
epoch: 1 | nupdates: 100000 | lr: 0.000100 | lrcriterion: 0.000100 | runtime: 02:00:10 | bch(ms): 360.51 | smp(ms): 190.87 | fwd(ms): 115.64 | crit-fwd(ms): 104.50 | bwd(ms): 28.64 | optim(ms): 2.30 | loss: 62.32267 | train-TER: 100.00 | train-WER: 100.00 | lists/dev_step2.csv-loss: 59.90308 | lists/dev_step2.csv-TER: 100.00 | lists/dev_step2.csv-WER: 100.00 | avg-isz: 1679 | avg-tsz: 135 | max-tsz: 567 | hrs: 373.16 | thrpt(sec/sec): 186.32

Could you please help me figure out what I am doing wrong? Thanks a lot!

Best, Ling

Hui-Ling avatar Jun 06 '20 04:06 Hui-Ling

Could you send your train config? How many GPUs are you running on? What is the dataset size?

tlikhomanenko avatar Jun 06 '20 05:06 tlikhomanenko

The following is the train config:

--datadir=/media/ubuntu/HDD2/wav2letter/Proj
--rundir=/media/ubuntu/HDD2/wav2letter/Proj/models
--archdir=/media/ubuntu/HDD2/wav2letter/Proj
--train=lists/train_step2.csv
--valid=lists/dev_step2.csv
--input=wav
--arch=network.arch
--tokens=/media/ubuntu/HDD2/wav2letter/Proj/am/tokens2.txt
--lexicon=/media/ubuntu/HDD2/wav2letter/Proj/am/lexicon2.txt
--criterion=ctc
--lr=0.0001
--lrcrit=0.0001
--maxgradnorm=1.0
--replabel=1
--surround=|
--onorm=target
--sqnorm=true
--mfsc=true
--filterbanks=40
--nthread=4
--batchsize=4
--runname=trainlogs
--iter=1196042
--reportiters=20000
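(For reference, assuming these flags are saved to a flagsfile such as train.cfg — a hypothetical filename — training is typically launched roughly like this from the wav2letter build directory:

./Train train --flagsfile=train.cfg --logtostderr=1

Since gflags processes flags left to right, a flag passed on the command line after --flagsfile should override the value from the file, which is handy for quick experiments.)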

I used only one GPU (titanXP). The training set has 478415 samples.

Thanks!

Hui-Ling avatar Jun 06 '20 06:06 Hui-Ling

Your learning rate is too small for SGD (in my experience), so try a larger one. As for the crash: try running on, say, 1000 samples; the problem is probably in the validation set. Do you see the same error?
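(A concrete sketch of the 1000-sample suggestion, assuming the list files are plain text with one sample per line; the filenames train_small.csv and dev_small.csv are made up:

head -n 1000 lists/train_step2.csv > lists/train_small.csv
head -n 1000 lists/dev_step2.csv > lists/dev_small.csv
# then rerun training with --train=lists/train_small.csv --valid=lists/dev_small.csv

If the crash only reproduces with the full dev list, that points at a bad sample in the validation set.)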

tlikhomanenko avatar Jun 06 '20 18:06 tlikhomanenko

Probably this: https://github.com/facebookresearch/wav2letter/issues/709#issue-641807070

viig99 avatar Jun 19 '20 08:06 viig99

@viig99 gamma is not used here (by default it is 1), so there is no lr decay happening.
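(For context on the lr-decay flags being discussed: as far as I remember the Train flags, decay is controlled by something like the following; treat the exact names and semantics as an assumption and check the flag documentation:

--gamma=0.5        # factor the lr is multiplied by at each decay step; the default of 1.0 means no decay
--stepsize=100000  # number of updates between decay steps

With the default gamma of 1.0, as noted above, the learning rate never decays.)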

tlikhomanenko avatar Jun 20 '20 00:06 tlikhomanenko

I am getting the same error trying to train TDS with CTC. I run the training command and after a few seconds it is interrupted with this error. It's worth noting that I'm running in Docker and the image passes all 31 tests.

Here is my train configuration: [screenshot attached]

And here is my nvidia-smi output: [screenshot attached]

This is the first time I've had this problem. I already tried LexFree and Transformer models and no such error appeared.

Any suggestions?

Bernardo-Favoreto avatar Aug 11 '20 11:08 Bernardo-Favoreto

@Bernardo-Favoreto what error exactly do you get? Could you post your log here?

tlikhomanenko avatar Aug 12 '20 04:08 tlikhomanenko

Sure @tlikhomanenko, here it is:

*** Aborted at 1590963336 (unix time) try "date -d @1590963336" if you are using GNU date ***
PC: @ 0x7f4bfeb7707d cuda::evalNodes<>()
*** SIGFPE (@0x7f4bfeb7707d) received by PID 24362 (TID 0x7f4c29fed380) from PID 18446744073688019069; stack trace: ***
    @ 0x7f4c223ba390 (unknown)
    @ 0x7f4bfeb7707d cuda::evalNodes<>()
    @ 0x7f4bfeb77cbf cuda::evalNodes<>()
    @ 0x7f4bfe6bcaea cuda::Array<>::eval()
    @ 0x7f4bfd02f851 _ZN4cuda10reduce_allIL7af_op_t5EccEET1_RKNS_5ArrayIT0_EEbd
    @ 0x7f4bff585783 af_any_true_all
    @ 0x7f4bff75d584 af::anyTrue<>()
    @ 0x493640 _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddblE3_clES2_S5_S7_S9_S9_ddbl.constprop.12679
    @ 0x41c3b7 main
    @ 0x7f4bdf447830 __libc_start_main
    @ 0x48e5d9 _start
    @ 0x0 (unknown)
Floating point exception (core dumped)

Bernardo-Favoreto avatar Aug 12 '20 11:08 Bernardo-Favoreto

Please try any combination of the following to get details about the root cause:

export AF_PRINT_ERRORS=1
export AF_TRACE=mem
#export AF_TRACE=mem,unified
export AF_MAX_BUFFERS=100
#export AF_JIT_KERNEL_TRACE=stdout

export CUDNN_LOGINFO_DBG=1
export CUDNN_LOGDEST_DBG=stderr
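(Putting it together — assuming the variables above are exported in the same shell and a flagsfile named train.cfg, which is just a placeholder name — the extra ArrayFire/cuDNN diagnostics can be captured on stderr like this:

./Train train --flagsfile=train.cfg 2> train_debug.log

The tail of train_debug.log may then show more detail about what was allocated or running right before the SIGFPE.)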

avidov avatar Aug 12 '20 21:08 avidov