nan during training.

Open logodeeplearning opened this issue 7 years ago • 6 comments

Hi @songdejia, thanks for porting EAST from TensorFlow. While trying to train this model on COCO 2014 or Oxford SynthText, I get nan losses during training. Any ideas?

Please see below training Log:

Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Exception continue
Exception in getitem, and choose another index:4393

EAST <==> TRAIN <==> Epoch: [0][1/227] Loss 0.0231 Avg Loss 0.0250)

EAST <==> TRAIN <==> Epoch: [0][2/227] Loss 0.0282 Avg Loss 0.0260)

EAST <==> TRAIN <==> Epoch: [0][3/227] Loss 0.0313 Avg Loss 0.0273)

EAST <==> TRAIN <==> Epoch: [0][4/227] Loss 0.0271 Avg Loss 0.0273)

EAST <==> TRAIN <==> Epoch: [0][5/227] Loss 0.0206 Avg Loss 0.0262)

EAST <==> TRAIN <==> Epoch: [0][6/227] Loss 0.0300 Avg Loss 0.0267)

EAST <==> TRAIN <==> Epoch: [0][7/227] Loss 0.0239 Avg Loss 0.0264)

EAST <==> TRAIN <==> Epoch: [0][8/227] Loss 0.0271 Avg Loss 0.0265)

EAST <==> TRAIN <==> Epoch: [0][9/227] Loss 0.0284 Avg Loss 0.0266)

EAST <==> TRAIN <==> Epoch: [0][10/227] Loss 0.0197 Avg Loss 0.0260)

EAST <==> TRAIN <==> Epoch: [0][11/227] Loss nan Avg Loss nan)

EAST <==> TRAIN <==> Epoch: [0][12/227] Loss nan Avg Loss nan)
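
The "point dist to line raise Exception" messages in the log above hint that some annotations may be degenerate, for example a box edge whose two endpoints coincide. The sketch below is not the repository's code: `safe_point_dist_to_line` and its `eps` parameter are hypothetical names used only for illustration. It shows one way to compute a point-to-line distance that reports the degenerate case explicitly instead of dividing by zero and producing inf or nan.

```python
import math


def safe_point_dist_to_line(p1, p2, p3, eps=1e-8):
    """Distance from p3 to the line through p1 and p2.

    Returns None when p1 and p2 (nearly) coincide, because the line is
    then undefined and the usual formula would divide by zero.
    Hypothetical helper, not taken from the repository.
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    if length < eps:
        return None
    # 2-D cross product of (p2 - p1) and (p1 - p3)
    cross = dx * (p1[1] - p3[1]) - dy * (p1[0] - p3[0])
    return abs(cross) / length


# A well-formed edge and a degenerate one (both endpoints identical)
print(safe_point_dist_to_line((0, 0), (4, 0), (2, 3)))
print(safe_point_dist_to_line((1, 1), (1, 1), (5, 5)))
```

A caller could drop any sample for which the helper returns None, so that the bad geometry never reaches the loss.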

logodeeplearning avatar Dec 30 '18 01:12 logodeeplearning

Have you worked out any approach to solve the problem?

Caius-Lu avatar Dec 30 '18 07:12 Caius-Lu

@Caius-Lu @songdejia Has this happened to you too? I am trying to debug it. Suggestions welcome.

logodeeplearning avatar Dec 30 '18 16:12 logodeeplearning

Getting the same issue

viig99 avatar Jan 12 '19 04:01 viig99

My guess is that data augmentation occasionally produces invalid samples, and those make the loss for that batch nan. Tracking down which specific training images are responsible can be tedious, so my solution is to check whether the loss is nan before back propagation and, if so, skip the batch without any update.

Specifically, I modified the code in main.py as:

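# Inside the training loop, after loss1 is computed (requires `import numpy as np`)
loss_check = loss1.cpu().detach().numpy()
if np.any(np.isnan(loss_check)):
    print('loss = nan, skip this batch')
    optimizer.zero_grad()  # clear any gradients before moving on
    continue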
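
The same idea can be sketched without any framework, which also makes it easy to record which batches were skipped and to keep nan out of the running average (the "Avg Loss nan" seen in the log). This is a minimal sketch under stated assumptions, not the code in main.py: `train_one_epoch`, `compute_loss`, and `apply_update` are hypothetical names. In PyTorch the equivalent check on a loss tensor would be `torch.isfinite(loss).all()`.

```python
import math


def train_one_epoch(batches, compute_loss, apply_update):
    """Run one epoch, skipping any batch whose loss is nan or infinite.

    Returns (average of the finite losses, indices of skipped batches).
    All names here are hypothetical placeholders for the real training code.
    """
    total, used, skipped = 0.0, 0, []
    for i, batch in enumerate(batches):
        loss = float(compute_loss(batch))
        if not math.isfinite(loss):
            skipped.append(i)  # remember the batch so its images can be inspected
            continue           # no backward pass, no optimizer step
        apply_update(loss)
        total += loss
        used += 1
    avg = total / used if used else float("nan")
    return avg, skipped


# Toy usage: each "batch" is already its own loss value
avg, skipped = train_one_epoch(
    [0.0231, float("nan"), 0.0282],
    compute_loss=lambda batch: batch,
    apply_update=lambda loss: None,
)
print(avg, skipped)
```

Logging the skipped indices gives a starting point for finding the offending images, rather than silently discarding them.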

BYJRK avatar Mar 31 '19 12:03 BYJRK

@BYJRK What were your results on the ICDAR dataset?

saharudra avatar Apr 01 '19 18:04 saharudra

@saharudra The best I can reach on ICDAR 2015 is about 0.7 hmean, after modifying the thresholds in eval.py and training for around 400 epochs. TBH, I don't think this will reproduce the performance reported in the paper. Anyway, I am still trying to figure out how it differs from the TensorFlow version.

BYJRK avatar Apr 09 '19 12:04 BYJRK