nan during training.
Hi @songdejia, thanks for trying to port EAST from TensorFlow. While training this model on COCO 2014 or Oxford SynthText, the loss becomes NaN after a few iterations. Any ideas?
Please see the training log below:
Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
point dist to line raise Exception
point dist to line raise Exception
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Cross point does not exist
Exception continue
Exception in getitem, and choose another index:4393
EAST <==> TRAIN <==> Epoch: [0][1/227] Loss 0.0231 Avg Loss 0.0250)
EAST <==> TRAIN <==> Epoch: [0][2/227] Loss 0.0282 Avg Loss 0.0260)
EAST <==> TRAIN <==> Epoch: [0][3/227] Loss 0.0313 Avg Loss 0.0273)
EAST <==> TRAIN <==> Epoch: [0][4/227] Loss 0.0271 Avg Loss 0.0273)
EAST <==> TRAIN <==> Epoch: [0][5/227] Loss 0.0206 Avg Loss 0.0262)
EAST <==> TRAIN <==> Epoch: [0][6/227] Loss 0.0300 Avg Loss 0.0267)
EAST <==> TRAIN <==> Epoch: [0][7/227] Loss 0.0239 Avg Loss 0.0264)
EAST <==> TRAIN <==> Epoch: [0][8/227] Loss 0.0271 Avg Loss 0.0265)
EAST <==> TRAIN <==> Epoch: [0][9/227] Loss 0.0284 Avg Loss 0.0266)
EAST <==> TRAIN <==> Epoch: [0][10/227] Loss 0.0197 Avg Loss 0.0260)
EAST <==> TRAIN <==> Epoch: [0][11/227] Loss nan Avg Loss nan)
EAST <==> TRAIN <==> Epoch: [0][12/227] Loss nan Avg Loss nan)
Have you worked out any approach to solve the problem?
@Caius-Lu @songdejia has this happened to you too? I am trying to debug it. Suggestions welcome.
Getting the same issue
My guess is that data augmentation occasionally produces invalid samples, and those make the loss of the affected batch NaN. Tracking down which specific training images are responsible can be tedious, so my solution is to check whether the loss is NaN before backpropagation and, if so, skip the batch without any update.
Specifically, I modified the code in main.py as follows:
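# Copy the loss to the CPU as a NumPy value so it can be checked (assumes numpy is imported as np)
loss_check = loss1.cpu().detach().numpy()
if np.any(np.isnan(loss_check)):
    print('loss = nan, skip this batch')
    # Clear any stale gradients and move on without calling backward() or step()
    optimizer.zero_grad()
    continue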
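For anyone who wants to try the skip logic outside the repo, here is a minimal, framework-free sketch of the same idea. The names `run_epoch`, `compute_loss`, and `apply_update` are hypothetical and not part of this repository. It uses `math.isfinite`, so it also skips `inf` losses, which the snippet above does not check for. With PyTorch you would presumably pass something like `loss.item()` as the float.

```python
import math


def run_epoch(batches, compute_loss, apply_update):
    """Run one pass over `batches`, skipping any batch whose loss is NaN or inf.

    `compute_loss` and `apply_update` are hypothetical callables standing in
    for the forward pass and the backward/optimizer step of a real trainer.
    Returns the average loss over the batches that were used, plus the
    indices of the batches that were skipped.
    """
    skipped = []
    total = 0.0
    used = 0
    for index, batch in enumerate(batches):
        loss = float(compute_loss(batch))
        if not math.isfinite(loss):
            # Do not update the model on a bad batch; just remember its index
            # so the offending samples can be inspected later.
            skipped.append(index)
            continue
        apply_update(batch, loss)
        total += loss
        used += 1
    average = total / used if used else float("nan")
    return average, skipped


if __name__ == "__main__":
    # Toy data: each "batch" is simply its own loss value.
    toy_batches = [0.02, 0.03, float("nan"), 0.01, float("inf")]
    applied = []
    average, skipped = run_epoch(
        toy_batches,
        compute_loss=lambda batch: batch,
        apply_update=lambda batch, loss: applied.append(loss),
    )
    print("average loss over used batches:", average)
    print("skipped batch indices:", skipped)
```

Recording the skipped indices is a deliberate choice: it keeps training running while still leaving a trail back to the samples that produced the bad loss.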
@BYJRK What were your results on the ICDAR dataset?
@saharudra The best I can get is about 0.7 hmean on ICDAR 2015, after modifying the thresholds in eval.py and training for around 400 epochs. TBH, I don't think this will reproduce the performance reported in the paper. Anyway, I'm still trying to figure out how it differs from the TensorFlow version.