FastMaskRCNN

Overflow encountered in exp after 10 iters, and Segmentation fault (core dumped) after 40 iters.

Open maxenceliu opened this issue 7 years ago • 15 comments

Has anyone encountered these two problems and fixed them? Overflow in exp after 10 iters, and Segmentation fault after 40 iters.

maxenceliu avatar Apr 14 '17 02:04 maxenceliu

Hi @maxenceliu, how long does it take per iteration?

lihungchieh avatar Apr 18 '17 11:04 lihungchieh

2-3 seconds per iteration on a GTX 1080.

maxenceliu avatar Apr 18 '17 15:04 maxenceliu

I've encountered the same problem. It seems that the regressed dw and dh grow too large, which makes the exp function overflow.
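A common guard in later Faster/Mask R-CNN implementations is to clamp dw and dh before exponentiating. A minimal sketch, not taken from this repo's bbox_transform.py, using the usual log(1000/16) clip value:

```python
import numpy as np

# Cap on dw/dh before exp. Anything larger than ~log(1000/16) is almost
# certainly a diverging regression output rather than a plausible box.
BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)

def bbox_transform_inv_safe(boxes, deltas):
    """Apply (dx, dy, dw, dh) deltas to boxes, clamping dw/dh so np.exp cannot overflow."""
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx, dy = deltas[:, 0::4], deltas[:, 1::4]
    dw, dh = deltas[:, 2::4], deltas[:, 3::4]

    # The actual fix: clip before exp.
    dw = np.minimum(dw, BBOX_XFORM_CLIP)
    dh = np.minimum(dh, BBOX_XFORM_CLIP)

    pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
    pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]

    pred_boxes = np.zeros(deltas.shape, dtype=deltas.dtype)
    pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
    pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
    pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
    pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h
    return pred_boxes
```

This only prevents the numerical overflow downstream; if the regression loss itself is diverging, the underlying training problem still needs to be fixed.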


Lafi avatar Apr 19 '17 03:04 Lafi

After the newest commit, total_loss explodes after about 350 iterations because rpn_cls_loss explodes.

maxenceliu avatar Apr 20 '17 03:04 maxenceliu

The result is not stable. This time, regular_loss became NaN after 500 iters...

maxenceliu avatar Apr 20 '17 07:04 maxenceliu

I also encountered the total_loss explosion when trying the newest commit. I implemented a Caffe version of Mask R-CNN and ran into the same problem.

Here is the loss with the newest commit:

iter 583: image-id:0272412, time:0.525(sec), regular_loss: 0.167962, total-loss 503.3103(0.0436, 488.1062, 0.000484, 14.5497, 0.6103), instances: 22, batch:(250|1016, 21|86, 21|21)

iter 584: image-id:0262213, time:0.359(sec), regular_loss: 0.177546, total-loss 739.0580(47.7280, 112.5653, 1.444757, 577.0145, 0.3054), instances: 1, batch:(1|33, 1|19, 1|1)

iter 585: image-id:0534559, time:0.429(sec), regular_loss: 0.355617, total-loss nan(nan, 1685118073183372762735414607872.0000, nan, 713696030880020606040835379691520.0000, 4810318291543261184.0000), instances: 16, batch:(128|528, 14|18, 14|14)

KeyKy avatar Apr 24 '17 10:04 KeyKy

I got a similar issue at iter 493; it seems to be caused by the RPN loss exploding. We might need to double-check the box matching strategy.

The error:

iter 493: image-id:0254301, time:1.600(sec), regular_loss: 0.215555, total-loss 0.9129(0.0529, 0.8338, 0.000000, 0.0263, 0.0000), instances: 29, batch:(89|372, 0|64, 0|0)
train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in multiply
  pred_h = np.exp(dh) * heights[:, np.newaxis]
iter 494: image-id:0115028, time:0.463(sec), regular_loss: 0.215730, total-loss 265.2788(9.3407, 255.5391, 0.000000, 0.3989, 0.0000), instances: 5, batch:(14|69, 0|64, 0|0)
train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in multiply
  pred_w = np.exp(dw) * widths[:, np.newaxis]
iter 495: image-id:0428026, time:0.363(sec), regular_loss: 0.221351, total-loss 773565.1250(48176.1797, 704032.8750, 0.000000, 21356.0664, 0.0000), instances: 1, batch:(17|92, 0|8, 0|0)
train/../libs/layers/sample.py:144: RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
iter 496: image-id:0165883, time:0.399(sec), regular_loss: 61892.210938, total-loss nan(nan, nan, 0.009564, 0.2174, 407103320661720213487616.0000), instances: 4, batch:(6|45, 12|76, 12|12)
[[ 365.89337158  268.93334961  729.21331787  530.10668945   17.        ]
 [ 447.27999878   25.89333344  759.37341309  428.58666992    1.        ]
 [ 134.04000854  234.01333618  334.67999268  385.97335815    1.        ]
 [ 353.80001831  273.25335693  852.85339355  632.81335449   60.        ]]
Traceback (most recent call last):
  File "train/train.py", line 195, in <module>
    train()
  File "train/train.py", line 178, in train
    raise
TypeError: exceptions must be old-style classes or derived from BaseException, not NoneType
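The "invalid value encountered in greater_equal" warning means NaN box coordinates (produced once exp overflows upstream) are reaching the min-size test in sample.py. A defensive sketch, assuming boxes are rows of [x1, y1, x2, y2], could drop non-finite proposals at that point:

```python
import numpy as np

def filter_proposals(boxes, min_size):
    """Return indices of proposals that are finite and at least min_size wide/tall.

    This mirrors the min-size test quoted from sample.py above, but drops
    NaN/Inf boxes instead of letting them propagate further into the pipeline.
    """
    ws = boxes[:, 2] - boxes[:, 0] + 1.0
    hs = boxes[:, 3] - boxes[:, 1] + 1.0
    finite = np.all(np.isfinite(boxes), axis=1)
    with np.errstate(invalid='ignore'):  # NaN comparisons would otherwise warn
        keep = np.where(finite & (ws >= min_size) & (hs >= min_size))[0]
    return keep
```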

Nikasa1889 avatar Apr 27 '17 21:04 Nikasa1889

I experienced the same issue, then updated to cuda_8.0.61_375.26 and cuDNN 5.1, and it went away. It could be sporadic too.

opikalo avatar Apr 28 '17 22:04 opikalo

I confirm that upgrading to CUDA 8.0 fixed the problem. Thank you very much @opikalo

Nikasa1889 avatar Apr 29 '17 21:04 Nikasa1889

I also encountered this problem.

After 29683 iters, it gives warnings:

train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]

Then, in iter 29684, the loss becomes unusual:

iter 29684: image-id:0094949, time:0.605(sec), regular_loss: 0.179757, total-loss 85438849024.0000(163221872.0000, 73605677056.0000, 30830362.000000, 11639122944.0000, 3.2994), instances: 8, batch:(125|524, 8|12, 8|8)
iter 29685: image-id:0357095, time:0.688(sec), regular_loss: 10989575769948160.000000, total-loss 2035863.0000(0.0033, 0.1700, 0.000137, 2035862.8750, 0.0118), instances: 2, batch:(32|152, 2|32, 2|2)
iter 29686: image-id:0094952, time:0.764(sec), regular_loss: nan, total-loss 5372209.0000(0.0358, 0.2918, 0.000548, 5372208.5000, 0.0244), instances: 9, batch:(312|1256, 9|46, 9|9)

NaN happens...
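One way to catch this earlier, rather than letting NaN propagate into the weights, is to assert that every printed loss term is finite each iteration and stop immediately. A minimal sketch (the loss names here are placeholders, not the repo's variable names):

```python
import numpy as np

def check_finite(step, loss_values, names=None):
    """Raise as soon as any loss term goes NaN/Inf, so the offending iteration
    (and the last good checkpoint) is easy to pin down."""
    names = names or ['loss_%d' % i for i in range(len(loss_values))]
    for name, value in zip(names, loss_values):
        if not np.isfinite(value):
            raise FloatingPointError('iter %d: %s is %r' % (step, name, value))

# e.g. the values printed at iter 29686 above would trigger on regular_loss:
# check_finite(29686, [float('nan'), 5372209.0], names=['regular_loss', 'total_loss'])
```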

@opikalo @Nikasa1889 My cuda version is 8.0 and cudnn is 5.1 already.

Kongsea avatar May 04 '17 09:05 Kongsea

I've gotten the overflow error quite a few times, all without changing anything. It seems the overflow errors occur randomly, possibly caused by poor convergence in the weights. Unfortunately, the trick for now is to simply restart the training and hope it doesn't overflow again; that's working for me so far. @Kongsea, it seems you got pretty lucky reaching 29000+ iters before seeing overflow; my first overflow was at < 1000 iters.

mrlooi avatar May 05 '17 10:05 mrlooi

I would suggest changing the checkpoint interval in train/train.py from 10000 to a smaller value, e.g. 3000, so that you have more checkpoints in case of an overflow error.
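As a rough sketch of what I mean, with `saver` and `sess` standing in for whatever train/train.py actually uses (names assumed, not copied from the repo):

```python
CKPT_EVERY = 3000  # instead of the default 10000

def maybe_checkpoint(saver, sess, step, path='./output/mask_rcnn'):
    """Save a checkpoint every CKPT_EVERY iterations so an overflow/NaN
    costs at most CKPT_EVERY iterations of progress."""
    if step > 0 and step % CKPT_EVERY == 0:
        saver.save(sess, path, global_step=step)  # tf.train.Saver API
```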

mrlooi avatar May 05 '17 10:05 mrlooi

Can you "reproduce" the random occurrence with a certain seed for the random initialization?
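If someone wants to try, a minimal seeding sketch for the TF 1.x API this repo targets could look like the following; note that cuDNN kernels are not fully deterministic, so a fixed seed alone may not reproduce the blow-up exactly:

```python
import random
import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary choice

random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)  # TF 1.x graph-level seed
```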

kevinkit avatar May 05 '17 11:05 kevinkit

I had a similar error. My CUDA is 8.0 and cuDNN is 5.1. I found that I hadn't added the CUDA paths to the environment.

export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH

After adding the paths, the problem was solved.

tianzq avatar May 10 '17 14:05 tianzq

Check my comment here

meetps avatar Feb 19 '18 08:02 meetps