FastMaskRCNN

Overflow encountered in exp after 10 iters, and Segmentation fault (core dumped) after 40 iters.

Open maxenceliu opened this issue 7 years ago • 15 comments

Has anyone encountered these two problems and fixed them? Overflow in exp after 10 iters, and Segmentation fault after 40 iters.

maxenceliu avatar Apr 14 '17 02:04 maxenceliu

Hi @maxenceliu, how long does it take per iteration?

lihungchieh avatar Apr 18 '17 11:04 lihungchieh

2-3 seconds per iteration on a GTX 1080.

maxenceliu avatar Apr 18 '17 15:04 maxenceliu

I've encountered the same problem. It seems that the regressed dw and dh grow too large, which makes the exp function overflow.
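A common guard in later Faster/Mask R-CNN implementations is to clamp dw and dh before exponentiating. A minimal sketch, not taken from this repo's bbox_transform.py, using the usual log(1000/16) clip value:

```python
import numpy as np

# Cap on dw/dh before exp. Anything larger than ~log(1000/16) is almost
# certainly a diverging regression output rather than a plausible box.
BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)

def bbox_transform_inv_safe(boxes, deltas):
    """Apply (dx, dy, dw, dh) deltas to boxes, clamping dw/dh so np.exp cannot overflow."""
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx, dy = deltas[:, 0::4], deltas[:, 1::4]
    dw, dh = deltas[:, 2::4], deltas[:, 3::4]

    # The actual fix: clip before exp.
    dw = np.minimum(dw, BBOX_XFORM_CLIP)
    dh = np.minimum(dh, BBOX_XFORM_CLIP)

    pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
    pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]

    pred_boxes = np.zeros(deltas.shape, dtype=deltas.dtype)
    pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
    pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
    pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
    pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h
    return pred_boxes
```

This only prevents the numerical overflow downstream; if the regression loss itself is diverging, the underlying training problem still needs to be fixed.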


Lafi avatar Apr 19 '17 03:04 Lafi

After the newest commit, total_loss explodes after about 350 iterations because rpn_cls_loss explodes.

maxenceliu avatar Apr 20 '17 03:04 maxenceliu

The result is not stable. This time, regular_loss became NaN after 500 iters...

maxenceliu avatar Apr 20 '17 07:04 maxenceliu

I also encountered the total_loss explosion when trying the newest commit. I implemented a Caffe version of Mask R-CNN and ran into the same problem.

Here is the loss with the newest commit:

iter 583: image-id:0272412, time:0.525(sec), regular_loss: 0.167962, total-loss 503.3103(0.0436, 488.1062, 0.000484, 14.5497, 0.6103), instances: 22, batch:(250|1016, 21|86, 21|21)

iter 584: image-id:0262213, time:0.359(sec), regular_loss: 0.177546, total-loss 739.0580(47.7280, 112.5653, 1.444757, 577.0145, 0.3054), instances: 1, batch:(1|33, 1|19, 1|1)

iter 585: image-id:0534559, time:0.429(sec), regular_loss: 0.355617, total-loss nan(nan, 1685118073183372762735414607872.0000, nan, 713696030880020606040835379691520.0000, 4810318291543261184.0000), instances: 16, batch:(128|528, 14|18, 14|14)

KeyKy avatar Apr 24 '17 10:04 KeyKy

I got a similar issue at iter 493; it seems to be caused by the RPN loss exploding. We might need to double-check the box matching strategy.

The error:

iter 493: image-id:0254301, time:1.600(sec), regular_loss: 0.215555, total-loss 0.9129(0.0529, 0.8338, 0.000000, 0.0263, 0.0000), instances: 29, batch:(89|372, 0|64, 0|0)
train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in multiply
  pred_h = np.exp(dh) * heights[:, np.newaxis]
iter 494: image-id:0115028, time:0.463(sec), regular_loss: 0.215730, total-loss 265.2788(9.3407, 255.5391, 0.000000, 0.3989, 0.0000), instances: 5, batch:(14|69, 0|64, 0|0)
train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in multiply
  pred_w = np.exp(dw) * widths[:, np.newaxis]
iter 495: image-id:0428026, time:0.363(sec), regular_loss: 0.221351, total-loss 773565.1250(48176.1797, 704032.8750, 0.000000, 21356.0664, 0.0000), instances: 1, batch:(17|92, 0|8, 0|0)
train/../libs/layers/sample.py:144: RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
iter 496: image-id:0165883, time:0.399(sec), regular_loss: 61892.210938, total-loss nan(nan, nan, 0.009564, 0.2174, 407103320661720213487616.0000), instances: 4, batch:(6|45, 12|76, 12|12)
[[ 365.89337158  268.93334961  729.21331787  530.10668945   17.        ]
 [ 447.27999878   25.89333344  759.37341309  428.58666992    1.        ]
 [ 134.04000854  234.01333618  334.67999268  385.97335815    1.        ]
 [ 353.80001831  273.25335693  852.85339355  632.81335449   60.        ]]
Traceback (most recent call last):
  File "train/train.py", line 195, in <module>
    train()
  File "train/train.py", line 178, in train
    raise
TypeError: exceptions must be old-style classes or derived from BaseException, not NoneType
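The "invalid value encountered in greater_equal" warning means NaN box coordinates (produced once exp overflows upstream) are reaching the min-size test in sample.py. A defensive sketch, assuming boxes are rows of [x1, y1, x2, y2], could drop non-finite proposals at that point:

```python
import numpy as np

def filter_proposals(boxes, min_size):
    """Return indices of proposals that are finite and at least min_size wide/tall.

    This mirrors the min-size test quoted from sample.py above, but drops
    NaN/Inf boxes instead of letting them propagate further into the pipeline.
    """
    ws = boxes[:, 2] - boxes[:, 0] + 1.0
    hs = boxes[:, 3] - boxes[:, 1] + 1.0
    finite = np.all(np.isfinite(boxes), axis=1)
    with np.errstate(invalid='ignore'):  # NaN comparisons would otherwise warn
        keep = np.where(finite & (ws >= min_size) & (hs >= min_size))[0]
    return keep
```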

Nikasa1889 avatar Apr 27 '17 21:04 Nikasa1889

I experienced the same issue, then updated to cuda_8.0.61_375.26 and cuDNN 5.1, and it went away. It could be sporadic too.

opikalo avatar Apr 28 '17 22:04 opikalo

I confirm that upgrading to CUDA 8.0 fixed the problem. Thank you very much @opikalo

Nikasa1889 avatar Apr 29 '17 21:04 Nikasa1889

I also encountered this problem.

After 29683 iters, it gives warnings:

train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]

Then, in iter 29684, the loss becomes unusual:

iter 29684: image-id:0094949, time:0.605(sec), regular_loss: 0.179757, total-loss 85438849024.0000(163221872.0000, 73605677056.0000, 30830362.000000, 11639122944.0000, 3.2994), instances: 8, batch:(125|524, 8|12, 8|8)
iter 29685: image-id:0357095, time:0.688(sec), regular_loss: 10989575769948160.000000, total-loss 2035863.0000(0.0033, 0.1700, 0.000137, 2035862.8750, 0.0118), instances: 2, batch:(32|152, 2|32, 2|2)
iter 29686: image-id:0094952, time:0.764(sec), regular_loss: nan, total-loss 5372209.0000(0.0358, 0.2918, 0.000548, 5372208.5000, 0.0244), instances: 9, batch:(312|1256, 9|46, 9|9)

NaN happens...
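One way to catch this earlier, rather than letting NaN propagate into the weights, is to assert that every printed loss term is finite each iteration and stop immediately. A minimal sketch (the loss names here are placeholders, not the repo's variable names):

```python
import numpy as np

def check_finite(step, loss_values, names=None):
    """Raise as soon as any loss term goes NaN/Inf, so the offending iteration
    (and the last good checkpoint) is easy to pin down."""
    names = names or ['loss_%d' % i for i in range(len(loss_values))]
    for name, value in zip(names, loss_values):
        if not np.isfinite(value):
            raise FloatingPointError('iter %d: %s is %r' % (step, name, value))

# e.g. the values printed at iter 29686 above would trigger on regular_loss:
# check_finite(29686, [float('nan'), 5372209.0], names=['regular_loss', 'total_loss'])
```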

@opikalo @Nikasa1889 My cuda version is 8.0 and cudnn is 5.1 already.

Kongsea avatar May 04 '17 09:05 Kongsea

I've gotten the overflow error quite a few times, all without changing anything. It seems the overflow errors occur randomly, possibly caused by poor convergence in the weights. Unfortunately, the trick for now is to simply restart the training and hope it doesn't overflow again; that's working for me so far. @Kongsea, it seems you got pretty lucky reaching 29000+ iters before seeing overflow; my first overflow was at < 1000 iters.

mrlooi avatar May 05 '17 10:05 mrlooi

I would suggest changing the checkpoint interval in train/train.py from 10000 to a smaller value, e.g. 3000, so that you have more checkpoints in case of an overflow error.
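As a rough sketch of what I mean, with `saver` and `sess` standing in for whatever train/train.py actually uses (names assumed, not copied from the repo):

```python
CKPT_EVERY = 3000  # instead of the default 10000

def maybe_checkpoint(saver, sess, step, path='./output/mask_rcnn'):
    """Save a checkpoint every CKPT_EVERY iterations so an overflow/NaN
    costs at most CKPT_EVERY iterations of progress."""
    if step > 0 and step % CKPT_EVERY == 0:
        saver.save(sess, path, global_step=step)  # tf.train.Saver API
```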

mrlooi avatar May 05 '17 10:05 mrlooi

Can you "reproduce" the random occurrence with a certain seed for the random initialization?
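If someone wants to try, a minimal seeding sketch for the TF 1.x API this repo targets could look like the following; note that cuDNN kernels are not fully deterministic, so a fixed seed alone may not reproduce the blow-up exactly:

```python
import random
import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary choice

random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)  # TF 1.x graph-level seed
```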

kevinkit avatar May 05 '17 11:05 kevinkit

I had a similar error. My CUDA is 8.0 and cuDNN is 5.1. I found that I hadn't added the CUDA paths to the environment.

export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH

After adding the paths, the problem was solved.

tianzq avatar May 10 '17 14:05 tianzq

Check my comment here

meetps avatar Feb 19 '18 08:02 meetps