Mask_RCNN
NaN loss
I have changed the Mask RCNN code and annotations so that every annotation also carries a level_id. My project is about flood level estimation, so every annotation has an associated class_id as well as a level_id; the level_id denotes the level of the flood in the image.
I have implemented the level_loss exactly like the class loss defined in mrcnn_class_loss_graph(). I use only some classes of the COCO dataset (person, bus, car, bicycle), have added around 1200 flood images of my own to the dataset, and have also added 2 classes, flood and house, to the annotation file. All annotations from the COCO dataset have been assigned "no level", which is denoted by level_id 1. So in total there are 12 level_ids and 6 class_ids.
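Roughly, the level loss mirrors that function. A minimal sketch of what I mean (mrcnn_level_loss_graph and its inputs target_level_ids / pred_level_logits / active_level_ids are placeholders here, not my exact code):

```python
import tensorflow as tf

def mrcnn_level_loss_graph(target_level_ids, pred_level_logits, active_level_ids):
    """Hypothetical level loss, mirroring mrcnn_class_loss_graph().

    target_level_ids: [batch, num_rois] integer level IDs (zero-padded).
    pred_level_logits: [batch, num_rois, NUM_LEVELS] raw logits.
    active_level_ids: [batch, NUM_LEVELS] 1 for level IDs present in the dataset.
    """
    target_level_ids = tf.cast(target_level_ids, 'int64')

    # Find which level each ROI was predicted as, and whether that level
    # is active in the dataset the image came from.
    pred_level_ids = tf.argmax(pred_level_logits, axis=2)
    pred_active = tf.cast(tf.gather(active_level_ids[0], pred_level_ids), tf.float32)

    # Labels must lie in [0, NUM_LEVELS); out-of-range labels make this op
    # return NaN on GPU (and raise an error on CPU).
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=target_level_ids, logits=pred_level_logits)

    # Erase losses of predictions for inactive levels and average.
    # Note: if no prediction falls in an active level, this is 0/0 -> NaN.
    loss = loss * pred_active
    loss = tf.reduce_sum(loss) / tf.reduce_sum(pred_active)
    return loss
```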
I have also made the required changes in the PythonAPI to support level_ids and the level_loss. After starting training, the level loss turns to nan at around iteration 4500 (sometimes a little earlier), which in turn makes the whole loss nan. Does anyone have an idea why that might be happening?
I have tried lowering LEARNING_RATE down to 0.00001, but I still get nan in level_loss.
I also tried changing the numbering of the level_ids from 0-11 to 1-12, but that had no effect either.
I have attached part of the training loss log below.
@waleedka: Can you suggest where I might be going wrong, or how to debug this to find the cause? Any help would be really appreciated.
Thanks in advance.
784/1000 [======================>.......] - ETA: 4:20 - loss: 1.0241 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1957 - mrcnn_class_loss: 0.1284 - mrcnn_bbox_loss: 0.2787 - mrcnn_mask_loss: 0.2853 - mrcnn_level_loss: 0.1258
785/1000 [======================>.......] - ETA: 4:18 - loss: 1.0237 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1956 - mrcnn_class_loss: 0.1283 - mrcnn_bbox_loss: 0.2787 - mrcnn_mask_loss: 0.2852 - mrcnn_level_loss: 0.1257
786/1000 [======================>.......] - ETA: 4:17 - loss: 1.0238 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1956 - mrcnn_class_loss: 0.1283 - mrcnn_bbox_loss: 0.2787 - mrcnn_mask_loss: 0.2852 - mrcnn_level_loss: 0.1258
787/1000 [======================>.......] - ETA: 4:16 - loss: 1.0233 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1955 - mrcnn_class_loss: 0.1282 - mrcnn_bbox_loss: 0.2786 - mrcnn_mask_loss: 0.2851 - mrcnn_level_loss: 0.1257
788/1000 [======================>.......] - ETA: 4:15 - loss: 1.0226 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1953 - mrcnn_class_loss: 0.1281 - mrcnn_bbox_loss: 0.2784 - mrcnn_mask_loss: 0.2850 - mrcnn_level_loss: 0.1256
789/1000 [======================>.......] - ETA: 4:14 - loss: 1.0225 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1953 - mrcnn_class_loss: 0.1281 - mrcnn_bbox_loss: 0.2783 - mrcnn_mask_loss: 0.2850 - mrcnn_level_loss: 0.1256
790/1000 [======================>.......] - ETA: 4:12 - loss: 1.0221 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1952 - mrcnn_class_loss: 0.1281 - mrcnn_bbox_loss: 0.2782 - mrcnn_mask_loss: 0.2849 - mrcnn_level_loss: 0.1256
791/1000 [======================>.......] - ETA: 4:11 - loss: 1.0220 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1956 - mrcnn_class_loss: 0.1279 - mrcnn_bbox_loss: 0.2781 - mrcnn_mask_loss: 0.2848 - mrcnn_level_loss: 0.1254
792/1000 [======================>.......] - ETA: 4:10 - loss: 1.0219 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1956 - mrcnn_class_loss: 0.1280 - mrcnn_bbox_loss: 0.2780 - mrcnn_mask_loss: 0.2847 - mrcnn_level_loss: 0.1255
793/1000 [======================>.......] - ETA: 4:09 - loss: 1.0220 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1957 - mrcnn_class_loss: 0.1278 - mrcnn_bbox_loss: 0.2781 - mrcnn_mask_loss: 0.2848 - mrcnn_level_loss: 0.1253
794/1000 [======================>.......] - ETA: 4:08 - loss: 1.0214 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1956 - mrcnn_class_loss: 0.1277 - mrcnn_bbox_loss: 0.2779 - mrcnn_mask_loss: 0.2847 - mrcnn_level_loss: 0.1252
795/1000 [======================>.......] - ETA: 4:06 - loss: 1.0217 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1956 - mrcnn_class_loss: 0.1279 - mrcnn_bbox_loss: 0.2778 - mrcnn_mask_loss: 0.2847 - mrcnn_level_loss: 0.1254
796/1000 [======================>.......] - ETA: 4:05 - loss: 1.0218 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1956 - mrcnn_class_loss: 0.1280 - mrcnn_bbox_loss: 0.2778 - mrcnn_mask_loss: 0.2847 - mrcnn_level_loss: 0.1255
797/1000 [======================>.......] - ETA: 4:04 - loss: 1.0226 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1964 - mrcnn_class_loss: 0.1280 - mrcnn_bbox_loss: 0.2778 - mrcnn_mask_loss: 0.2847 - mrcnn_level_loss: 0.1255
798/1000 [======================>.......] - ETA: 4:03 - loss: 1.0217 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1962 - mrcnn_class_loss: 0.1279 - mrcnn_bbox_loss: 0.2775 - mrcnn_mask_loss: 0.2846 - mrcnn_level_loss: 0.1254
799/1000 [======================>.......] - ETA: 4:02 - loss: 1.0217 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1963 - mrcnn_class_loss: 0.1279 - mrcnn_bbox_loss: 0.2775 - mrcnn_mask_loss: 0.2846 - mrcnn_level_loss: 0.1254
800/1000 [=======================>......] - ETA: 4:00 - loss: 1.0219 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1962 - mrcnn_class_loss: 0.1280 - mrcnn_bbox_loss: 0.2775 - mrcnn_mask_loss: 0.2846 - mrcnn_level_loss: 0.1255
801/1000 [=======================>......] - ETA: 3:59 - loss: 1.0227 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1964 - mrcnn_class_loss: 0.1283 - mrcnn_bbox_loss: 0.2775 - mrcnn_mask_loss: 0.2846 - mrcnn_level_loss: 0.1257
802/1000 [=======================>......] - ETA: 3:58 - loss: 1.0225 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1963 - mrcnn_class_loss: 0.1282 - mrcnn_bbox_loss: 0.2775 - mrcnn_mask_loss: 0.2846 - mrcnn_level_loss: 0.1257
803/1000 [=======================>......] - ETA: 3:57 - loss: 1.0229 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1962 - mrcnn_class_loss: 0.1284 - mrcnn_bbox_loss: 0.2775 - mrcnn_mask_loss: 0.2847 - mrcnn_level_loss: 0.1259
804/1000 [=======================>......] - ETA: 3:55 - loss: 1.0221 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1961 - mrcnn_class_loss: 0.1283 - mrcnn_bbox_loss: 0.2772 - mrcnn_mask_loss: 0.2845 - mrcnn_level_loss: 0.1258
805/1000 [=======================>......] - ETA: 3:54 - loss: 1.0218 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1960 - mrcnn_class_loss: 0.1283 - mrcnn_bbox_loss: 0.2771 - mrcnn_mask_loss: 0.2844 - mrcnn_level_loss: 0.1258
806/1000 [=======================>......] - ETA: 3:53 - loss: 1.0209 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1959 - mrcnn_class_loss: 0.1282 - mrcnn_bbox_loss: 0.2768 - mrcnn_mask_loss: 0.2842 - mrcnn_level_loss: 0.1256
807/1000 [=======================>......] - ETA: 3:52 - loss: 1.0203 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1959 - mrcnn_class_loss: 0.1281 - mrcnn_bbox_loss: 0.2767 - mrcnn_mask_loss: 0.2840 - mrcnn_level_loss: 0.1255
808/1000 [=======================>......] - ETA: 3:51 - loss: 1.0193 - rpn_class_loss: 0.0101 - rpn_bbox_loss: 0.1957 - mrcnn_class_loss: 0.1279 - mrcnn_bbox_loss: 0.2764 - mrcnn_mask_loss: 0.2838 - mrcnn_level_loss: 0.1254
809/1000 [=======================>......] - ETA: 3:49 - loss: 1.0191 - rpn_class_loss: 0.0101 - rpn_bbox_loss: 0.1956 - mrcnn_class_loss: 0.1279 - mrcnn_bbox_loss: 0.2764 - mrcnn_mask_loss: 0.2838 - mrcnn_level_loss: 0.1253
810/1000 [=======================>......] - ETA: 3:48 - loss: 1.0196 - rpn_class_loss: 0.0101 - rpn_bbox_loss: 0.1963 - mrcnn_class_loss: 0.1278 - mrcnn_bbox_loss: 0.2763 - mrcnn_mask_loss: 0.2838 - mrcnn_level_loss: 0.1253
811/1000 [=======================>......] - ETA: 3:47 - loss: 1.0195 - rpn_class_loss: 0.0101 - rpn_bbox_loss: 0.1961 - mrcnn_class_loss: 0.1279 - mrcnn_bbox_loss: 0.2762 - mrcnn_mask_loss: 0.2838 - mrcnn_level_loss: 0.1253
812/1000 [=======================>......] - ETA: 3:46 - loss: 1.0190 - rpn_class_loss: 0.0101 - rpn_bbox_loss: 0.1961 - mrcnn_class_loss: 0.1278 - mrcnn_bbox_loss: 0.2761 - mrcnn_mask_loss: 0.2838 - mrcnn_level_loss: 0.1252
813/1000 [=======================>......] - ETA: 3:45 - loss: 1.0194 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1962 - mrcnn_class_loss: 0.1278 - mrcnn_bbox_loss: 0.2761 - mrcnn_mask_loss: 0.2837 - mrcnn_level_loss: 0.1253
814/1000 [=======================>......] - ETA: 3:43 - loss: 1.0188 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1961 - mrcnn_class_loss: 0.1278 - mrcnn_bbox_loss: 0.2759 - mrcnn_mask_loss: 0.2836 - mrcnn_level_loss: 0.1252
815/1000 [=======================>......] - ETA: 3:42 - loss: 1.0191 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1963 - mrcnn_class_loss: 0.1278 - mrcnn_bbox_loss: 0.2760 - mrcnn_mask_loss: 0.2837 - mrcnn_level_loss: 0.1252
816/1000 [=======================>......] - ETA: 3:41 - loss: 1.0193 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1963 - mrcnn_class_loss: 0.1279 - mrcnn_bbox_loss: 0.2759 - mrcnn_mask_loss: 0.2837 - mrcnn_level_loss: 0.1253
817/1000 [=======================>......] - ETA: 3:40 - loss: 1.0195 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1962 - mrcnn_class_loss: 0.1279 - mrcnn_bbox_loss: 0.2761 - mrcnn_mask_loss: 0.2838 - mrcnn_level_loss: 0.1254
818/1000 [=======================>......] - ETA: 3:39 - loss: 1.0190 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1962 - mrcnn_class_loss: 0.1278 - mrcnn_bbox_loss: 0.2759 - mrcnn_mask_loss: 0.2837 - mrcnn_level_loss: 0.1252
819/1000 [=======================>......] - ETA: 3:37 - loss: 1.0183 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1961 - mrcnn_class_loss: 0.1277 - mrcnn_bbox_loss: 0.2756 - mrcnn_mask_loss: 0.2835 - mrcnn_level_loss: 0.1251
820/1000 [=======================>......] - ETA: 3:36 - loss: 1.0183 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1960 - mrcnn_class_loss: 0.1277 - mrcnn_bbox_loss: 0.2757 - mrcnn_mask_loss: 0.2836 - mrcnn_level_loss: 0.1252
821/1000 [=======================>......] - ETA: 3:35 - loss: 1.0181 - rpn_class_loss: 0.0102 - rpn_bbox_loss: 0.1959 - mrcnn_class_loss: 0.1277 - mrcnn_bbox_loss: 0.2755 - mrcnn_mask_loss: 0.2835 - mrcnn_level_loss: 0.1252
822/1000 [=======================>......] - ETA: 3:34 - loss: nan - rpn_class_loss: 0.0103 - rpn_bbox_loss: 0.1961 - mrcnn_class_loss: 0.1280 - mrcnn_bbox_loss: 0.2757 - mrcnn_mask_loss: 0.2837 - mrcnn_level_loss: nan
823/1000 [=======================>......] - ETA: 3:33 - loss: nan - rpn_class_loss: 0.0111 - rpn_bbox_loss: 0.1968 - mrcnn_class_loss: 0.1374 - mrcnn_bbox_loss: 0.2754 - mrcnn_mask_loss: 0.2833 - mrcnn_level_loss: nan
824/1000 [=======================>......] - ETA: 3:31 - loss: nan - rpn_class_loss: 0.0120 - rpn_bbox_loss: 0.1968 - mrcnn_class_loss: 0.1396 - mrcnn_bbox_loss: 0.2750 - mrcnn_mask_loss: 0.2830 - mrcnn_level_loss: nan
825/1000 [=======================>......] - ETA: 3:30 - loss: nan - rpn_class_loss: 0.0128 - rpn_bbox_loss: 0.1977 - mrcnn_class_loss: 0.1418 - mrcnn_bbox_loss: 0.2747 - mrcnn_mask_loss: 0.2826 - mrcnn_level_loss: nan
826/1000 [=======================>......] - ETA: 3:29 - loss: nan - rpn_class_loss: 0.0137 - rpn_bbox_loss: 0.1984 - mrcnn_class_loss: 0.1439 - mrcnn_bbox_loss: 0.2744 - mrcnn_mask_loss: 0.2823 - mrcnn_level_loss: nan
827/1000 [=======================>......] - ETA: 3:28 - loss: nan - rpn_class_loss: 0.0145 - rpn_bbox_loss: 0.1986 - mrcnn_class_loss: 0.1461 - mrcnn_bbox_loss: 0.2740 - mrcnn_mask_loss: 0.2820 - mrcnn_level_loss: nan
828/1000 [=======================>......] - ETA: 3:26 - loss: nan - rpn_class_loss: 0.0153 - rpn_bbox_loss: 0.1995 - mrcnn_class_loss: 0.1483 - mrcnn_bbox_loss: 0.2737 - mrcnn_mask_loss: 0.2816 - mrcnn_level_loss: nan
829/1000 [=======================>......] - ETA: 3:25 - loss: nan - rpn_class_loss: 0.0162 - rpn_bbox_loss: 0.2001 - mrcnn_class_loss: 0.1504 - mrcnn_bbox_loss: 0.2734 - mrcnn_mask_loss: 0.2813 - mrcnn_level_loss: nan
830/1000 [=======================>......] - ETA: 3:24 - loss: nan - rpn_class_loss: 0.0170 - rpn_bbox_loss: 0.2002 - mrcnn_class_loss: 0.1526 - mrcnn_bbox_loss: 0.2730 - mrcnn_mask_loss: 0.2809 - mrcnn_level_loss: nan
831/1000 [=======================>......] - ETA: 3:23 - loss: nan - rpn_class_loss: 0.0178 - rpn_bbox_loss: 0.2005 - mrcnn_class_loss: 0.1548 - mrcnn_bbox_loss: 0.2727 - mrcnn_mask_loss: 0.2806 - mrcnn_level_loss: nan
@priyanka-chaudhary Have you checked that NUM_CLASS in your own config is consistent with the actual number of classes? I had the same issue but solved it by putting the right number there (I had NUM_CLASS=4 instead of NUM_CLASS=7, and when it met class #5, it returned a NaN value).
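In this repo that setting is the NUM_CLASSES field on the Config subclass, and it has to include the background class. A minimal sketch of what I mean (FloodConfig and the class count are made-up examples for the dataset described above):

```python
from mrcnn.config import Config

class FloodConfig(Config):
    """Hypothetical config: the class count must match the annotations."""
    NAME = "flood"
    # 1 for background + 6 foreground classes
    # (person, bus, car, bicycle, flood, house).
    NUM_CLASSES = 1 + 6
```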
@priyanka-chaudhary Hi, have you solved this problem? I just ran into the same problem and don't know the reason.
@Paulito-7: Thank you for the suggestion. It was something else in my case; I have described it below in case you are interested.
@xelmirage: My error was due to the fact that I had numbered my level classes from 1 to 12, but the logits are indexed from 0, so whenever the loss encountered level 12 it produced nan. I would suggest checking your input data if you don't see any implementation error.
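One way to catch that kind of off-by-one before training is to validate the labels up front. A minimal sketch (check_level_ids is just an illustrative helper, not code from my branch):

```python
import numpy as np

NUM_LEVELS = 12  # number of level logits the head predicts

def check_level_ids(level_ids, num_levels=NUM_LEVELS):
    """Fail fast if any label falls outside [0, num_levels).

    tf.nn.sparse_softmax_cross_entropy_with_logits yields NaN on GPU
    (and an error on CPU) for out-of-range labels, so it is easier to
    catch this in the data pipeline than to chase it through the loss.
    """
    level_ids = np.asarray(level_ids)
    bad = (level_ids < 0) | (level_ids >= num_levels)
    if bad.any():
        raise ValueError(
            "level_ids outside [0, %d): %s" % (num_levels, np.unique(level_ids[bad])))

check_level_ids([0, 3, 11])   # OK
# check_level_ids([1, 12])    # raises: 12 is out of range for 12 logits
```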
I also had this problem and I solved it.
As everyone mentioned in different issues raised in this repo, the problem is with the learning rate.
In my case, the original setting in the config file was:
BASE_LR: 0.02 | STEPS: (60000, 80000) | MAX_ITER: 90000
which caused a nan loss after the 3rd iteration! Then I changed it to:
BASE_LR: 0.0025 | STEPS: (480000, 640000) | MAX_ITER: 720000
which comes from dividing the first by 8, and multiplying the other two by 8, as suggested in the readme here.
The default setting is set for 8 GPUs. I have only 2. So, some changes were expected.
However, the above changes pushed the estimated training time (i.e., eta) from 4 days to 41 days! So, to avoid such a long training run, I only changed BASE_LR from 0.02 to 0.01. To evaluate whether this is enough or not, I have to see the loss plot and where it plateaus.
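The arithmetic is just the linear scaling rule applied with the factor of 8 used above; a small sketch with those numbers (scale_schedule is only an illustrative helper):

```python
# Linear scaling: divide BASE_LR by the factor, multiply the schedule by it.
def scale_schedule(base_lr, steps, max_iter, factor=8):
    return base_lr / factor, tuple(s * factor for s in steps), max_iter * factor

print(scale_schedule(0.02, (60000, 80000), 90000))
# -> (0.0025, (480000, 640000), 720000)
```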
Thank you so much!!! It worked for me. I was training at a learning rate of 0.012 and then I reduced it to 0.002 and it worked!! :)
Hi, I have a similar issue and don't know where I'm going wrong.
Env: tensorflow 2.2.2, keras 2.3.1
Repo: https://github.com/ahmedfgad/Mask-RCNN-TF2
Here is what I get from keras_model.fit():
Epoch 1/3
1/6 [====>.........................] - ETA: 11:07 - loss: 121.6802
2/6 [=========>....................] - ETA: 4:27 - loss: 63.0578
3/6 [==============>...............] - ETA: 2:13 - loss: nan
4/6 [===================>..........] - ETA: 1:07 - loss: nan
5/6 [========================>.....] - ETA: 26s - loss: nan
6/6 [==============================] - 139s 23s/step - loss: nan - val_loss: nan
Epoch 2/3
1/6 [====>.........................] - ETA: 1s - loss: nan
2/6 [=========>....................] - ETA: 0s - loss: nan
3/6 [==============>...............] - ETA: 0s - loss: nan
4/6 [===================>..........] - ETA: 0s - loss: nan
5/6 [========================>.....] - ETA: 0s - loss: nan
6/6 [==============================] - 5s 836ms/step - loss: nan - val_loss: nan
I have tried learning rates from 1e-2 to 1e-10. Only rates from 1e-4 to 1e-6 show a numeric loss in the first one or two steps of the first epoch; all other learning rates show nan at every step. I am pretty sure the number of classes is correct: I have 2 classes + 1 background.
If anyone has any idea, please help. Thanks!
DID YOU SOLVE IT?