
Getting loss_bbox = nan while training on Caltech dataset

Open HarisIqbal88 opened this issue 8 years ago • 18 comments

Hi, I am training VGG16 on the Caltech dataset with my own proposals (not using selective search). However, while training I am getting loss_bbox = nan, whereas loss_cls starts somewhere near 2 and drops below 0.005 by the end of training. Any thoughts on what went wrong?

HarisIqbal88 avatar Mar 01 '16 14:03 HarisIqbal88

@HarisIqbal88 Hello! Did you solve this problem? I have the same issue; could you please give me some suggestions? Thank you very much.

xingkongliang avatar May 29 '16 20:05 xingkongliang

Same issue here with my own dataset. Dropping the base_lr did not work. @xingkongliang it seems to be an issue with the input data; training works with some parts of my dataset.

alviur avatar May 30 '16 01:05 alviur

@alviur Hi, did you solve this problem? I have no idea. How do you get the object proposals?

xingkongliang avatar May 30 '16 07:05 xingkongliang

Hi all, yes, I solved the problem. In my case, the issue was the number of proposals per image. So, if reducing the learning rate (base_lr) does not help, you might want to go to the config file and see how many positive and negative proposals are sampled from each image during training. Then check whether your dataset contains an image with fewer positive or negative proposals than the numbers from the config file; there lies your problem. Basically, if some entries for calculating loss_bbox are missing, they are taken as nan, and since loss_bbox is defined as a norm of the difference between bounding boxes, it turns out to be nan.
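
For concreteness, something along these lines can flag problem images before training starts (a rough sketch, assuming the fast-rcnn-style roidb and the usual TRAIN.* config keys; adjust the names to your setup):

    import numpy as np
    from fast_rcnn.config import cfg

    def find_starved_images(roidb):
        # Per-image fg/bg quotas implied by the config
        # (e.g. 128 rois / 2 images * 0.25 fg fraction = 16 fg rois per image).
        rois_per_image = cfg.TRAIN.BATCH_SIZE // cfg.TRAIN.IMS_PER_BATCH
        fg_per_image = int(round(cfg.TRAIN.FG_FRACTION * rois_per_image))
        bg_per_image = rois_per_image - fg_per_image
        bad = []
        for i, entry in enumerate(roidb):
            overlaps = entry['gt_overlaps'].toarray().max(axis=1)
            num_fg = int(np.sum(overlaps >= cfg.TRAIN.FG_THRESH))
            num_bg = int(np.sum((overlaps < cfg.TRAIN.BG_THRESH_HI) &
                                (overlaps >= cfg.TRAIN.BG_THRESH_LO)))
            if num_fg < fg_per_image or num_bg < bg_per_image:
                bad.append((i, num_fg, num_bg))
        return bad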

HarisIqbal88 avatar May 31 '16 14:05 HarisIqbal88

@HarisIqbal88 @alviur Thank you very much, I solved the problem. I found that there were negative numbers (coordinates) in the ground-truth data. The negative numbers produced the nan problem, so I set them to zero and the problem is solved.
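
For anyone with the same symptom, a minimal sketch of that fix (assuming ground-truth boxes are stored as [x1, y1, x2, y2] numpy arrays; clips negative coordinates to zero and keeps boxes inside the image):

    import numpy as np

    def clip_gt_boxes(boxes, width, height):
        # boxes: (N, 4) array of [x1, y1, x2, y2] ground-truth coordinates
        boxes = np.asarray(boxes, dtype=np.float32)
        boxes[:, 0::2] = np.clip(boxes[:, 0::2], 0, width - 1)   # x1, x2
        boxes[:, 1::2] = np.clip(boxes[:, 1::2], 0, height - 1)  # y1, y2
        return boxes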

xingkongliang avatar Jun 02 '16 02:06 xingkongliang

@HarisIqbal88 Thanks for the answer, that makes sense. So did you set the loss to '0' wherever the bounding box is empty? I am guessing you changed smooth_L1_loss_layer.cu? Can you share the modified code? I am new to CUDA.

saiprabhakar avatar Jun 13 '16 20:06 saiprabhakar

@saiprabhakar Well, it does not make sense to feed an image with very few proposals to training. You should either increase the number of proposal boxes by altering some parameters in your proposal-generating algorithm, or ignore the image entirely in the Python interface. There is no need to modify the loss layer (and I did not do so either).

HarisIqbal88 avatar Jun 13 '16 21:06 HarisIqbal88

@HarisIqbal88 Thanks for the quick response. I think I am giving enough proposal boxes. I am also working on the Caltech dataset, using ACF region proposals. I will look into whether this is the problem.

I was thinking the problem is that there aren't enough pedestrian bounding boxes (and a lot of background bounding boxes). What do you think?

saiprabhakar avatar Jun 14 '16 02:06 saiprabhakar

@saiprabhakar By number of proposals, I meant that the positive and negative classes separately have to fulfill the condition mentioned in the config file. As far as Caltech with ACF is concerned, there are some images with absolutely no positive class. You either ignore those images or put in a meaningless dummy bounding box of zero area so that they don't hurt the training process.

HarisIqbal88 avatar Jun 14 '16 10:06 HarisIqbal88

@HarisIqbal88 I see what you mean. I took a closer look and found that the problem is that, in the config file, the threshold for considering a bounding box during training was set to 0.5. When I decreased it to 0.1, the NaN vanished. I think the higher threshold prevented background bounding boxes from being selected.

saiprabhakar avatar Jun 14 '16 15:06 saiprabhakar

@saiprabhakar There are two things to consider here. First, decreasing the threshold to 0.1 may remove the nan from the training error, but that does not mean it is the right thing to do. Decreasing this threshold to 0.1 means that a mere 10% overlap with the ground truth is considered a valid proposal; you have to decide whether that is right for your problem. Remember, the goal is not to magically make nan vanish from the display screen or log file. Second, there are images with absolutely no ground-truth positive classes (i.e., images with no pedestrians) in the Caltech dataset. When those come up for training, your 0.1 trick is bound to fail for them, because there will be no positive bounding boxes for the network to train on. Thus, the only way around this problem is to either remove these images from training or add some dummy bounding boxes to the ground truth (as they are automatically appended to the proposals, you do not need to add them to the proposals as well).
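
A minimal sketch of the first option, filtering images with no positive ground truth out of the training list (load_gt_boxes is a hypothetical helper that returns the (N, 4) pedestrian boxes for one image; substitute whatever your annotation loader provides):

    def filter_empty_images(image_index, load_gt_boxes):
        # Keep only images that contain at least one ground-truth pedestrian box.
        kept = [name for name in image_index if len(load_gt_boxes(name)) > 0]
        print('kept %d of %d images' % (len(kept), len(image_index)))
        return kept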

HarisIqbal88 avatar Jun 15 '16 01:06 HarisIqbal88

@HarisIqbal88 Thanks for the insight, that makes sense. The core problem seems to be a lack of region proposals with sufficient IoU overlap with the ground truth. There seem to be a handful of images with only 1 or 2 RoIs having overlap greater than 0.1 with the ground truth.

Note: I did separate out the images from the dataset that don't have any ground truth (pedestrians) in them.

I am using the ACF model provided with the Caltech dataset (or with Piotr's toolbox, I don't remember). It does not seem to produce good region proposals (in terms of overlap).

Which region proposals did you use? If you used ACF proposals, did you train the detector yourself or use the model they provided? What settings did you use for the region proposal (threshold, calibration)?

saiprabhakar avatar Jun 15 '16 18:06 saiprabhakar

@saiprabhakar I used ACF proposals, but I changed many variables in the code to produce many proposals (typically more than 500 per image). Unfortunately, I can no longer access that code, as it was in my internship account, which is now closed, so I cannot provide those variations. However, the quality of the proposals does not matter much. If you just duplicate already existing proposals with some suitable noise model, it should be good enough. Remember, you are also training your network, so it will take care of that.
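
If anyone wants to try the duplication idea, a rough sketch (jitter existing [x1, y1, x2, y2] proposals with Gaussian noise until a target count is reached; the noise scale is a guess to be tuned):

    import numpy as np

    def pad_proposals(boxes, target=500, jitter_std=4.0, seed=0):
        # boxes: (N, 4) array of [x1, y1, x2, y2] proposals for one image
        rng = np.random.RandomState(seed)
        boxes = np.asarray(boxes, dtype=np.float32)
        if len(boxes) == 0:
            return boxes
        out = [boxes]
        total = len(boxes)
        while total < target:
            noise = rng.normal(scale=jitter_std, size=boxes.shape).astype(np.float32)
            out.append(np.maximum(boxes + noise, 0))  # keep coordinates non-negative
            total += len(boxes)
        return np.vstack(out)[:max(target, len(boxes))]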

HarisIqbal88 avatar Jun 15 '16 21:06 HarisIqbal88

@HarisIqbal88 OK, I see. I changed my threshold to get ~1000 proposals. I want to confirm something: you said you added dummy boxes of zero width, right? Can you give more details on that?

I am guessing you labelled them as ground truth. Did you select them randomly, or were they centered on your ground-truth bounding boxes? Looking at the cost function, I think this will affect the performance of the network, am I right? I am thinking of avoiding it for this reason.

saiprabhakar avatar Jun 15 '16 22:06 saiprabhakar

@HarisIqbal88 Did you also retrain ACF (I think it uses boosting), or did you use the provided model and just change parameters during testing?

saiprabhakar avatar Jun 16 '16 00:06 saiprabhakar

How can we identify which images in our dataset are causing this problem?

abhisheksgumadi avatar Aug 22 '16 16:08 abhisheksgumadi

I don't know the answer to how to identify which image is causing the problem yet; I will look into it.

The problem I had was an internal swapping of the x,y columns of the bounding-box array in either the ground truth or the region proposals (I forget which one), which I guess was correct for the Pascal dataset. For Caltech this should not be done, especially if your data layout already matches between the ground truth and the region proposals at the file stage. This swap decreased the number of positive boxes in the region proposals for me.

When I removed it, training worked just fine, even without adding the dummy boxes suggested earlier.
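
A quick sanity check along these lines (a sketch, assuming boxes are meant to be [x1, y1, x2, y2]) can flag boxes that do not fit inside their image, which is what a swapped x/y layout typically produces on non-square images:

    import numpy as np

    def boxes_fit_image(boxes, width, height):
        # boxes: (N, 4) array read as [x1, y1, x2, y2]; returns False if any
        # coordinate falls outside a width x height image (0-based pixels)
        boxes = np.asarray(boxes)
        ok_x = (boxes[:, 0] >= 0).all() and (boxes[:, 2] < width).all()
        ok_y = (boxes[:, 1] >= 0).all() and (boxes[:, 3] < height).all()
        return bool(ok_x and ok_y)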

saiprabhakar avatar Aug 22 '16 20:08 saiprabhakar

I have solved this problem! In my case, there were annotation errors in my dataset. I'm training my network with the RSOD dataset, which is an aerial image dataset, and I hit the same problem. It surfaces with a warning, "RuntimeWarning: invalid value encountered in log", at targets_dw = np.log(gt_heights / ex_heights). I found that some ex_heights < 0, which means there are some bad "y" annotations. Unfortunately, no assertion on y is made in the original code, so I added some lines to lib/datasets/imdb.py to check for bad "y" annotations:

    def _get_heights(self):
        return [PIL.Image.open(self.image_path_at(i)).size[1]
                for i in xrange(self.num_images)]

    def append_flipped_images(self):
        num_images = self.num_images
        widths = self._get_widths()
        heights = self._get_heights()  # added to get image heights
        for i in xrange(num_images):
            boxes = self.roidb[i]['boxes'].copy()
            oldx1 = boxes[:, 0].copy()
            oldx2 = boxes[:, 2].copy()
            # print the image name so the bad annotation can be located
            print self.image_index[i]
            # assert ymin <= ymax
            assert (boxes[:, 1] <= boxes[:, 3]).all()
            # assert ymin >= 0 (0-based)
            assert (boxes[:, 1] >= 0).all()
            # assert ymax < image height (0-based)
            assert (boxes[:, 3] < heights[i]).all()
            # assert xmax < image width (0-based)
            assert (oldx2 < widths[i]).all()
            # assert xmin >= 0 (0-based)
            assert (oldx1 >= 0).all()
            # assert xmax >= xmin (0-based)
            assert (oldx2 >= oldx1).all()
            boxes[:, 0] = widths[i] - oldx2 - 1
            boxes[:, 2] = widths[i] - oldx1 - 1
            assert (boxes[:, 2] >= boxes[:, 0]).all()
            entry = {'boxes': boxes,
                     'gt_overlaps': self.roidb[i]['gt_overlaps'],
                     'gt_classes': self.roidb[i]['gt_classes'],
                     'flipped': True}
            self.roidb.append(entry)
        self._image_index = self._image_index * 2

Run the training code again and it stops with an AssertionError at the wrong annotation; the 'print self.image_index[i]' line helps locate the image. Correct the annotation of that image and remove 'py-faster-rcnn/data/cache' so the next run starts from the very beginning. Repeat until all wrong annotations are found and corrected. In the end, it worked just fine!

starxhong avatar Feb 01 '18 12:02 starxhong