
Loss is zero or NaN during training on another dataset.

Open warmspringwinds opened this issue 8 years ago • 24 comments

Hello.

While training your network on a different dataset, I get zero loss or NaN after the first couple of iterations.

Do you know why it happens?

I also don't understand how the ground truth in your dataset is marked: I notice values between 0 and 255. Does a bigger number mean a stronger edge? Is it OK to use these labels with a sigmoid cross-entropy loss?

How should I label my own dataset?
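A sigmoid cross-entropy loss expects targets in the range [0, 1], so 8-bit label maps are usually rescaled before use. The sketch below shows one possible preprocessing step. It is an assumption, not the HED authors' procedure, and the function name and `threshold` parameter are hypothetical.

```python
def normalize_labels(pixels, threshold=None):
    """Map 8-bit label values (0..255) to targets in [0, 1].

    With threshold=None the result keeps soft targets (value / 255).
    Otherwise each target is binarized: 1.0 if value / 255 > threshold, else 0.0.
    """
    out = []
    for v in pixels:
        if not 0 <= v <= 255:
            raise ValueError(f"label value out of range: {v}")
        t = v / 255.0
        if threshold is not None:
            t = 1.0 if t > threshold else 0.0
        out.append(t)
    return out


row = [0, 64, 128, 255]          # one row of a hypothetical label map
soft = normalize_labels(row)     # soft targets in [0, 1]
hard = normalize_labels(row, threshold=0.5)  # binary targets
```

Whether soft or binary targets work better depends on how the annotations were produced, so treat the threshold as something to experiment with.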

Thank you.

warmspringwinds avatar Mar 23 '16 14:03 warmspringwinds

I also get this problem with my own dataset. The only way I could stop the loss from exploding was to use learning rates of 1*10e-8, but I feel like this is getting way too small. I'd be interested to know what other people had to do to get this working on their own datasets.
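A note on notation for anyone copying this value into a solver file: in floating-point literal syntax, `10e-8` means 10 × 10^-8, which is 1e-7 and not 1e-8. Which of the two the poster actually used is not stated, so the snippet below only illustrates the difference.

```python
lr_as_written = 1 * 10e-8   # parsed as 10 x 10^-8
lr_power_form = 1e-8        # 10^-8, ten times smaller

print(lr_as_written)
print(lr_power_form)
print(lr_as_written == 1e-7)
```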

extragoya avatar May 05 '16 17:05 extragoya

@extragoya I also solved the problem by decreasing the learning rate. Seemed to make it work.

warmspringwinds avatar May 05 '16 17:05 warmspringwinds

A smaller learning rate may help.

zeakey avatar May 12 '16 05:05 zeakey

Hi, I also use a learning rate of 1*10e-8, but the loss value is very big, even twenty thousand. What is your learning rate? Thank you.

cchenzhou avatar Feb 24 '17 03:02 cchenzhou

@cchenzhou - Don't worry about bigger loss value. It means nothing from my experience. The results should show up as expected. Just make sure that loss comes down slowly.

codecolony avatar Feb 24 '17 06:02 codecolony

@codecolony Thank you very much. Training stops when the iteration count reaches 100000, but the loss value is 22758.3. I can't understand why. Any idea how to fix this?

cchenzhou avatar Feb 24 '17 08:02 cchenzhou

@brisker It clearly tells you your ground-truth size doesn't match corresponding image size, because ImageLabelMapDataLayer cannot open the image file. You may need to check whether the image exists.

zeakey avatar Mar 06 '17 15:03 zeakey

@brisker Notice the error says: "Could not open or find file ../../data/HED-BSDS/train/aug_gt_scale_0.5/157.5_1_0/159045.png" - the problem is it can't find an image in your training list. The image_labelmap_data_layer will keep running, however, and then throw the final error when the sizes don't match. I would avoid the relative paths and make sure there are no spaces or anything else funny in your training list.

extragoya avatar Mar 06 '17 15:03 extragoya

@brisker For one reason or the other, it's unable to open the file. I would assume that something is wrong with the path - what do you have as root_folder under image_data_param in your train_val.prototxt?

extragoya avatar Mar 06 '17 15:03 extragoya

@extragoya At the beginning, it was "../../data/HED-BSDS/", and the error occurred. Then I replaced it with "/home/jcc/code/hed/data/HED-BSDS/". Still an error...

brisker avatar Mar 06 '17 15:03 brisker

@extragoya I think the error clearly shows that the code has found train_pair.lst, so why can't it find the images?

brisker avatar Mar 06 '17 15:03 brisker

@brisker The path to train_pair.lst is given in your train_val prototxt file, whereas the paths to the images are given by train_pair.lst together with your root_folder, so the path to the list can be correct while the paths to the images are incorrect. Also, train_pair.lst is a text file, so it is opened differently than an image. Do you have OpenCV installed, and did you turn OpenCV on in your Makefile.config?
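A quick way to test this is to resolve every entry of the list against root_folder and report the files that do not exist. The sketch below assumes that each line of the list holds whitespace-separated relative paths, and that the data layer joins root_folder and each path by plain string concatenation, as Caffe's stock ImageData layer does. The function name is hypothetical.

```python
import os


def find_missing_files(list_path, root_folder):
    """Return (line_number, full_path) for each listed file that does not exist."""
    missing = []
    with open(list_path) as f:
        for line_number, line in enumerate(f, start=1):
            # Assumed format: whitespace-separated relative paths on each line.
            for rel_path in line.split():
                # Plain concatenation: a missing trailing slash on root_folder
                # silently produces a wrong path.
                full_path = root_folder + rel_path
                if not os.path.isfile(full_path):
                    missing.append((line_number, full_path))
    return missing
```

For example, `find_missing_files("train_pair.lst", "/home/jcc/code/hed/data/HED-BSDS/")` would list every unresolved path. Because the sketch splits on whitespace, a path containing a space shows up as two bogus entries, which is one way stray spaces in a training list can surface.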

extragoya avatar Mar 06 '17 15:03 extragoya

@brisker Could you please describe your directory layout, including where you run solve.py (by default you should run solve.py in hed/examples/hed/), where you put your dataset, and an example line from your train_pair.lst?

Furthermore, please check the permissions on the dataset. Do you have read permission?

zeakey avatar Mar 06 '17 15:03 zeakey

@extragoya I have OpenCV 3 installed, but I cannot see any flag for turning OpenCV on in the Makefile.config here: https://github.com/s9xie/hed/blob/master/Makefile.config.example I think I got OpenCV compiled correctly with Caffe.

brisker avatar Mar 06 '17 15:03 brisker

@cchenzhou In my experience the loss value matters little: it stays large even after the model has converged. I suggest testing your model on a validation set. Are you using your own dataset? Note the positive/negative ratio, because the SoftmaxLoss in HED counts positives and negatives and adjusts for their ratio.
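The class balancing mentioned here can be sketched as a weighted sigmoid cross-entropy in which positive (edge) pixels are weighted by the fraction of negative pixels and vice versa. This is a minimal illustration, not the HED layer itself. It assumes binary targets and a loss that is summed over pixels rather than averaged, and the function name is hypothetical.

```python
import math


def class_balanced_bce(logits, targets):
    """Class-balanced sigmoid cross-entropy, summed over all pixels."""
    count_pos = sum(1 for t in targets if t == 1)
    count_neg = len(targets) - count_pos
    beta = count_neg / (count_pos + count_neg)  # weight for positive pixels
    loss = 0.0
    for x, t in zip(logits, targets):
        # Numerically stable softplus terms.
        log_term = math.log1p(math.exp(-abs(x)))
        neg_log_sig = max(-x, 0.0) + log_term       # -log(sigmoid(x))
        neg_log_one_minus = max(x, 0.0) + log_term  # -log(1 - sigmoid(x))
        if t == 1:
            loss += beta * neg_log_sig
        else:
            loss += (1.0 - beta) * neg_log_one_minus
    return loss


small = class_balanced_bce([0.0] * 4, [1, 0, 0, 0])
large = class_balanced_bce([0.0] * 8, [1, 0, 0, 0, 1, 0, 0, 0])
```

Under these assumptions a summed loss grows with the number of pixels (here `large` is twice `small`), which is one plausible reason an absolute value in the thousands need not indicate divergence.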

zeakey avatar Mar 06 '17 15:03 zeakey

@zeakey I run solve.py in hed/examples/hed/, and the data folder is in /home/jcc/code/hed, which is the root folder. A line in train.lst looks like train/aug_gt_scale_0.5/157.5_1_0/159045.png

brisker avatar Mar 06 '17 15:03 brisker

@brisker OK, later versions of Caffe have an option to turn OpenCV off (USE_OPENCV) and also to specify which version you're using (OPENCV_VERSION): https://github.com/BVLC/caffe/blob/master/Makefile.config.example. It may not apply to HED's version of Caffe, but that is why I asked.

extragoya avatar Mar 06 '17 15:03 extragoya

@extragoya @zeakey Thanks a lot for your replies! No other advice?

brisker avatar Mar 06 '17 16:03 brisker

Set a breakpoint and step through the code to see the details.


zeakey avatar Mar 06 '17 16:03 zeakey

@zeakey @brisker Agreed. I would add that you should try to determine whether the problem is that the code cannot find the image, or that it can find the image but cannot load it.
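One rough way to separate the two cases without a debugger is to check each suspect path for existence, readability, and a valid PNG signature. This is only a sketch with a hypothetical function name. A file that passes can still fail to decode in OpenCV, so it narrows the problem down but does not settle it.

```python
import os

# The first 8 bytes of every valid PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def diagnose_png(path):
    """Classify a path as 'missing', 'unreadable', 'not_png', or 'ok'."""
    if not os.path.isfile(path):
        return "missing"       # the "cannot find" case
    try:
        with open(path, "rb") as f:
            header = f.read(len(PNG_SIGNATURE))
    except OSError:
        return "unreadable"    # found, but e.g. no read permission
    if header != PNG_SIGNATURE:
        return "not_png"       # found and readable, but not a PNG
    return "ok"
```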

extragoya avatar Mar 06 '17 16:03 extragoya

@zeakey I have tested my model on the test set, but the result is very bad. I can't see any contours in the output image, so I think my model has not converged. I use their BSDS500 data and only changed the learning rate. Could you tell me which parameters should be changed, or how to adjust the positive/negative ratio? Thank you!

cchenzhou avatar Mar 07 '17 13:03 cchenzhou

@cchenzhou - What is the learning rate you're using?

codecolony avatar Mar 19 '17 10:03 codecolony

@codecolony @zeakey @extragoya Hi, the modified SigmoidCrossEntropy Loss layer has a line that reads:

bottom_diff[i * dim + j] *= 1 * count_neg / (count_pos + count_neg);

why not

 bottom_diff[i * dim + j] = 1 * count_neg / (count_pos + count_neg);

? It seems that the gradients are multiplied on every iteration of the loop, which is a little confusing to me. Why did the author write it like that?
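One possible reading of the quoted line, sketched in Python: if `bottom_diff` already holds the unweighted gradient sigmoid(x) - target before the loop runs (as in Caffe's stock sigmoid cross-entropy layer), then `*=` scales each element once by its class weight, because every index is visited exactly once. Plain `=` would overwrite the gradient with a constant. The sketch rests on that assumption and on the further assumption that negative pixels are scaled by count_pos / (count_pos + count_neg). The function names are hypothetical.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def balanced_gradient(logits, targets):
    # Step 1: unweighted gradient of sigmoid cross-entropy.
    diff = [sigmoid(x) - t for x, t in zip(logits, targets)]
    count_pos = sum(1 for t in targets if t == 1)
    count_neg = len(targets) - count_pos
    total = count_pos + count_neg
    # Step 2: scale each entry once, in place, by its class weight.
    for j, t in enumerate(targets):
        if t == 1:
            diff[j] *= count_neg / total
        else:
            diff[j] *= count_pos / total
    return diff


grad = balanced_gradient([0.0, 0.0, 0.0, 0.0], [1, 0, 0, 0])
```

With one positive among four pixels, the positive entry is scaled by 3/4 and each negative entry by 1/4, so the rare class contributes a gradient comparable to the common one.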

brisker avatar Mar 19 '17 10:03 brisker

@cchenzhou Hi, my model can't reach convergence either. How about yours? My label map is a single-channel binary image (0-255). Do you have any advice? Thank you!

dxytz avatar Mar 26 '17 08:03 dxytz