hed
Loss is zero or nan during training on another dataset.
Hello.
While training your network on a different dataset, I get zero loss or nan after the first couple of iterations.
Do you know why it happens?
I also don't understand how the ground truth in your dataset is marked: I notice values between 0 and 255. Does a bigger number mean a stronger edge? Is it OK to use these labels with sigmoid cross-entropy loss?
How should I label my own dataset?
Thank you.
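A note on the label question above: sigmoid cross-entropy assumes targets in the range [0, 1], so raw 0-255 values used directly as targets are one plausible source of exploding or nan loss. The thread does not say whether HED's data layer already rescales labels, so the following is only a minimal sketch of the idea, assuming 8-bit grayscale labels held as nested lists. The function name `scale_labels` and the `threshold` parameter are hypothetical, not part of HED.

```python
def scale_labels(label_map, threshold=None):
    """Map 8-bit label values (0..255) into [0, 1] targets.

    label_map: list of rows, each a list of ints in 0..255.
    threshold: optional value in 0..1; if given, scaled values above it
               become 1.0 and everything else becomes 0.0.
    """
    scaled = [[v / 255.0 for v in row] for row in label_map]
    if threshold is None:
        return scaled  # soft targets: a larger value means a stronger edge
    return [[1.0 if v > threshold else 0.0 for v in row] for row in scaled]


labels = [[0, 128, 255], [255, 0, 64]]
soft = scale_labels(labels)                  # values stay graded in [0, 1]
hard = scale_labels(labels, threshold=0.5)   # strictly binary targets
print(hard)
```

Whether soft or binary targets are more appropriate depends on how the annotations were produced; the threshold of 0.5 here is only an illustration.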
I also get this problem with my own dataset. The only way I could get the loss to stop exploding was to use learning rates of 1*10e-8, but I feel like this is getting way too small. I'd be interested to know what other people had to do to get this working on their own datasets.
@extragoya I also solved the problem by decreasing the learning rate. Seemed to make it work.
A smaller learning rate may help.
Hi, I also use a learning rate of 1*10e-8, but the loss value is very big, even twenty thousand. What is your learning rate? Thank you.
@cchenzhou - Don't worry about bigger loss value. It means nothing from my experience. The results should show up as expected. Just make sure that loss comes down slowly.
@codecolony Thank you very much. Training stops when the iteration count reaches 100000, but the loss value is still 22758.3. I can't understand why. Any idea how to fix this?
@brisker It clearly tells you your ground-truth size doesn't match corresponding image size, because ImageLabelMapDataLayer cannot open the image file. You may need to check whether the image exists.
@brisker Notice the error says: "Could not open or find file ../../data/HED-BSDS/train/aug_gt_scale_0.5/157.5_1_0/159045.png" - the problem is it can't find an image in your training list. The image_labelmap_data_layer will keep running, however, and then throw the final error when the sizes don't match. I would avoid the relative paths and make sure there are no spaces or anything else funny in your training list.
@brisker For one reason or the other, it's unable to open the file. I would assume that something is wrong with the path - what do you have as root_folder under image_data_param in your train_val.prototxt?
@extragoya At the beginning it was "../../data/HED-BSDS/", and the error occurred. Then I replaced it with "/home/jcc/code/hed/data/HED-BSDS/". Still the same error...
@extragoya I think the error clearly shows that the code has found train_pair.lst, so why can't it find the images?
@brisker The path to train_pair.lst is given in your train_val.prototxt file, whereas the paths to the images are built from the entries in train_pair.lst and your root_folder. So the path to the list can be correct while the paths to the images are incorrect. Also, train_pair.lst is a text file, so it is opened differently than an image. Do you have opencv installed, and did you turn opencv on in your Makefile.config?
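The path check described above can be sketched in a few lines of Python. This assumes that each line of the list holds whitespace-separated relative paths (image, then label) and that the full path is formed by plain string concatenation of root_folder and the list entry, as stock Caffe's ImageDataLayer does; neither is confirmed for HED's layer in this thread. The helper name `find_missing` and the demo file names are made up.

```python
import os
import tempfile


def find_missing(list_path, root_folder):
    """Return (line_number, full_path) for every listed file that does not exist."""
    missing = []
    with open(list_path) as f:
        for lineno, line in enumerate(f, start=1):
            # Assumption: entries are separated by whitespace, so a path
            # containing a space would be split into two wrong paths.
            for rel in line.split():
                # Assumption: plain concatenation, so a missing trailing
                # slash on root_folder produces a wrong path.
                full = root_folder + rel
                if not os.path.isfile(full):
                    missing.append((lineno, full))
    return missing


# Self-contained demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = tmp + os.sep
    open(os.path.join(tmp, "a.jpg"), "w").close()   # the image exists
    list_path = os.path.join(tmp, "train_pair.lst")
    with open(list_path, "w") as f:
        f.write("a.jpg a_gt.png\n")                  # the label does not
    demo_missing = find_missing(list_path, root)

print(demo_missing)
```

On a real setup, one would call `find_missing` with the list path and the root_folder value from train_val.prototxt, using absolute paths to rule out working-directory problems.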
@brisker Could you please describe your directory layout, including where you run solve.py (by default you should run solve.py in hed/examples/hed/), where you put your dataset, and an example line from your train_pair.lst? Furthermore, please check the permissions on the dataset: do you have read permission?
@extragoya I have opencv 3 installed, but I cannot see any flag for turning opencv on in the Makefile.config here: https://github.com/s9xie/hed/blob/master/Makefile.config.example I think opencv was compiled correctly with caffe.
@cchenzhou In my experience the loss value matters little: it is still big even when the model has converged. I suggest you test your model on a validation set. Do you use your own dataset? Note the positive/negative ratio, because SoftmaxLoss in HED adjusts for the positive/negative ratio.
@zeakey I run solve.py in hed/examples/hed/, and the data folder is in /home/jcc/code/hed, just the root folder. A line of train.lst looks like train/aug_gt_scale_0.5/157.5_1_0/159045.png
@brisker Ok, later versions of caffe have an option to turn OPEN_CV off and also to specify what version you're using: https://github.com/BVLC/caffe/blob/master/Makefile.config.example. It may not apply with HED's version of caffe, but that is why I asked.
@extragoya @zeakey Thanks a lot for your replies! No other advice?
Set a breakpoint and step through the code to see the details.
@zeakey @brisker Agreed. I would add that you should try to determine if the problem is whether the code cannot find the image, or can find the image but cannot load it.
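A rough way to separate "cannot find" from "cannot load" outside of Caffe is sketched below. It only looks at existence, read permission, and the 8-byte PNG signature; real decoding is done by OpenCV inside the data layer, so a file that passes this check could still fail there. The function name `diagnose` and the demo file names are hypothetical.

```python
import os
import tempfile

# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def diagnose(path):
    """Classify a path as missing, unreadable, not a PNG, or plausibly fine."""
    if not os.path.isfile(path):
        return "missing"
    if not os.access(path, os.R_OK):
        return "no read permission"
    with open(path, "rb") as f:
        header = f.read(len(PNG_SIGNATURE))
    if header != PNG_SIGNATURE:
        return "found, but no PNG signature"
    return "found, PNG signature ok"


# Self-contained demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.png")
    bad = os.path.join(tmp, "bad.png")
    with open(good, "wb") as f:
        f.write(PNG_SIGNATURE + b"fake payload")  # header only, not a real image
    with open(bad, "wb") as f:
        f.write(b"not an image")
    results = [diagnose(good), diagnose(bad), diagnose(os.path.join(tmp, "absent.png"))]

print(results)
```

Running `diagnose` on the exact path from the error message, as an absolute path, would show which of the cases applies.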
@zeakey I have tested my model on the test set, but the result is very bad. I can't see any contours in the output image, so I think my model has not converged. I use their BSDS500 data and only changed the learning rate. Could you tell me which parameters should be changed, or whether I should adjust the positive/negative ratio? Thank you!
@cchenzhou - What is the learning rate you're using?
@codecolony @zeakey @extragoya Hi, the modified SigmoidCrossEntropy Loss layer has a line that reads:
bottom_diff[i * dim + j] *= 1 * count_neg / (count_pos + count_neg);
why not
bottom_diff[i * dim + j] = 1 * count_neg / (count_pos + count_neg);
? It seems that the gradients are multiplied on every iteration of the loops. It is a little confusing to me. Why did the author write it like that?
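One possible reading of the `*=` question, sketched in Python: if bottom_diff already holds the unweighted gradient sigmoid(x) - target (as in stock Caffe's sigmoid cross-entropy layer) and each index i * dim + j is visited once per backward pass, then `*=` rescales that gradient by a class weight, whereas `=` would throw the gradient away and keep only the weight. Both conditions are assumptions about the HED code, and the weight used for negative pixels below (count_pos / total) is also assumed, since only the positive-pixel line is quoted in this thread.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def weighted_gradient(logits, targets):
    """Class-balanced gradient for one sample, mirroring the quoted C++ line."""
    count_pos = sum(1 for t in targets if t == 1)
    count_neg = sum(1 for t in targets if t == 0)
    total = count_pos + count_neg

    # Unweighted gradient, assumed to be in bottom_diff before the loop runs.
    diff = [sigmoid(x) - t for x, t in zip(logits, targets)]

    for j, t in enumerate(targets):
        if t == 1:
            diff[j] *= count_neg / total   # rare positives get the larger weight
        elif t == 0:
            diff[j] *= count_pos / total   # assumed counterpart for negatives
    return diff


# One edge pixel among four, all logits at zero so sigmoid(x) is 0.5.
grad = weighted_gradient([0.0, 0.0, 0.0, 0.0], [1, 0, 0, 0])
print(grad)
```

With plain assignment, the positive pixel would receive 0.75 regardless of the prediction, so the sign and size of the error would be lost. Each element is scaled once per call here, so nothing compounds across elements.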
@cchenzhou Hi, my model can't reach convergence either. How about yours? My label map is a single-channel binary image (values 0 and 255). Do you have any advice? Thank you!