Training loss

maolin23 opened this issue 8 years ago · 16 comments

Hi,

Could you tell me about the loss during training? When I use your code to train the front-end model, my loss is about 2 to 3 in the first 15 iterations. After that, the loss increases to 50~80 and stays there; after 20K iterations it is still about 60~80. I'm not sure whether this is correct. Could you tell me whether this behavior is normal, and what loss I should expect? (My training/testing input images are unmodified and I haven't changed anything in train.py.)

Thanks a lot, Mao

maolin23 avatar Jul 22 '16 09:07 maolin23

@maolin23 Have you solved the problem? I have also run into it and have tried several different batch_size and iter_size values. Sometimes the loss behaves as you describe and sometimes it behaves normally. More specifically, when iter_size is 1 the loss usually behaves normally, and when iter_size is larger it almost always behaves abnormally.

lhao0301 avatar Aug 25 '16 13:08 lhao0301

The loss in the initial stage should be around 3.0 for a 19-category classification problem. If you observe something much bigger than that, it probably means the optimization has diverged. It is hard to diagnose the exact problem without more information, but if you are using the parameters and datasets described in the dilation paper, this is unlikely to happen.

fyu avatar Aug 26 '16 00:08 fyu
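
For reference, the "around 3.0" figure follows from a softmax that starts out roughly uniform over the classes: the expected initial cross-entropy for K classes is -ln(1/K). A minimal check in Python, using the 19 classes mentioned above:

```python
# Expected initial cross-entropy loss when the softmax output is roughly
# uniform over the classes: -ln(1/K).
import math

num_classes = 19
initial_loss = -math.log(1.0 / num_classes)
print(round(initial_loss, 2))  # ~2.94, i.e. "around 3.0"
```

If the loss sits an order of magnitude above this right from the start, the optimization has most likely diverged rather than merely started badly.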

@fyu I train the frontend net from vgg_conv.caffemodel as the initialization and only change batch_size to 8 because of limited GPU memory. It still diverges sometimes.

lhao0301 avatar Aug 26 '16 04:08 lhao0301

I got the same problem with batch size 8, but it is better with batch size 7. Why does such a subtle change in batch size make such a big difference?

jgong5 avatar Jan 09 '17 06:01 jgong5

I just added an option to set iter_size in the training options: https://github.com/fyu/dilation/blob/master/train.py#L233. If your GPU doesn't have enough memory and you have to decrease the batch size, you can try to increase iter_size.

fyu avatar Jan 09 '17 08:01 fyu
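
Caffe's iter_size accumulates gradients over several forward/backward passes before each weight update, so the effective batch size is batch_size * iter_size. A small sketch of the trade-off (the helper below is hypothetical and the target of 14 is only an example value, not necessarily the setting used in the paper):

```python
# Hypothetical helper: pick a (batch_size, iter_size) pair whose product
# equals a desired effective batch size, given a per-pass memory limit.
def split_batch(target_effective_batch, max_batch_per_pass):
    for batch_size in range(min(max_batch_per_pass, target_effective_batch), 0, -1):
        if target_effective_batch % batch_size == 0:
            return batch_size, target_effective_batch // batch_size
    return 1, target_effective_batch

# e.g. an effective batch of 14 with at most 8 images per forward/backward pass:
print(split_batch(14, 8))  # (7, 2): 7 images per pass, gradients accumulated twice
```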

@maolin23 @jgong5 @fyu Have you solved the problem? I also ran into it. After I changed the final layer to 'xavier' initialization, the loss looks better, but I have not finished my training run yet.

austingg avatar Mar 01 '17 10:03 austingg

No. I eventually gave up and switched to Berkeley's FCN. Do you mean your change gets the network to converge eventually?

jgong5 avatar Mar 01 '17 14:03 jgong5

@jgong5 Unfortunately, the loss gets bigger after 200 iterations with batch size 6.

austingg avatar Mar 02 '17 08:03 austingg

@jgong5 @fyu After I initialized the net with vgg_conv and used 'xavier' initialization for the weights, the training loss looks better and goes down as the iterations increase; after 30k iterations the training loss is about 10^-6. However, the test accuracy is always -nan, and the test results are all black. I train on my own custom dataset.

austingg avatar Mar 03 '17 07:03 austingg

@maolin23 @TX2012LH @jgong5 @austingg Same problem as you guys. I just trained the net on VOC07 (fewer than 500 images); it's quite weird that the net fails to converge since the dataset is so small. However, it seems that the author has stopped offering support now.

huangh12 avatar Apr 21 '17 10:04 huangh12

Hi @fyu, I have run the training of the VGG front-end model following the documentation you provided. However, the loss diverges very early, as you can see in this log. I have cross-checked the hyper-parameters mentioned in the paper against the ones in the documentation and they match. The same divergence issue shows up in joint training.

I am running your code using cuda-8.0 and cudnn-5. Can you kindly run your demo from scratch and tell us where the issue might be? A lot of people here seem to be facing the same issue.

Thanks!

ice-pice avatar Jun 12 '17 10:06 ice-pice

Is the label one channel or three (RGB) channels?

xingbotao avatar Jun 26 '17 07:06 xingbotao

one channel

fyu avatar Jun 26 '17 08:06 fyu
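
A quick way to verify this on your own data (a sketch assuming Pillow and NumPy are available; the file name is just a placeholder): the label should load as a single-channel (H, W) array of class indices, not an (H, W, 3) RGB array.

```python
import numpy as np
from PIL import Image

label = np.array(Image.open("example_label.png"))  # placeholder path
print(label.shape, label.dtype)  # expect something like (500, 500) uint8
print(np.unique(label))          # expect small class indices (and possibly 255 for ignored pixels, as in PASCAL VOC)
```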

Hi @fyu. Thank you for your excellent code. I ran into a problem: when I use the trained models (loss near 2) and test_net.txt (frontend or joint) to run prediction on an image, the resulting image is always black, with nothing in it.
Is there anything I need to do before prediction? Thanks in advance.

HXKwindwizard avatar Jul 25 '17 11:07 HXKwindwizard

@HXKwindwizard If the loss is 2, that is a bad sign; it means the model is not working properly. Probably your data is too different from what the model was trained on. Training the model on your own data may solve the problem.

fyu avatar Jul 25 '17 11:07 fyu
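
A completely black result usually means every pixel was assigned the same class index (typically 0, the background), which the color palette renders as black. A quick sanity check on the predicted label map, however it was produced (a sketch; the all-zeros array is just a stand-in for a real prediction):

```python
import numpy as np

def diagnose(pred):
    """Print the class histogram of a predicted (H, W) label map."""
    classes, counts = np.unique(pred, return_counts=True)
    print(dict(zip(classes.tolist(), counts.tolist())))
    if classes.size == 1:
        print("Every pixel got the same class; the rendered output will be a single color.")

diagnose(np.zeros((500, 500), dtype=np.uint8))  # degenerate example: all background
```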

@fyu Thanks for the reminder. I use the PASCAL VOC dataset and fine-tune from the VGG model you suggested. I have done several trainings on this data. Even when the loss is around 10, the situation I described above still occurs. So I wonder: even if the training is not good, should the predicted image really be completely black? When I use your trained model for prediction, the result is quite good. Could this be related to the network structure? (I use the test net as the prototxt.)

HXKwindwizard avatar Jul 25 '17 11:07 HXKwindwizard