
Loss is not decreasing

Open lucasjinreal opened this issue 6 years ago • 19 comments

I have trained SSD with MobileNetV2 on VOC, but after almost 500 epochs the loss still looks like this:

517/518 in 0.154s [##########] | loc_loss: 1.4773 cls_loss: 2.3165

==>Train: || Total_time: 79.676s || loc_loss: 1.1118 conf_loss: 2.3807 || lr: 0.000721

Wrote snapshot to: ./experiments/models/ssd_mobilenet_v2_voc/ssd_lite_mobilenet_v2_voc_epoch_525.pth
Epoch 526/1300:
0/518 in 0.193s [----------] | loc_loss: 0.8291 cls_loss: 1.9464
1/518 in 0.186s [----------] | loc_loss: 1.3181 cls_loss: 2.5404
2/518 in 0.184s [----------] | loc_loss: 1.0371 cls_loss: 2.2243

It doesn't change and the loss is very high... What's the problem with the implementation?

lucasjinreal avatar Nov 07 '18 09:11 lucasjinreal

Did you load the pre-trained weights? It works fine with my dataset.

1453042287 avatar Nov 23 '18 02:11 1453042287

Or maybe you didn't set the mode to 'train' instead of 'test' in the config file.

1453042287 avatar Nov 28 '18 02:11 1453042287

@jinfagang Have you solved the problem? I have the same issue.

@1453042287 I trained the yolov2-mobilenet-v2 from scratch. You mentioned a 'pre-trained model'; do you mean the pre-trained backbone network model (such as MobileNetV2), or both the backbone model and the detection model? In my training, none of the parameters are pre-trained.

blueardour avatar Dec 03 '18 07:12 blueardour

@blueardour First, make sure you change PHASE in the .yml file to 'train'. Then, actually, I believe it's inappropriate to train this model from scratch, so at the very least you should load the pre-trained backbone. I just use the whole pre-trained weight file the author provided (including the backbone, the extras, and so on), but I set RESUME_SCOPE in the .yml file to 'base' only, and the result is almost the same as fine-tuning's. The relevant fields are sketched below.
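A minimal sketch of the fields involved, using the key names that appear elsewhere in this thread (the checkpoint path is a placeholder):

```yaml
PHASE: ['train']                           # train, not test
RESUME_CHECKPOINT: './weight/author_pretrained.pth'  # placeholder path to the author's weights
TRAIN:
  RESUME_SCOPE: 'base'                     # restore only the backbone from the checkpoint
  TRAINABLE_SCOPE: 'norm,extras,loc,conf'  # everything else trains from scratch
```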

1453042287 avatar Dec 03 '18 08:12 1453042287

@1453042287 Hi, thanks for the advice. My current training seems to be working. In my previous training, I put 'base', 'loc', and so on all in the trainable_scope, and it did not give a good result. After reloading only 'base' and retraining the other parameters, I successfully recovered the precision.

My only remaining problem is the test speed. The NMS in the test procedure seems very slow. It has been discussed in https://github.com/ShuangXieIrene/ssds.pytorch/issues/16, yet with no good solutions so far; a possible workaround is sketched below.
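One common workaround (a sketch using torchvision's compiled NMS, not something this repo ships):

```python
import torch
from torchvision.ops import nms  # C++/CUDA implementation, far faster than a Python loop

def fast_nms(boxes: torch.Tensor, scores: torch.Tensor,
             iou_threshold: float = 0.6, top_k: int = 100) -> torch.Tensor:
    # boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,)
    keep = nms(boxes, scores, iou_threshold)  # indices of kept boxes, sorted by score
    return keep[:top_k]
```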

blueardour avatar Dec 05 '18 03:12 blueardour


@blueardour Hi, below is my test result for fssd_mobilenet_v2 on coco2017, using my own config files instead of the given one, training from scratch without any pre-trained model. Should I reload only the 'base' parameters here?

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.211
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.358
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.217
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.044
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.234
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.351
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.216
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.371
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.099
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.428
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590

cvtower avatar Dec 07 '18 06:12 cvtower

OK... it seems training from scratch might not be well supported. But I just want to use this repo to verify my network architecture, and my ImageNet pre-trained model is still training.

cvtower avatar Dec 07 '18 09:12 cvtower

Yes, setting all parameters to trainable seems to make convergence hard. This year, He et al. published a paper named 'Rethinking ImageNet Pre-training', which claimed that pre-training on ImageNet is not necessary. However, it takes skill to give the network a good initialization; a common recipe is sketched below.
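For what it's worth, a typical "good initialization" for training detection nets from scratch looks something like this (a generic recipe, not this repo's code):

```python
import torch.nn as nn

def init_from_scratch(model: nn.Module):
    # Kaiming init for convolutions and identity-like BatchNorm stats are a
    # common starting point when no ImageNet pre-training is used.
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.ones_(m.weight)
            nn.init.zeros_(m.bias)
```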

blueardour avatar Dec 10 '18 01:12 blueardour


Yes, I agree with you. I read that paper the day it was published. My own designed network outperforms several networks (ImageNet/CIFAR...), though its ImageNet training is still running (72.5 1.0). I have also verified my network on other tasks and it works fine, so I believe it will get better results on detection and segmentation tasks too. Personally, I strongly agree with the views of "DetNet" and "Rethinking ImageNet Pre-training"; however, it seems that much more computation and specific tuning skill are needed. Until my ImageNet training finishes, I will have to compare SSD performance based on models trained from scratch first.

cvtower avatar Dec 10 '18 02:12 cvtower

Hi, @1453042287 @cvtower

I have another issue about the train precision and loss curve. The following is the result from tensorboardX.

[tensorboardX screenshot: precision and loss curves]

It can be seen that the precision increases slowly and then jumps at around the 89th epoch. I don't know why the precision changes so dramatically at this point. The loc and cls losses, as well as the learning rate, do not seem to change much. Have you observed a similar phenomenon, or do you have any explanation for it?

blueardour avatar Dec 12 '18 02:12 blueardour


Hi @blueardour,

I did not use the CosineAnnealing LR, and no such phenomenon ever happened during my training. If your schedule uses warm restarts (SGDR-style), that could explain the jump; see the sketch below.
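For reference, SGDR-style schedules restart the learning rate periodically, which can produce exactly this kind of sudden jump in validation metrics. A minimal sketch with the stock PyTorch scheduler (an illustration, not this repo's implementation):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(10, 2)  # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Anneal the LR over 50-epoch cycles, then restart it at the initial value;
# metrics often dip right after a restart and jump as the LR decays again.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=50)

for epoch in range(150):
    # ... one training epoch here ...
    scheduler.step()
```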

cvtower avatar Dec 12 '18 03:12 cvtower

Hello, may I ask: how did you obtain the pre-trained weight files the author provided? I don't have a weight directory, so I have no pre-trained weight files either. Did you get them some other way? Thank you! @1453042287

XiaSunny avatar Jan 20 '19 03:01 XiaSunny

@XiaSunny Download them... they are right in this repo's README, the blue links.

1453042287 avatar Jan 23 '19 12:01 1453042287

@1453042287 OK, thank you.

XiaSunny avatar Feb 26 '19 01:02 XiaSunny

Hello, I'm using the config file fssd_vgg16_train_coco.yml. When I train on coco2017, conf_loss stays around 5 and loc_loss around 2 and never comes down. My config file is as follows:

```yaml
MODEL:
  SSDS: fssd
  NETS: vgg16
  IMAGE_SIZE: [300, 300]
  NUM_CLASSES: 81
  FEATURE_LAYER: [[[22, 34, 'S'], [512, 1024, 512]],
                  [['', 'S', 'S', 'S', '', ''], [512, 512, 256, 256, 256, 256]]]
  STEPS: [[8, 8], [16, 16], [32, 32], [64, 64], [100, 100], [300, 300]]
  SIZES: [[30, 30], [60, 60], [111, 111], [162, 162], [213, 213], [264, 264], [315, 315]]
  ASPECT_RATIOS: [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2], [1, 2]]

TRAIN:
  MAX_EPOCHS: 500
  CHECKPOINTS_EPOCHS: 1
  BATCH_SIZE: 28
  TRAINABLE_SCOPE: 'norm,extras,transforms,pyramids,loc,conf'
  RESUME_SCOPE: 'base'
  OPTIMIZER:
    OPTIMIZER: sgd
    LEARNING_RATE: 0.001
    MOMENTUM: 0.9
    WEIGHT_DECAY: 0.0001
  LR_SCHEDULER:
    SCHEDULER: SGDR
    WARM_UP_EPOCHS: 150

TEST:
  BATCH_SIZE: 64
  TEST_SCOPE: [90, 100]

MATCHER:
  MATCHED_THRESHOLD: 0.5
  UNMATCHED_THRESHOLD: 0.5
  NEGPOS_RATIO: 3

POST_PROCESS:
  SCORE_THRESHOLD: 0.01
  IOU_THRESHOLD: 0.6
  MAX_DETECTIONS: 100

DATASET:
  DATASET: 'coco'
  DATASET_DIR: '/home/chase/Downloads/ssds.pytorch-master/data/coco'
  TRAIN_SETS: [['2017', 'train']]
  TEST_SETS: [['2017', 'val']]
  PROB: 0.6

EXP_DIR: './experiments/models/fssd_vgg16_coco'
LOG_DIR: './experiments/models/fssd_vgg16_coco'
RESUME_CHECKPOINT: '/home/chase/Downloads/ssds.pytorch-master/weight/vgg16_fssd_coco_27.2.pth'
PHASE: ['train']
```

I also tried RESUME_CHECKPOINT: vgg16_reducedfc.pth, but the result was about the same. This problem has been troubling me for a long time and I don't know what is going on. I hope you can give me some pointers. @1453042287 @blueardour @cvtower

XiaSunny avatar Mar 13 '19 07:03 XiaSunny


@XiaSunny Hello, I've run into the same problem. Have you solved it?

Damon2019 avatar Sep 17 '19 12:09 Damon2019

@1453042287 @XiaSunny Hello, I want to use the pre-trained model.

TRAINABLE_SCOPE: 'base,norm,extras,loc,conf'
RESUME_SCOPE: 'base,norm,extras,loc,conf'

How should I modify these parameters? Thanks!

Damon2019 avatar Sep 18 '19 03:09 Damon2019

TRAINABLE_SCOPE refers to the parts of the network that will be trained; RESUME_SCOPE refers to the parts you want to restore from the pre-trained model. First of all, remove 'conf' from RESUME_SCOPE (because your number of classes is different from the checkpoint's); then check whether anything else needs changing for your actual situation. A sketch of this kind of scope-filtered resume is below.
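A minimal illustration of what scope-filtered resuming amounts to (an assumption about the mechanism, not this repo's exact code; the `resume_scoped` helper and the key prefixes are hypothetical):

```python
import torch

def resume_scoped(model, checkpoint_path, scopes=('base', 'norm', 'extras', 'loc')):
    # Hypothetical helper: keep only parameters whose top-level module name is
    # in `scopes`; 'conf' is deliberately absent so the class-prediction head
    # keeps its fresh initialization for the new number of classes.
    state = torch.load(checkpoint_path, map_location='cpu')
    kept = {k: v for k, v in state.items() if k.split('.')[0] in scopes}
    model.load_state_dict(kept, strict=False)  # strict=False tolerates the skipped keys
    return model
```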

XiaSunny avatar Dec 02 '19 01:12 XiaSunny


Hello, I've recently run into the same problem of the loss not decreasing during training; it stays around 4. I downloaded the model, made no modifications, and just reloaded 'base' for training. May I ask how you finally solved this? Many thanks!

Bobby2090 avatar Feb 06 '21 05:02 Bobby2090