
Huge training loss while fine-tuning a pretrained SSD checkpoint on VOC 2007 dataset

Open · derekhh opened this issue on Apr 19 '17 · 3 comments

Description

tl;dr: The pre-trained SSD checkpoint produces a huge loss when training on the VOC2007 train+val dataset, and it doesn't appear to be anywhere near convergence.

I was trying to use the pre-trained SSD checkpoint here to fine-tune on the VOC 2007 train+val dataset. I understand this doesn't make much sense, since the SSD model is already pre-trained on the VOC dataset, but I wanted to try it as a sanity check after cloning the repo.

Initially I followed this training script. The loss is huge even after the initial step, which seems very strange. I then suspected the learning rate was too high for fine-tuning and lowered it to something like 1e-5 and 1e-6, but I'm still observing a huge training loss with no sign of convergence.

DATASET_DIR=./tfrecords
TRAIN_DIR=./logs/
CHECKPOINT_PATH=./checkpoints/ssd_300_vgg.ckpt
python train_ssd_network.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2012 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --checkpoint_path=${CHECKPOINT_PATH} \
    --save_summaries_secs=60 \
    --save_interval_secs=600 \
    --weight_decay=0.0005 \
    --optimizer=adam \
    --learning_rate=0.001 \
    --batch_size=32

[attached screenshot of the training loss output]

I've then evaluated the SSD checkpoint on both the VOC07 train+val set and the VOC07 test set to make sure the quality of the checkpoint is OK. I'm getting decent mAP values on both, which confuses me even more: I would expect good mAP to go hand in hand with a low loss when training on VOC07 train+val.

I tensorflow/core/kernels/logging_ops.cc:79] AP_VOC07/mAP[0.81998744666155454]
I tensorflow/core/kernels/logging_ops.cc:79] AP_VOC12/mAP[0.85548588208867282]
INFO:tensorflow:Finished evaluation at 2017-04-19-18:02:18
Time spent : 301.644 seconds.
Time spent per BATCH: 0.481 seconds.

Can @balancap help take a look? And thanks again for sharing this repo with us!

derekhh · Apr 19 '17

I am getting similar results, but with a much higher training loss that in some cases never drops below 40, even after ~100k steps.

I am using all default scripts, running with:

python train_ssd_network.py \
    --train_dir=/summary/ssd_300_pascal \
    --dataset_dir=/home/blanca/my_data/voc2007_tfrecords/ \
    --checkpoint_path=./checkpoints/ssd_300_vgg.ckpt \
    --checkpoint_exclude_scopes=ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --save_summaries_secs=60 \
    --save_interval_secs=60 \
    --weight_decay=0.0005 \
    --learning_rate=0.001 \
    --learning_rate_decay_factor=0.96 \
    --batch_size=32 \
    --gpu_memory_fraction=0.8 \
    --num_classes=21

And yet I still achieve a high mAP on the training set (this is just a sanity check; I have not evaluated on the Pascal VOC test set yet):

INFO:tensorflow:Evaluation [5011/5011]
I tensorflow/core/kernels/logging_ops.cc:79] AP_VOC07/mAP[0.82002322122807825]
I tensorflow/core/kernels/logging_ops.cc:79] AP_VOC12/mAP[0.855447542894266]
INFO:tensorflow:Finished evaluation at 2017-05-03-16:47:24
Time spent : 430.635 seconds.
Time spent per BATCH: 0.086 seconds.

Has anyone solved this issue?

villanuevab · May 03 '17

In general, regularization losses are not included in model evaluation, so the loss reported during training greatly overestimates the actual data loss.
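For illustration, here is a minimal TensorFlow 1.x / slim sketch of that split (not the repo's exact code: the toy conv layer is made up, and only the weight_decay value 0.0005 is taken from the commands above). slim attaches an L2 term to the REGULARIZATION_LOSSES collection for every regularized weight; the training script logs the sum of the data losses and these terms, while evaluation only scores the predictions (mAP):

import tensorflow as tf                 # TensorFlow 1.x API, as used by this repo
import tensorflow.contrib.slim as slim

# Toy stand-in for an SSD tower: a single conv layer with the same
# weight decay as the training command (--weight_decay=0.0005).
images = tf.placeholder(tf.float32, [None, 300, 300, 3])
labels = tf.placeholder(tf.int32, [None])
with slim.arg_scope([slim.conv2d],
                    weights_regularizer=slim.l2_regularizer(0.0005)):
    net = slim.conv2d(images, 64, [3, 3], scope='conv1')
logits = slim.fully_connected(slim.flatten(net), 21, activation_fn=None)

# Data loss: the part that good mAP actually reflects.
data_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Weight-decay loss: one L2 term per regularized weight tensor. For a
# VGG-sized network this sum is large and changes very slowly.
reg_loss = tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))

# The number printed during training is the sum of both, which is why it
# can sit far above zero even for a checkpoint that already reaches ~0.82 mAP.
total_loss = data_loss + reg_loss

Plotting the data loss and the regularization term separately (e.g., in TensorBoard) shows how much of the reported total is just the weight-decay term rather than a sign that the model is diverging.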

nlngh · Jul 16 '17


@derekhh Can you share your eval.sh script? I'm wondering whether you evaluated the checkpoint before training.

qianweilzh · Aug 24 '21