training voc2007

Open SeougnSeon opened this issue 7 years ago • 36 comments

Your work is very nice. I have a question about training. I trained on voc_2007_train and got the total loss below. [image: total loss curve]

The total loss does not converge. When I use the Caffe version of SSD, the loss converges easily.

Did the loss converge for you on voc_2007_train? Here is the detection result with the trained model: [image: detection result]

SeougnSeon avatar Mar 24 '17 05:03 SeougnSeon

I added more training data by combining VOC2007 and VOC2012. The total loss still does not converge. [images: total loss curves]

I used ssd_300_vgg as the pre-trained weights (fine-tuning existing SSD checkpoints) instead of vgg_16 (fine-tuning a network trained on ImageNet). I will try training from vgg_16 next. I think your work is very good for learning TF-slim.

SeougnSeon avatar Mar 27 '17 01:03 SeougnSeon

I also have this problem. The total loss stays above 6.0 for a long time.

chenweiqian avatar Mar 28 '17 01:03 chenweiqian

I got the same problem with the vgg_16 setting (fine-tuning a network trained on ImageNet). [image: total loss curve] The loss is even larger than with ssd_300_vgg.

SeougnSeon avatar Mar 28 '17 02:03 SeougnSeon

I'm also having a problem converging with vgg_16. What mAP values did you achieve on evaluation?

edocoh87 avatar Mar 28 '17 11:03 edocoh87

My mAP is close to zero. What mAP values did you achieve?

chenweiqian avatar Mar 28 '17 16:03 chenweiqian

I didn't run the evaluation set after training; the training was just a temporary run to check that the code works.

The detection image is from the ipynb notebook code.

If the detection image is correct, my next step is to evaluate mAP.

SeougnSeon avatar Mar 28 '17 23:03 SeougnSeon

I got the same convergence problems. To help a bit, I used a fixed learning rate, but the loss still does not converge.

christopher5106 avatar Mar 29 '17 11:03 christopher5106

I got an mAP of 0.27 after a couple of days on a 4-GPU machine. Note that the reported results are for training on 2007+2012 (which I'm running at the moment).

edocoh87 avatar Mar 30 '17 13:03 edocoh87

I got the same convergence problems: the global step is about 8000, but the loss is still around 6 and the mAP is 0.026. I have no idea what is wrong.

zhyhan avatar Apr 01 '17 23:04 zhyhan

I got the same problem too. I really want to train from the ImageNet-pretrained model.

youngwanLEE avatar Apr 04 '17 02:04 youngwanLEE

@edocoh87 Did you train the SSD model using 4 GPUs?

youngwanLEE avatar Apr 04 '17 07:04 youngwanLEE

Yes, 4 Titan X

edocoh87 avatar Apr 04 '17 07:04 edocoh87

@edocoh87 Could you share your training script?

youngwanLEE avatar Apr 04 '17 07:04 youngwanLEE

@edocoh87

DATASET_DIR=/home/ywlee/data/VOC2007_TFRecords
TRAIN_DIR=./logs/vgg_300_0404
CHECKPOINT_PATH=./checkpoints/vgg_16.ckpt
python train_ssd_network.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --checkpoint_path=${CHECKPOINT_PATH} \
    --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --checkpoint_model_scope=vgg_16 \
    --save_summaries_secs=60 \
    --save_interval_secs=600 \
    --weight_decay=0.0005 \
    --optimizer=rmsprop \
    --learning_rate=0.005 \
    --num_epochs_per_decay=10 \
    --batch_size=32 \
    --max_number_of_steps=200000 \
    --num_clones=4

When I set the num_clones=4 argument in the command script, I got this error:

InvalidArgumentError (see above for traceback): Cannot assign a device to node 'clone_3/ssd_losses/block_5/cross_entropy_neg/SparseSoftmaxCrossEntropyWithLogits/Gather': Could not satisfy explicit device specification '/device:GPU:3' because no supported kernel for GPU devices is available.
         [[Node: clone_3/ssd_losses/block_5/cross_entropy_neg/SparseSoftmaxCrossEntropyWithLogits/Gather = Gather[Tindices=DT_INT32, Tparams=DT_INT32, validate_indices=true, _device="/device:GPU:3"](clone_3/ssd_losses/block_5/cross_entropy_neg/SparseSoftmaxCrossEntropyWithLogits/Shape_1, clone_3/ssd_losses/block_5/cross_entropy_neg/SparseSoftmaxCrossEntropyWithLogits/sub)]]

Could you let me know how to set up multi-GPU training?
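
A common workaround for this kind of device-placement failure is to enable soft placement, which lets ops without a GPU kernel (here, an int32 Gather inside the sparse softmax cross-entropy) fall back to the CPU. A minimal sketch; whether train_ssd_network.py already exposes an option for this is an assumption to verify:

import tensorflow as tf

# Let ops with no GPU kernel run on the CPU instead of raising
# "Could not satisfy explicit device specification".
config = tf.ConfigProto(allow_soft_placement=True)

with tf.device('/device:GPU:3'):
    # int32 Gather historically lacked a GPU kernel; with soft
    # placement it silently runs on the CPU.
    params = tf.constant([10, 20, 30], dtype=tf.int32)
    gathered = tf.gather(params, tf.constant([0, 2]))

with tf.Session(config=config) as sess:
    print(sess.run(gathered))  # [10 30]

With TF-slim, the same config can be passed into the training loop via slim.learning.train(..., session_config=config).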

youngwanLEE avatar Apr 04 '17 07:04 youngwanLEE

I also got the same problem. Has anyone solved it?

paolutan avatar Apr 05 '17 09:04 paolutan

I have the same problem. Does anyone have a solution?

ggookey123 avatar Apr 07 '17 12:04 ggookey123

I am currently experimenting with how to fix the training. I set up a dedicated branch, fix_training. A few things I have noticed so far:

  • use a very simple data pre-processing pipeline at the beginning to test out the training script;
  • use the trainable_scopes argument to train only the new parts of the network first; then, in a second pass, fine-tune the full network (see the sketch below).
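
A trainable_scopes flag of this kind is typically applied by filtering the variables handed to the optimizer. A minimal sketch of the idea (the function name is hypothetical, not necessarily what tf_utils.py does):

import tensorflow as tf

def get_variables_to_train(trainable_scopes):
    """Keep only variables whose name matches one of the
    comma-separated scope prefixes; train everything if empty."""
    if not trainable_scopes:
        return tf.trainable_variables()
    variables = []
    for scope in [s.strip() for s in trainable_scopes.split(',')]:
        variables.extend(
            tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope))
    return variables

The resulting list can then be passed as variables_to_train to slim.learning.create_train_op, so gradients are only computed for the new SSD layers during the first stage.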

I also changed the loss function to copy the SSD Caffe settings exactly.

balancap avatar Apr 09 '17 18:04 balancap

I kept training for 80,000 steps (fine-tuning from ssd_300_vgg.ckpt) and found that although the loss stays between 3.0 and 6.0 most of the time, the mAP keeps increasing. In the end, I achieved 70% mAP on VOC07 and 72% mAP on VOC12.

I suspect this training process is correct but just converges very slowly.

paolutan avatar Apr 10 '17 03:04 paolutan

@ithink2 Thanks for the testing. I am working on fixing this training problem, aiming for at least ~0.7 mAP starting from the VGG weights. Things are getting a bit better (you can have a look at the fix_training branch). I implemented hard negative mining equivalent to SSD Caffe's, and I am looking at how to improve the data augmentation part.
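
For reference, SSD-style hard negative mining keeps every positive anchor plus only the highest-loss negatives, at a roughly 3:1 negative-to-positive ratio. A minimal sketch with hypothetical names, not the repo's actual implementation:

import tensorflow as tf

def hard_negative_mining(conf_loss, pos_mask, neg_ratio=3.0):
    """conf_loss: [num_anchors] per-anchor cross-entropy loss.
    pos_mask: [num_anchors] bool, True for anchors matched to a GT box.
    Returns the losses of the anchors kept for the final loss."""
    n_pos = tf.reduce_sum(tf.cast(pos_mask, tf.int32))
    n_neg = tf.minimum(
        tf.cast(neg_ratio * tf.cast(n_pos, tf.float32), tf.int32),
        tf.size(conf_loss) - n_pos)
    n_neg = tf.maximum(n_neg, 1)  # avoid top_k with k=0

    # Zero out positives, then keep the k largest negative losses.
    neg_loss = tf.where(pos_mask, tf.zeros_like(conf_loss), conf_loss)
    top_vals, _ = tf.nn.top_k(neg_loss, k=n_neg)
    neg_mask = tf.logical_and(tf.logical_not(pos_mask),
                              neg_loss >= top_vals[-1])
    return tf.boolean_mask(conf_loss, tf.logical_or(pos_mask, neg_mask))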

balancap avatar Apr 10 '17 06:04 balancap

I got this training result with vgg_16 (fine-tuning a network trained on ImageNet) on VOC07 after 4 days on one GPU.

[images: training curves]

train script :

DATASET_DIR=/home/ywlee/data/VOC2007_TFRecords
CHECKPOINT_PATH=./checkpoints/vgg_16.ckpt
python train_ssd_network.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --checkpoint_path=${CHECKPOINT_PATH} \
    --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --checkpoint_model_scope=vgg_16 \
    --save_summaries_secs=60 \
    --save_interval_secs=600 \
    --weight_decay=0.0005 \
    --optimizer=rmsprop \
    --learning_rate=0.001 \
    --num_epochs_per_decay=200 \
    --batch_size=32 \
    --learning_rate_decay_factor=0.94 

evaluation script :

TRAIN_DIR=/home/ywlee/SSD-Tensorflow/logs/vgg_300_0405/model.ckpt-468031
DATASET_DIR=/home/ywlee/data/VOC2007_TFRecords
EVAL_DIR=${TRAIN_DIR}/eval
python eval_ssd_network.py \
    --eval_dir=${EVAL_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=test \
    --model_name=ssd_300_vgg \
    --checkpoint_path=${TRAIN_DIR} \
    --batch_size=1

But the mAPs are:

I tensorflow/core/kernels/logging_ops.cc:79] AP_VOC07/mAP[0.00016903662625678594]
I tensorflow/core/kernels/logging_ops.cc:79] AP_VOC12/mAP[1.9445653552220332e-05]

I couldn't figure out why the mAPs are so low.

youngwanLEE avatar Apr 10 '17 07:04 youngwanLEE

Hello @balancap! Would you mind telling me how the pretrained model in 'checkpoint_path' is restored? The 'checkpoint_path' flag in your train_ssd_network.py seems to be unused after its declaration.

SunAriesCN avatar Apr 10 '17 12:04 SunAriesCN

@SunAriesCN see line 378 in train_ssd_network.py; in particular, see init_fn=tf_utils.get_init_fn(FLAGS). get_init_fn() on line 186 of tf_utils.py loads the latest checkpoint. There should also be an INFO-level TF logging/print statement to sanity-check that get_init_fn() loaded the correct checkpoint.
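
An init function like this is usually built with slim.assign_from_checkpoint_fn. A rough sketch of the idea only, not a copy of tf_utils.py:

import tensorflow as tf
slim = tf.contrib.slim

def get_init_fn(checkpoint_path, checkpoint_exclude_scopes=''):
    """Restore all model variables except the excluded scopes."""
    exclusions = [s.strip()
                  for s in checkpoint_exclude_scopes.split(',') if s]
    to_restore = [v for v in slim.get_model_variables()
                  if not any(v.op.name.startswith(e) for e in exclusions)]
    # ignore_missing_vars lets the VGG-16 ImageNet weights load even
    # though the SSD head variables do not exist in that checkpoint.
    return slim.assign_from_checkpoint_fn(checkpoint_path,
                                          to_restore,
                                          ignore_missing_vars=True)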

@balancap Can you share your learning_rate TensorBoard chart as well as total_loss TensorBoard chart? I am experiencing similar behavior, not converging on training data even after 2+ days.

villanuevab avatar Apr 25 '17 18:04 villanuevab

@villanuevab thank you, I found it. But I noticed another problem with the SSD loss function: the divisor under the smooth L1 and softmax losses does not seem to be the number of matched default boxes, as in the paper; it is just the batch size. Can someone explain the reason for this?
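
For comparison, the paper normalizes the weighted sum of the confidence and localization losses by N, the number of matched default boxes, and sets the loss to 0 when N = 0. A sketch with hypothetical names:

import tensorflow as tf

def ssd_total_loss(conf_loss, loc_loss, pos_mask, alpha=1.0):
    """conf_loss, loc_loss: scalar summed losses over the batch.
    pos_mask: bool mask over anchors, True where matched to a GT box."""
    n_matched = tf.reduce_sum(tf.cast(pos_mask, tf.float32))
    # Divide by the number of matched default boxes, not the batch size.
    return tf.where(n_matched > 0.0,
                    (conf_loss + alpha * loc_loss) / n_matched,
                    tf.constant(0.0))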

SunAriesCN avatar Apr 27 '17 01:04 SunAriesCN

@SeougnSeon @edocoh87 @balancap just checking on the latest status of training with this codebase. I am going to try to go through the code and fix issues, but wanted to check here before I spend the time.

siddharthm83 avatar Jun 05 '17 19:06 siddharthm83

I potentially found one bug, though fixing it still doesn't help training. The matching of anchor boxes to ground-truth boxes has a bug: https://github.com/balancap/SSD-Tensorflow/blob/master/nets/ssd_common.py#L113-L114 @balancap why is it -0.5? Shouldn't line 113 be correct? You have commented it out. The matching strategy is also different from the paper: the paper ensures that each GT box has at least one matched anchor, and I couldn't find this in your code, although I would still expect the loss to converge independently of this. Any thoughts welcome; in the meantime, I'll keep digging.
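
For reference, the paper's matching strategy is two-step: first force-match each GT box to its best-overlap anchor (so every GT gets at least one anchor), then also match any anchor whose Jaccard overlap with some GT exceeds 0.5. A NumPy sketch of that strategy, not the repo's code:

import numpy as np

def match_anchors(iou, threshold=0.5):
    """iou: [num_gt, num_anchors] pairwise Jaccard overlaps.
    Returns [num_anchors] matched GT index, or -1 for negatives."""
    num_gt, num_anchors = iou.shape
    match = -np.ones(num_anchors, dtype=np.int64)

    # Step 2: per-anchor best GT, kept only above the threshold.
    best_gt = iou.argmax(axis=0)
    keep = iou.max(axis=0) > threshold
    match[keep] = best_gt[keep]

    # Step 1 overrides: the best anchor of each GT is always matched.
    match[iou.argmax(axis=1)] = np.arange(num_gt)
    return match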

siddharthm83 avatar Jun 08 '17 01:06 siddharthm83

@siddharthm83, based on Paul's great code implementation, I made some changes and was able to make the training process work to some degree.

[images: train_eval and total_loss charts]

The SSD model is initialized with VGG-16 weights trained on ImageNet. The training data is VOC 2007 and 2012 trainval, the test data is VOC 2007 test, and the final test accuracy is 0.65.

If you are interested, you can see here for more details.

LevinJ avatar Jun 29 '17 07:06 LevinJ

@LevinJ Good job! Could you please list the changes you made?

Zehaos avatar Jun 29 '17 07:06 Zehaos

Sure, @Zehaos. I listed the major changes I made in the Experimentation section of this link.

LevinJ avatar Jun 29 '17 07:06 LevinJ

@LevinJ Very clear! Thanks.

Zehaos avatar Jun 29 '17 07:06 Zehaos

@LevinJ can you teach me how to train on my own data (thousands of pictures, detecting only one object class)? I followed balancap's fine-tuning method, training from the pretrained vgg_16, but the loss does not converge (it also hovers around 4.0).

### What's the right way to train on my own data and use it to detect my own object?

"./tfrecords/voc2007" is the path I created for my own data (1920x1080). My training script is:

DATASET_DIR=./tfrecords/voc2007
TRAIN_DIR=./logs/my_chkp
CHECKPOINT_PATH=./checkpoints/vgg_16.ckpt

python3.4 train_ssd_network.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --checkpoint_path=${CHECKPOINT_PATH} \
    --checkpoint_model_scope=vgg_16 \
    --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --save_summaries_secs=60 \
    --save_interval_secs=600 \
    --weight_decay=0.0005 \
    --optimizer=adam \
    --learning_rate=0.001 \
    --learning_rate_decay_factor=0.94 \
    --batch_size=64

seasonyang avatar Aug 18 '17 03:08 seasonyang