SSD-Tensorflow

Training on new data set

Open Peilin-Yang opened this issue 7 years ago • 29 comments

Hi, first, thanks for your hard work implementing SSD in TensorFlow, big credit! Now I want to train my own model on another data set.

  1. It contains only 2 classes: background (0) and the foreground object (1).
  2. The foreground objects are generally small (an object might occupy less than 5% of the whole image).

I created my own dataset under ./dataset and started training with the appropriate num_classes and no_annotation_label (both set to 2). The training does not seem to work: the loss has stayed above 1.0 for a very long time (1 day). I am wondering whether I need to change other parameters to make it work. Any suggestions? Thanks,
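As a rough back-of-the-envelope check on the "less than 5% of the image" figure, the sketch below converts an area fraction into a side length in pixels and compares it with a set of anchor sizes. The 300-pixel input size and the anchor sizes are assumptions chosen for illustration; they may not match the configuration actually used in the repository.

```python
import math

# Assumed SSD300-style anchor base sizes in pixels (illustrative values only).
ASSUMED_ANCHOR_SIZES = [21.0, 45.0, 99.0, 153.0, 207.0, 261.0]


def object_side_px(area_fraction, image_size=300):
    """Side length (pixels) of a square object covering `area_fraction` of a square image."""
    return math.sqrt(area_fraction) * image_size


def closest_anchor(side_px, anchor_sizes=ASSUMED_ANCHOR_SIZES):
    """Return the assumed anchor size closest to the object's side length."""
    return min(anchor_sizes, key=lambda size: abs(size - side_px))


side = object_side_px(0.05)      # object covering 5% of the image area
nearest = closest_anchor(side)   # anchor size nearest to that object
print(round(side, 1), nearest)
```

Under these assumptions, objects much smaller than 5% of the image would fall below the smallest anchors, which is one possible reason small objects are hard to match.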

Peilin-Yang avatar Feb 27 '17 18:02 Peilin-Yang

Hi, thanks :) Took me a bit of time to make it, but I guess it is worth it!

The training script is not entirely stable. You may want to tweak the loss function in ssd_vgg_300.py a bit, in particular the alpha and negative_ratio parameters.
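To make the role of these two parameters concrete, here is a minimal sketch of how they typically enter an SSD-style loss as described in the paper: negative_ratio caps how many hard negatives are kept per positive anchor, and alpha weights the localization term. The function name and the per-anchor list inputs are hypothetical; this is not the code in ssd_vgg_300.py.

```python
def ssd_loss_sketch(cls_losses, loc_losses, is_positive,
                    negative_ratio=3.0, alpha=1.0):
    """Combine per-anchor losses in the style of the SSD paper (illustrative only).

    cls_losses:  per-anchor cross-entropy values
    loc_losses:  per-anchor localization losses (only positives are used)
    is_positive: per-anchor booleans, True if matched to a ground-truth box
    """
    n_pos = sum(is_positive)
    pos_cls = sum(c for c, p in zip(cls_losses, is_positive) if p)
    pos_loc = sum(l for l, p in zip(loc_losses, is_positive) if p)

    # Hard negative mining: keep only the negatives with the largest
    # classification loss, at most negative_ratio per positive anchor.
    negatives = sorted((c for c, p in zip(cls_losses, is_positive) if not p),
                       reverse=True)
    n_neg = min(len(negatives), int(negative_ratio * n_pos))
    neg_cls = sum(negatives[:n_neg])

    # Normalize by the number of positives (guarding against division by zero).
    return (pos_cls + neg_cls + alpha * pos_loc) / max(n_pos, 1)


example_loss = ssd_loss_sketch(
    cls_losses=[0.5, 2.0, 0.1, 1.0, 0.3],
    loc_losses=[0.2, 0.0, 0.0, 0.0, 0.0],
    is_positive=[True, False, False, False, False],
    negative_ratio=2.0,
    alpha=1.0,
)
```

Raising negative_ratio lets more background anchors into the loss, while raising alpha gives box regression more weight relative to classification.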

Did you use a pre-trained checkpoint for training, or did you train from scratch? I guess in the latter case it may be quite hard to converge. I just added a short description to the README of how to fine-tune using VGG weights. I hope it helps.

balancap avatar Feb 27 '17 21:02 balancap

Thanks for the additional info!

I guess I need to name the checkpoint file ssd_vgg_300?

Peilin-Yang avatar Feb 27 '17 22:02 Peilin-Yang

Do you mean rename it? You just need to download the checkpoint of the VGG-16 model and use it in the training command. Hopefully it will work!

balancap avatar Feb 28 '17 09:02 balancap

Hey, thanks for sharing this implementation!

I have a few questions. Have you trained SSD300 on VOC data, initialized from the VGG16 feature extractor, with your implementation? The parameters in the checkpoint look like a converted version of the Caffe parameters (correct me if I'm wrong). I'm running a simple experiment: SSD300 on the VOC2012 dataset, initialized from ssd_300_vgg.ckpt. After 100 training steps, the prediction results are worse than those I get from the ssd_300_vgg.ckpt checkpoint itself.

Thanks.

taras-sereda avatar Mar 01 '17 18:03 taras-sereda

Hello,

Yes, the checkpoints are directly converted from the Caffe implementation. The training script is not yet as advanced as the Caffe one, which explains your results (I got that too). I'll try to investigate a bit more how to improve it, including:

  • data augmentation: my pre-processing is a bit rough;
  • the loss function: I just implemented what is described in the paper. I'll check whether the Caffe code differs on that;
  • hyper-parameters! :)

Tell me if you have any ideas how to make it better!

balancap avatar Mar 02 '17 13:03 balancap

Hi, thanks for the answer.

I have a few questions about your code:

  • variance for encoding bounding boxes: nets.ssd_common.tf_ssd_bboxes_encode_layer
    feat_cy = (feat_cy - yref) / href / prior_scaling[0]
    feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
    feat_h = tf.log(feat_h / href) / prior_scaling[2]
    feat_w = tf.log(feat_w / wref) / prior_scaling[3]

Why is it necessary to scale the values in this way?

  • negative Jaccard values. I assume they are present because of the no_label boxes, right? If so, how exactly do you use no-label boxes? Is it a way of passing frames that contain none of the classes we would like to detect, with randomly generated bounding boxes having the no_label class?
        mask = tf.greater(jaccard, feat_scores)
        # mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
        mask = tf.logical_and(mask, feat_scores > -0.5)
        mask = tf.logical_and(mask, label < num_classes)
        imask = tf.cast(mask, tf.int64)
        fmask = tf.cast(mask, dtype)

Thanks for your thoughts and advice.

taras-sereda avatar Mar 02 '17 16:03 taras-sereda

Hi! For the scaling, the idea is to scale the targets so that all error terms (classification + position + size) have roughly the same magnitude. Otherwise, the training would tend to over-optimise one component at the expense of the others.
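A small worked example may help show the effect of the scaling. The sketch below mirrors the encoding formulas quoted above and adds the matching decoding step. The prior_scaling values (0.1, 0.1, 0.2, 0.2) and the example boxes are assumptions for illustration, and the function names are hypothetical.

```python
import math

# Assumed prior_scaling values, in the order (cy, cx, h, w).
PRIOR_SCALING = (0.1, 0.1, 0.2, 0.2)


def encode(box, anchor, scaling=PRIOR_SCALING):
    """Encode a (cy, cx, h, w) box relative to a (yref, xref, href, wref) anchor."""
    cy, cx, h, w = box
    yref, xref, href, wref = anchor
    return (
        (cy - yref) / href / scaling[0],
        (cx - xref) / wref / scaling[1],
        math.log(h / href) / scaling[2],
        math.log(w / wref) / scaling[3],
    )


def decode(target, anchor, scaling=PRIOR_SCALING):
    """Invert `encode`: recover the (cy, cx, h, w) box from its encoded target."""
    yref, xref, href, wref = anchor
    return (
        target[0] * scaling[0] * href + yref,
        target[1] * scaling[1] * wref + xref,
        href * math.exp(target[2] * scaling[2]),
        wref * math.exp(target[3] * scaling[3]),
    )


anchor = (0.5, 0.5, 0.2, 0.2)
box = (0.52, 0.5, 0.2, 0.3)
target = encode(box, anchor)
recovered = decode(target, anchor)
```

Here a center shift of one tenth of the anchor height becomes a target of about 1.0, so dividing by a small scaling factor brings the small position offsets up to the same order of magnitude as the other loss terms.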

Exactly, the negative values are used to mark the anchors with no annotations. The idea comes from the KITTI dataset, where some parts of the images are flagged as not labelled: there may be a car, a person, etc. in these parts, but it has not been annotated. If you don't keep track of these parts, the SSD model may detect an object that is not annotated, the loss function will treat it as a false positive, and training will push the model not to detect it, which is not really what we want! So basically, I set up a mask such that the loss function ignores the anchors that overlap too much with the non-annotated parts of the images. I hope that is a bit clearer! I guess I should add some documentation about that.
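The masking idea can be sketched as follows, assuming boxes are given as (ymin, xmin, ymax, xmax) and that "overlaps too much" means more than half of the anchor's area lies inside a non-annotated region. Both the overlap measure and the 0.5 threshold are assumptions for illustration, not the repository's exact rule, and the function names are hypothetical.

```python
def intersection_area(a, b):
    """Area of overlap between two (ymin, xmin, ymax, xmax) boxes."""
    height = min(a[2], b[2]) - max(a[0], b[0])
    width = min(a[3], b[3]) - max(a[1], b[1])
    return max(height, 0.0) * max(width, 0.0)


def box_area(box):
    """Area of a (ymin, xmin, ymax, xmax) box."""
    return (box[2] - box[0]) * (box[3] - box[1])


def loss_weights(anchors, ignore_regions, threshold=0.5):
    """Return 0.0 for anchors mostly inside a non-annotated region, else 1.0."""
    weights = []
    for anchor in anchors:
        covered = max(
            (intersection_area(anchor, region) / box_area(anchor)
             for region in ignore_regions),
            default=0.0,
        )
        weights.append(0.0 if covered > threshold else 1.0)
    return weights


anchors = [(0.0, 0.0, 0.2, 0.2), (0.5, 0.5, 0.9, 0.9)]
ignore_regions = [(0.4, 0.4, 1.0, 1.0)]
weights = loss_weights(anchors, ignore_regions)
```

Multiplying each anchor's loss by its weight means detections inside the non-annotated region are neither rewarded nor penalised.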

balancap avatar Mar 03 '17 10:03 balancap

Hey! Thanks for explanation!

The no-annotation label is used to avoid false positives on those image regions during training. That's clear.

I'm working on training SSD with empty frames that contain only background. As far as I understand, it should be sufficient to supply empty lists of labels and bboxes, which should result in a contribution to the loss only from the negative cross-entropy part.
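One thing worth checking in this setting is how many negative anchors the loss actually keeps when there are no positives. The sketch below uses a hypothetical rule with an optional floor (min_negatives); whether the repository's loss includes such a floor is not shown in this thread, so treat it as an assumption. The figure 8732 is the number of default boxes of SSD300 in the paper.

```python
def num_negatives(n_pos, n_candidates, negative_ratio=3.0, min_negatives=0):
    """Number of hard negatives kept for one batch (illustrative rule only)."""
    wanted = int(negative_ratio * n_pos) + min_negatives
    return min(wanted, n_candidates)


# Frame with 4 matched anchors: 3 negatives are kept per positive.
with_objects = num_negatives(n_pos=4, n_candidates=8732)

# Empty frame, no floor: no negatives are kept, so the frame adds nothing.
empty_no_floor = num_negatives(n_pos=0, n_candidates=8732)

# Empty frame with a floor: some negatives still contribute to the loss.
empty_with_floor = num_negatives(n_pos=0, n_candidates=8732, min_negatives=32)
```

If the number of negatives is purely proportional to the number of positives, an empty frame contributes nothing, so some additive term of this kind is needed for background-only frames to have any effect.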

Have you tried training in this setting?

taras-sereda avatar Mar 09 '17 17:03 taras-sereda

@Peilin-Yang I am training my own model on the WIDER FACE dataset. It contains only 2 classes: background (0) and the face object (1). But I have a big problem (see the attached image), and I don't know why. How did you modify the code? Thanks!

chenweiqian avatar Mar 29 '17 14:03 chenweiqian

@chenweiqian Hi, honestly I got results similar to yours :(

Peilin-Yang avatar Apr 03 '17 17:04 Peilin-Yang

@Peilin-Yang what's your loss? My loss is always above 5 and mAP is close to zero. Is my dataset wrong?

chenweiqian avatar Apr 08 '17 11:04 chenweiqian

@chenweiqian Yeah...I think mine was similar to yours. Sorry I do not know how to help on this issue since I am not an expert.

Peilin-Yang avatar Apr 09 '17 00:04 Peilin-Yang

@Peilin-Yang Thanks for your answer!

chenweiqian avatar Apr 09 '17 00:04 chenweiqian

I have similar errors to @chenweiqian and @Peilin-Yang when training on the KITTI dataset (importing the KITTI interface written by @balancap from SDC-Vehicle-Detection/datasets/kitti*).

[screenshot, 2017-04-26 at 11:13:43 am]

The mAP I get when evaluating with eval_ssd_network.py on KITTI is ~31%.

Low prediction scores and poor localization can be seen in this sample image: [screenshot, 2017-05-01 at 12:17:57 pm]

villanuevab avatar May 01 '17 19:05 villanuevab

How did you come up with the prior scaling values?

blake-varden avatar May 12 '17 17:05 blake-varden

@Peilin-Yang I have an issue similar to yours.

  1. Only two classes: background (0) and my object (1). My object is very small (on average 5% of the whole image).
  2. My loss does not converge; it is always above 4.0.

Have you solved it? Can you tell me how?

seasonyang avatar Aug 18 '17 06:08 seasonyang

@seasonyang @Peilin-Yang @balancap This SSD project is so good, and I have learned a lot from it. However, I ran into some problems when training on my own dataset. I also have 2 classes: one is "ID number", the other is "Name". I changed the tag in "ssd_common" and wrote my own dataset description Python script under /dataset. Everything goes smoothly if I don't change num_classes; at least the code runs. However, if I change num_classes to 2, an error pops up saying something is wrong with the tensor shape. Here is part of what I get:

Caused by op u'save_1/Assign_4', defined at:
  File "train_ssd_network.py", line 402, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "train_ssd_network.py", line 390, in main
    init_fn=tf_utils.get_init_fn(FLAGS),
  File "/ssd_tensorflow/tf_utils.py", line 239, in get_init_fn
    ignore_missing_vars=flags.ignore_missing_vars)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 658, in assign_from_checkpoint_fn
    saver = tf_saver.Saver(var_list, reshape=reshape_variables)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1140, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1172, in build
    filename=self._filename)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 688, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 419, in _AddRestoreOps
    assign_ops.append(saveable.restore(tensors, shapes))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 155, in restore
    self.op.get_shape().is_fully_defined())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/state_ops.py", line 274, in assign
    validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_state_ops.py", line 43, in assign
    use_locking=use_locking, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [16] rhs shape= [84]
  [[Node: save_1/Assign_4 = Assign[T=DT_FLOAT, _class=["loc:@ssd_300_vgg/block10_box/conv_cls/biases"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](ssd_300_vgg/block10_box/conv_cls/biases, save_1/RestoreV2_4)]]

Would you please tell me what is going on? Or tell me how you got it to work? Thank you so much!

HoracceFeng avatar Aug 28 '17 10:08 HoracceFeng

@HoracceFeng I have the same problem..... Did you resolve this problem already?

oowe avatar Sep 05 '17 10:09 oowe

@oowe @HoracceFeng I have the same problem too. Have you found a solution?

ghost avatar Nov 19 '17 14:11 ghost

@mosab-r @oowe Hi, the problem has been solved. In this code, the background must also be counted as a class, which means that if you have 4 object classes, you should set num_classes = 5, not 4.
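The shapes in the error message can be read with a bit of arithmetic. The sketch below assumes that the bias of a classification layer has one entry per (anchor, class) pair and that the layer in question uses 4 anchors per location; under that assumption, 84 = 4 x 21 would correspond to a checkpoint saved with 21 classes and 16 = 4 x 4 to a graph built with 4 classes. The helper names are hypothetical, and the scope-filtering function only illustrates the idea of skipping class-dependent layers when restoring; it is not the repository's API.

```python
def num_classes_with_background(num_object_classes):
    """num_classes as expected when background counts as class 0."""
    return num_object_classes + 1


def cls_bias_size(num_anchors, num_classes):
    """Assumed size of a classification bias: one entry per anchor and class."""
    return num_anchors * num_classes


def variables_to_restore(var_names, exclude_scopes):
    """Drop variables whose name starts with any excluded scope prefix."""
    return [name for name in var_names
            if not any(name.startswith(scope) for scope in exclude_scopes)]


checkpoint_size = cls_bias_size(num_anchors=4, num_classes=21)
graph_size = cls_bias_size(num_anchors=4, num_classes=4)

names = [
    "ssd_300_vgg/conv1/conv1_1/weights",
    "ssd_300_vgg/block10_box/conv_cls/biases",
]
kept = variables_to_restore(names, exclude_scopes=["ssd_300_vgg/block10_box"])
```

Whenever num_classes differs from the value the checkpoint was trained with, every class-dependent layer has a mismatched shape, so those layers need to be left out of the restore and trained from their initial values.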

HoracceFeng avatar Nov 19 '17 15:11 HoracceFeng

@HoracceFeng @oowe @mosab-r @seasonyang I have the same issue when training with only one class. When I set num_classes to 2, the model does not seem to learn. I trained on two of my own datasets and on VOC07, from which I selected only the person class.

At first I set lr = 0.001; the loss dropped rapidly in the first 2k steps and then literally stopped. So I tried raising lr to 0.01, 0.1, 0.5, 0.9 and 0.99, and the loss started dropping again. But looking at the histograms, the weights tend to 0, and mAP stays at 0.

Please let me know if you solve the problem.

I am currently training an adapted version. I rewrote SSD_tensorflow_VOC by @LevinJ, which he adapted from this repo, as modular_SSD_tensorflow. I am training it on VOC0712 trainval and hope to fine-tune it for my purpose. The current problem is that it only uses one GPU for training. I haven't tried training directly on one class. Please try it out and tell me your results.

yu-jingrui avatar Nov 30 '17 14:11 yu-jingrui

@wangsihfu @balancap I have the same question as you. I am new to detection and have read the YOLO and SSD papers. I understand some parts of this code, but understanding all of it is still difficult for me. For now I just want to run the code on my own dataset with 2 classes and shape (512, 512).

I want to train the network. What should I do? Which part of the code should I change? And how do I make my own tfrecord?

Thanks a lot!

Salight avatar Jun 25 '18 08:06 Salight

Hi @Peilin-Yang, I would like to apply SSD to detect small objects in images. I am just wondering whether you got the code working for small objects? I have tried the method described here:

https://github.com/balancap/SSD-Tensorflow/issues/222

However, it doesn't seem to work for me. When I visualize the loss in TensorBoard, both the "cross_entropy_positive loss" and the "localization loss" stay at zero during training. Do you have any suggestions? I appreciate it!
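One possible cause of both losses staying at zero (not confirmed in this thread) is that no anchor is ever matched to a ground-truth box, so there are no positive anchors to compute those terms on. A quick diagnostic sketch, assuming boxes in (ymin, xmin, ymax, xmax) form and a Jaccard matching threshold of 0.5 as in the SSD paper; the function names and the example anchors are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap (intersection over union) of two (ymin, xmin, ymax, xmax) boxes."""
    inter_h = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    inter_w = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_positive_anchors(anchors, gt_boxes, threshold=0.5):
    """Count anchors whose overlap with any ground-truth box exceeds the threshold."""
    return sum(
        1 for anchor in anchors
        if any(jaccard(anchor, gt) > threshold for gt in gt_boxes)
    )


anchors = [(0.0, 0.0, 0.3, 0.3), (0.4, 0.4, 0.7, 0.7)]
tiny_object = [(0.5, 0.5, 0.52, 0.52)]   # far smaller than any anchor
large_object = [(0.4, 0.4, 0.7, 0.65)]   # close to the second anchor

matches_tiny = count_positive_anchors(anchors, tiny_object)
matches_large = count_positive_anchors(anchors, large_object)
```

If a count like this is zero for most training images, the anchor sizes or the matching threshold are worth revisiting before tuning anything else.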

Thank you!

lyltencent avatar Nov 21 '18 00:11 lyltencent

Attached is the loss during my training process: [image]

lyltencent avatar Nov 21 '18 00:11 lyltencent

@hbdong77, I don't have a solution. Do you have any ideas? Thanks.

lyltencent avatar Nov 21 '18 06:11 lyltencent

@HoracceFeng I found that deleting the already existing logs solves the "lhs shape not equal to rhs shape" problem.
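A plausible explanation, which this thread does not confirm, is that the training script resumes from the latest checkpoint found in the log directory, and that checkpoint was saved with a different num_classes. The sketch below lists files that look like checkpoint artifacts in a directory so they can be reviewed before retraining. The file naming ("checkpoint" and "model.ckpt*") is an assumption about how the checkpoints were saved, and the helper name is hypothetical.

```python
import os
import tempfile


def stale_checkpoint_files(train_dir):
    """Return file names in train_dir that look like TensorFlow checkpoint artifacts."""
    if not os.path.isdir(train_dir):
        return []
    return sorted(
        name for name in os.listdir(train_dir)
        if name == "checkpoint" or name.startswith("model.ckpt")
    )


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("checkpoint", "model.ckpt-100.index", "events.out.tfevents.1"):
        open(os.path.join(demo_dir, name), "w").close()
    found = stale_checkpoint_files(demo_dir)
```

Moving such files to a backup location, rather than deleting them outright, keeps the old run recoverable.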

donglin8506 avatar Jan 07 '19 03:01 donglin8506

I have used the SSD MobileNet v1 model for training on a new dataset (oranges). After training, the results are good for oranges, but the model also detects apples as oranges. How can I get rid of the other classes? How can I get the accuracy graph and the loss graph? One more thing: I keep getting a loss of 2 or 3.

MWaseemMatto avatar Apr 06 '19 21:04 MWaseemMatto

@villanuevab Can you tell me which versions of Python and TensorFlow you used? I need a correct, working environment. Thanks.

mathuse avatar Jul 22 '19 09:07 mathuse

@taras-sereda What environment are you using? Which Python and TensorFlow versions?

mathuse avatar Jul 22 '19 11:07 mathuse