The num_of_steps setting for Inception_v2

Open wesley-stone opened this issue 6 years ago • 3 comments

First of all, thank you very much. I noticed that 'num_steps' is not specified in the 'faster_rcnn_inception_resnet_v2_atrous_kitti.config' file. Does this mean it would train indefinitely? If so, could you share your experience on how many steps are enough to reach a stable loss?

wesley-stone avatar May 30 '18 07:05 wesley-stone

Yes, I think my loss stabilized after roughly 12 hours of training on one GPU.

sshleifer avatar May 30 '18 15:05 sshleifer

I have trained it for about 21 hours on one TITAN X GPU at 1.2 steps/second, but my loss still fluctuates between 0 and 1. Did you change any parameters in 'faster_rcnn_inception_resnet_v2_atrous_kitti.config', such as the learning rate? It seems that from step 0 to 900k the learning rate is a constant .0003.
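As a rough sanity check on those numbers, the sketch below converts wall-clock time and throughput into a step count, and estimates how long it would take to reach the first learning-rate drop at step 900000. The function names are made up for illustration, and it assumes the throughput stays at about 1.2 steps/second.

```python
def steps_completed(hours, steps_per_second):
    """Approximate number of training steps after `hours` of wall-clock time."""
    return round(hours * 3600 * steps_per_second)


def hours_until(target_step, steps_per_second):
    """Approximate wall-clock hours needed to reach `target_step` from step 0."""
    return target_step / steps_per_second / 3600


if __name__ == "__main__":
    # 21 hours at 1.2 steps/second, the figures quoted in this comment
    print(steps_completed(21, 1.2))
    # Hours needed to reach the first schedule boundary in the config
    print(round(hours_until(900_000, 1.2), 1))
```

At this rate, 21 hours lands close to the step counts in the log, while step 900000 would be more than a week of continuous training away.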

I found that training slows down significantly when eval.sh runs at the same time, so I am not running eval for now. Will this affect the result?

Thanks.

This is my current training loss state:

INFO:tensorflow:global step 95931: loss = 0.4842 (0.827 sec/step)
INFO:tensorflow:global step 95932: loss = 0.2304 (0.831 sec/step)
INFO:tensorflow:global step 95933: loss = 0.6756 (0.824 sec/step)
INFO:tensorflow:global step 95934: loss = 0.5103 (0.829 sec/step)
INFO:tensorflow:global step 95935: loss = 0.3497 (0.820 sec/step)
INFO:tensorflow:global step 95936: loss = 0.3261 (0.829 sec/step)
INFO:tensorflow:global step 95937: loss = 0.3748 (0.823 sec/step)
INFO:tensorflow:global step 95938: loss = 0.1620 (0.826 sec/step)
INFO:tensorflow:global step 95939: loss = 0.3487 (0.828 sec/step)
INFO:tensorflow:global step 95940: loss = 0.3864 (0.823 sec/step)
INFO:tensorflow:global step 95941: loss = 0.1237 (0.827 sec/step)
INFO:tensorflow:global step 95942: loss = 0.4237 (0.827 sec/step)
INFO:tensorflow:global step 95943: loss = 0.2671 (0.841 sec/step)
INFO:tensorflow:global step 95944: loss = 0.5672 (0.873 sec/step)
INFO:tensorflow:global step 95945: loss = 0.2411 (0.889 sec/step)
INFO:tensorflow:global step 95946: loss = 0.3034 (0.876 sec/step)
INFO:tensorflow:global step 95947: loss = 0.0378 (0.883 sec/step)
INFO:tensorflow:global step 95948: loss = 0.2312 (0.876 sec/step)
INFO:tensorflow:global step 95949: loss = 0.1306 (0.855 sec/step)
INFO:tensorflow:global step 95950: loss = 0.3180 (0.818 sec/step)

The default config in 'faster_rcnn_inception_resnet_v2_atrous_kitti.config' is:

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 0
            learning_rate: .0003
          }
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017/model.ckpt"
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
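To make the schedule in this config easier to read, here is a small plain-Python model of it. It is only a sketch and not the library's own code: `SCHEDULE` and `learning_rate_at` are hypothetical names, and it assumes the rate is piecewise constant and switches exactly at each listed step.

```python
import bisect

# (step, learning_rate) pairs copied from the schedule blocks in the config above.
SCHEDULE = [(0, 3e-4), (900_000, 3e-5), (1_200_000, 3e-6)]


def learning_rate_at(step, schedule=SCHEDULE):
    """Return the rate of the last schedule entry whose step is <= `step`.

    Assumption: the rate changes exactly at each boundary step and stays
    constant in between.
    """
    boundaries = [boundary for boundary, _ in schedule]
    index = bisect.bisect_right(boundaries, step) - 1
    return schedule[max(index, 0)][1]


if __name__ == "__main__":
    for step in (0, 95_950, 900_000, 1_200_000):
        print(step, learning_rate_at(step))
```

Under that assumption, every step in the log above (around 95,950) is still trained at the initial rate of .0003.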

wesley-stone avatar May 31 '18 05:05 wesley-stone

From here, it looks to me like you are evaluating loss on a per-image basis, which is not a very accurate proxy for your train loss over the whole dataset or for your validation loss. I'd recommend looking at some validation metrics on TensorBoard to figure out when to stop.
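Since the config uses batch_size: 1, each logged loss comes from a single image, so some of the fluctuation is expected. One way to see the trend in the console output is to average the logged values over windows of steps. The sketch below assumes log lines in the format shown earlier in this thread; `parse_losses` and `windowed_means` are illustrative helpers, not part of TensorFlow, and this does not replace validation metrics.

```python
import re
from statistics import mean

# Matches lines such as:
# INFO:tensorflow:global step 95931: loss = 0.4842 (0.827 sec/step)
LOG_PATTERN = re.compile(r"global step (\d+): loss = ([0-9.]+)")


def parse_losses(log_text):
    """Extract (step, loss) pairs from training log text."""
    return [(int(step), float(loss)) for step, loss in LOG_PATTERN.findall(log_text)]


def windowed_means(losses, window):
    """Average the loss over consecutive, non-overlapping windows of steps."""
    values = [loss for _, loss in losses]
    return [
        mean(values[start:start + window])
        for start in range(0, len(values) - window + 1, window)
    ]


if __name__ == "__main__":
    # First four lines of the log posted above
    sample = """\
INFO:tensorflow:global step 95931: loss = 0.4842 (0.827 sec/step)
INFO:tensorflow:global step 95932: loss = 0.2304 (0.831 sec/step)
INFO:tensorflow:global step 95933: loss = 0.6756 (0.824 sec/step)
INFO:tensorflow:global step 95934: loss = 0.5103 (0.829 sec/step)
"""
    print(windowed_means(parse_losses(sample), window=2))
```

In practice the window would be much larger (hundreds or thousands of steps) so that each average covers many different images.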

sshleifer avatar May 31 '18 17:05 sshleifer