tensorflow-deeplab-resnet

Loss not going down

Open vijayg78 opened this issue 7 years ago • 20 comments

Hi, I started training from scratch with train.py on the VOC2012 dataset. I downloaded the augmented ground truths (GTs) and plugged them into the dataset, so the GTs are now the augmented ones and the images are the original JPEG files from the dataset. The loss is not going down; it just oscillates. Any clue on how to get it working? Regards, Vijay

vijayg78 avatar Jul 05 '17 11:07 vijayg78

from scratch

Do you mean with a randomly initialised model?

DrSleep avatar Jul 06 '17 10:07 DrSleep

I used deeplab_resnet_init.ckpt and tried to run the train.py file. The loss was oscillating and not coming down at all. I also tried deeplab_resnet.ckpt, with the same behaviour.

vijayg78 avatar Jul 06 '17 12:07 vijayg78

I used the JPEGImages from VOCdevkit, and the GTs pointed to the augmented images I downloaded from this GitHub repository. That's correct, right?

vijayg78 avatar Jul 06 '17 14:07 vijayg78

Same problem for a model that doesn't use the deeplab_resnet.ckpt file for initialisation.

akshittyagi avatar Jul 06 '17 17:07 akshittyagi

What are the images in your TensorBoard after a few iterations?

DrSleep avatar Jul 07 '17 07:07 DrSleep

I have the same problem. I use my own dataset (3 classes) for training. The loss value oscillates around 1.2–1.3 and does not come down at all.

Hjy20255 avatar Jul 10 '17 14:07 Hjy20255

@DrSleep there are no images being produced in TensorBoard

akshittyagi avatar Jul 10 '17 15:07 akshittyagi

To all: the hyperparameters (learning rate, batch size, momentum, etc.) were chosen on Pascal VOC (for the procedure behind these choices, please refer to the original paper). The same hyperparameters will not necessarily suit other datasets, so it is your task to find an appropriate set for your own data.
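For reference, the DeepLab paper trains with a polynomial ("poly") learning-rate decay; a minimal sketch of that schedule (the base rate and power below follow the paper's defaults, but treat the exact values used by this repo as an assumption):

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial learning-rate decay as described in the DeepLab paper."""
    return base_lr * (1.0 - float(step) / max_steps) ** power

# With base_lr=2.5e-4 over 20k steps, the rate shrinks smoothly toward zero;
# a different dataset may need a different base_lr or step budget.
lr_start = poly_lr(2.5e-4, 0, 20000)      # full rate at the start
lr_mid = poly_lr(2.5e-4, 10000, 20000)    # reduced rate midway through
```

If the loss oscillates, lowering `base_lr` for your dataset is usually the first knob to try.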

This repository is a replication of an academic paper. Anything beyond that is a bonus (such as the ability to train on your own datasets).

DrSleep avatar Jul 12 '17 07:07 DrSleep

Okay, but the model also does not work on the VOC dataset when not using the pretrained .ckpt file.

akshittyagi avatar Jul 12 '17 07:07 akshittyagi

It works (proof, proof) on VOC with both pre-trained and non-pre-trained initialisation. Make sure that your setup is correct.

DrSleep avatar Jul 12 '17 07:07 DrSleep

I also ran into this problem. I use VOC2012 and the pretrained model.

wangruixing avatar Jul 18 '17 09:07 wangruixing

same here

dongzhuoyao avatar Aug 13 '17 07:08 dongzhuoyao

In my case I used my own dataset for training. At first I used train.py and the loss went down very slowly (from 10 to 8 over 60,000 steps). I then switched to train_msc.py and the loss began to go down quickly. The second script trained better than the first, since the final loss was much smaller (about 3 instead of 8 in my case).
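One difference worth noting: train_msc.py evaluates the network at several input scales and fuses the per-scale predictions. A toy NumPy sketch of max-fusion over score maps (function and variable names are hypothetical; the actual repo does this on resized network outputs inside the graph):

```python
import numpy as np

def fuse_multiscale(score_maps):
    """Fuse per-scale score maps (already resized to a common shape)
    by taking the elementwise maximum, as multi-scale DeepLab variants do."""
    return np.maximum.reduce(score_maps)

# Toy example: two 2x2 single-class score maps from two input scales.
a = np.array([[0.1, 0.9], [0.4, 0.2]])
b = np.array([[0.3, 0.5], [0.6, 0.1]])
fused = fuse_multiscale([a, b])  # elementwise max of a and b
```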

chenyuZha avatar Sep 22 '17 09:09 chenyuZha

May I know the final loss after running train.py for 20K iterations with deeplab_resnet_init.ckpt as a start? I used the PASCAL dataset and the final loss was about 1.3. It would help if you could share a graph of your training curve.

zhengyang-wang avatar Oct 02 '17 18:10 zhengyang-wang

Same here. With the default configuration and Pascal VOC, the loss oscillates between 1.2 and 1.3. Could someone plot the training curve, or share the loss values after 20K iterations, for example? Thanks!

ChuanWang90 avatar Nov 18 '17 04:11 ChuanWang90

Has anyone shared the loss after 20K iterations? It is about 1.18 on my machine. Does anyone know the reason?

FeiWard avatar Dec 27 '17 09:12 FeiWard

My loss stays at about 1.3 and the predicted images are all black, with no result. I use the default hyperparameters and the VOC2012 dataset with deeplab_resnet.ckpt as a start. Why doesn't it work?

EternityZY avatar Apr 29 '18 05:04 EternityZY

> My loss stays at about 1.3 and the predicted images are all black, with no result. I use the default hyperparameters and the VOC2012 dataset with deeplab_resnet.ckpt as a start. Why doesn't it work?

Hi, were you able to solve the issue?

PallawiSinghal avatar Dec 10 '19 18:12 PallawiSinghal

Hi, my loss does not change; it has become stagnant. I have tried everything related to DeepLabv3+ mentioned on every blog. I am training to detect roads. My images are 2000x2000, and my training set has 45k images, prepared in PASCAL VOC format. I have three kinds of pixels:

  • background = [0,0,0]
  • void class = [255,255,255]
  • road = [1,1,1]

so the number of classes = 3. I am using PASCAL VOC pre-trained weights.

The changes in train_util.py are:

```python
ignore_weight = 0
label0_weight = 10
label1_weight = 15
not_ignore_mask = (tf.to_float(tf.equal(scaled_labels, 1)) * label0_weight
                   + tf.to_float(tf.equal(scaled_labels, 2)) * label1_weight
                   + tf.to_float(tf.equal(scaled_labels, ignore_label)) * ignore_weight)
```
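The effect of that weighting can be checked in plain NumPy (same weight values as above; `scaled_labels` here is a toy array I made up for illustration):

```python
import numpy as np

ignore_label = 255
ignore_weight, label0_weight, label1_weight = 0, 10, 15

# Toy flattened label array: road(1), other(2), ignore(255), road, other.
scaled_labels = np.array([1, 2, 255, 1, 2])

# Mirrors the sum of tf.to_float(tf.equal(...)) * weight terms above.
not_ignore_mask = ((scaled_labels == 1) * label0_weight
                   + (scaled_labels == 2) * label1_weight
                   + (scaled_labels == ignore_label) * ignore_weight).astype(float)
```

Note that with these values any pixel not matching labels 1, 2, or the ignore label gets weight 0 and contributes nothing to the loss, which is worth double-checking when the loss flatlines.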

```python
# Variables that will not be restored.
exclude_list = ['global_step', 'logits']
if not initialize_last_layer:
    exclude_list.extend(last_layers)
```
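That exclusion list is ultimately used to filter checkpoint variables by name prefix before restoring. A pure-Python sketch of that filtering (variable names below are hypothetical; the real code does this through TensorFlow's variable-restore helpers):

```python
def variables_to_restore(all_var_names, exclude_list):
    """Keep only variables whose name does not start with an excluded prefix."""
    return [name for name in all_var_names
            if not any(name.startswith(prefix) for prefix in exclude_list)]

# Hypothetical checkpoint variable names.
names = ['global_step', 'logits/weights', 'logits/biases', 'resnet/conv1/weights']
restored = variables_to_restore(names, ['global_step', 'logits'])
```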

My train.py invocation:

```shell
nohup python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=65000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_batch_size=2 \
    --initialize_last_layer=False \
    --last_layers_contain_logits_only=True \
    --dataset="pascal_voc_seg" \
    --tf_initial_checkpoint="/data/old_model/models/research/deeplabv3_pascal_trainval/model.ckpt" \
    --train_logdir="/data/old_model/models/research/deeplab/mycheckpoints" \
    --dataset_dir="/data/models/research/deeplab/datasets/tfrecord" > my_output.log &
```

Please help 👍

```
INFO:tensorflow:global step 700: loss = 0.1759 (0.449 sec/step)
INFO:tensorflow:global step 710: loss = 0.1695 (0.655 sec/step)
INFO:tensorflow:global step 720: loss = 0.1742 (0.689 sec/step)
INFO:tensorflow:global step 730: loss = 0.1710 (0.505 sec/step)
INFO:tensorflow:global step 740: loss = 0.1708 (0.868 sec/step)
INFO:tensorflow:global step 750: loss = 0.1683 (0.632 sec/step)
INFO:tensorflow:global step 760: loss = 0.1692 (0.442 sec/step)
INFO:tensorflow:global step 770: loss = 0.1693 (0.597 sec/step)
INFO:tensorflow:global step 780: loss = 0.1665 (0.441 sec/step)
INFO:tensorflow:global step 790: loss = 0.1680 (0.548 sec/step)
INFO:tensorflow:global step 800: loss = 0.1708 (0.372 sec/step)
INFO:tensorflow:global step 810: loss = 0.1674 (0.327 sec/step)
INFO:tensorflow:global step 820: loss = 0.1666 (0.951 sec/step)
INFO:tensorflow:global step 830: loss = 0.1651 (0.557 sec/step)
INFO:tensorflow:global step 840: loss = 0.1663 (0.506 sec/step)
INFO:tensorflow:global step 850: loss = 0.1646 (0.446 sec/step)
INFO:tensorflow:global step 860: loss = 0.1666 (0.424 sec/step)
INFO:tensorflow:global step 870: loss = 0.1654 (0.520 sec/step)
INFO:tensorflow:global step 880: loss = 0.1662 (0.675 sec/step)
INFO:tensorflow:global step 890: loss = 0.1673 (0.325 sec/step)
INFO:tensorflow:global step 900: loss = 0.1633 (0.548 sec/step)
INFO:tensorflow:global step 910: loss = 0.1659 (0.374 sec/step)
INFO:tensorflow:global step 920: loss = 0.1639 (0.663 sec/step)
INFO:tensorflow:global step 930: loss = 0.1658 (0.442 sec/step)
INFO:tensorflow:global step 940: loss = 0.1654 (0.568 sec/step)
```

PallawiSinghal avatar Dec 11 '19 12:12 PallawiSinghal

@PallawiSinghal Did you find a solution to your problem?

subbulakshmisubha avatar May 21 '20 14:05 subbulakshmisubha