PSPNet-tensorflow
Prediction always zero
I am training the model with 480x360 images and 3 classes (grayscale labels with values 0, 1, 2). During training the loss decreases and everything seems to work, but during inference the model always predicts 0. Does anybody have the same problem?
I only modified the input_size to 512x512, because my GPU has just 6GB of memory, and my training result is very bad too: nothing is predicted correctly. I don't know what is wrong. Can you give me an e-mail address? Maybe we can talk about it.
Same as you guys. I reduced the classes to 2 and visualized the predictions each epoch, but no improvement happened.
Hey guys, maybe the problem is that you need to update the moving mean/variance first, and then use the new moving mean/variance while training the beta & gamma variables. I have successfully trained on my own dataset with the ICNet model. Do you train with the flag --update-mean-var?
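For context, the moving mean/variance that --update-mean-var maintains is just an exponential moving average of the per-batch statistics. A minimal NumPy sketch of that update (the decay value here is an assumption, not necessarily what this repo uses):

```python
import numpy as np

def update_moving_stats(x, moving_mean, moving_var, decay=0.997):
    """One BN moving-average update from a batch x of shape (N, C)."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    new_mean = decay * moving_mean + (1 - decay) * batch_mean
    new_var = decay * moving_var + (1 - decay) * batch_var
    return new_mean, new_var

# after enough updates, the moving stats approach the true data statistics
mean, var = np.zeros(3), np.ones(3)
rng = np.random.default_rng(0)
for _ in range(5000):
    batch = rng.normal(loc=2.0, scale=1.0, size=(16, 3))
    mean, var = update_moving_stats(batch, mean, var)
```

Until these statistics have converged, inference (which uses the moving stats, not the batch stats) can disagree badly with training, which is one plausible cause of the "loss decreases but prediction is constant" symptom.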
@hellochick, thanks. Maybe I forgot to add this parameter.
I trained the model on the Cityscapes dataset. During training I reduced the lr from 1e-3 to 1e-5, but the loss stayed high (around 2.5; on two-class data it could only reach 2.1) and would not decrease after more than 20 epochs. Therefore the test result was not accurate.
Do I need to train for more epochs or adjust other parameters? Thank you very much.
Hey @anqingjianke, could you tell me how you process the data into only 2 classes? Maybe the problem occurs there. Btw, after you update the mean/var for a period of time, you can try to fix them and then keep training the gamma & beta variables, like this:
python train.py --update-mean-var --train-beta-gamma
Then do
python train.py --train-beta-gamma
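The two flags effectively select different trainable-variable lists. A rough sketch of that selection logic (the variable names here are illustrative, not the repo's exact graph names):

```python
# illustrative graph variable names; the real ones come from the model definition
all_vars = [
    "conv1/weights", "conv1/biases",
    "conv1_bn/beta", "conv1_bn/gamma",
    "conv1_bn/moving_mean", "conv1_bn/moving_variance",
]

def select_trainable(var_names, train_beta_gamma):
    """Moving stats are never trained by the optimizer; beta/gamma only when flagged."""
    selected = []
    for name in var_names:
        if "moving_" in name:
            continue  # updated via the EMA update ops, not by gradient descent
        if ("beta" in name or "gamma" in name) and not train_beta_gamma:
            continue
        selected.append(name)
    return selected

print(select_trainable(all_vars, train_beta_gamma=True))
print(select_trainable(all_vars, train_beta_gamma=False))
```

So the first command trains everything (including beta/gamma) while also running the moving-stat updates, and the second keeps training beta/gamma with the moving stats frozen.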
I classified car, bus and truck into one class and regarded everything else as the second class. By training the gamma & beta variables after some epochs, the loss can decrease. I think I need to research the reason; anyhow, it works. That is great. Thank you for sharing your code.
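That kind of label merging can be done with a simple remap. A NumPy sketch (assuming Cityscapes trainIds 13/14/15 for car/truck/bus — check your own label encoding before reusing this):

```python
import numpy as np

VEHICLE_IDS = [13, 14, 15]  # assumed trainIds for car, truck, bus

def to_two_classes(label):
    """Map an (H, W) trainId label map to {0: other, 1: vehicle}."""
    return np.isin(label, VEHICLE_IDS).astype(np.uint8)

lab = np.array([[0, 13],
                [15, 7]], dtype=np.uint8)
print(to_two_classes(lab))  # [[0 1]
                            #  [1 0]]
```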
@anqingjianke do you get any good results with this procedure? For how many steps do you need to update mean and variance before training beta and gamma?
@psuff I get a much better result than before. The training error becomes lower, but the eval error is still not very low; anyhow, I can accept it. Last time I used the Cityscapes dataset and trained for 20000 steps, maybe 4 epochs? Then I trained beta and gamma. I am not sure whether doing it the way I did gives the best result.
@anqingjianke, here is another piece of advice: after you train beta & gamma, you can fix all of them and fine-tune the model by just running train.py. Maybe it can be much better.
@anqingjianke when I use MomentumOptimizer I also find that the loss can only reduce to about 1.6 for a long, long time, and the test accuracy is low. When I use AdamOptimizer I can reduce the loss to 0.5, but the test accuracy is only about 50 mIoU. Did you finally manage to retrain successfully from scratch?
@hellochick Hi, I followed your advice and ran python train.py --update-mean-var --train-beta-gamma and then python train.py --train-beta-gamma. In the end the model failed. I am confused: in TensorFlow, when training BN layers you must use --update-mean-var, i.e. something like
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
Without --update-mean-var it is wrong. How can you successfully train the model?
To verify this I used the released model: it works well in inference, with a training loss of 0.5, while when I fine-tune the model without --update-mean-var the loss immediately becomes 2.5. Is there anything wrong?
And my GPU has 24GB, so I can train with a batch of 4 images.
Hey @manutdzou, I am not sure whether this code can reproduce the performance of the original work, because the authors haven't provided the code for the training phase; I can only do my best to guess what they did. I have tried what you said: training from the released model without --update-mean-var, the loss is still around 0.2, as you can see in the following image.
Is there a misunderstanding between us? If you find anything wrong with my training code, just tell me. Thank you!
@hellochick Can you use the checkpoint saved from training without --update-mean-var to evaluate the data? Is the result still right?
This is my result with updated mean/var and trained beta and gamma. The training loss is below:
step 33100 loss = 0.230, (2.569 sec/step)
step 33200 loss = 0.362, (2.572 sec/step)
step 33300 loss = 0.214, (2.551 sec/step)
step 33400 loss = 0.266, (2.541 sec/step)
step 33500 loss = 0.207, (2.561 sec/step)
step 33600 loss = 0.250, (2.537 sec/step)
step 33700 loss = 0.267, (2.548 sec/step)
step 33800 loss = 0.205, (2.570 sec/step)
step 33900 loss = 0.212, (2.560 sec/step)
step 34000 loss = 0.202, (2.603 sec/step)
step 34100 loss = 0.213, (2.554 sec/step)
step 34200 loss = 0.263, (2.570 sec/step)
step 34300 loss = 0.252, (2.554 sec/step)
step 34400 loss = 0.201, (2.575 sec/step)
step 34500 loss = 0.226, (2.566 sec/step)
step 34600 loss = 0.314, (2.561 sec/step)
step 34700 loss = 0.167, (2.541 sec/step)
step 34800 loss = 0.194, (2.576 sec/step)
step 34900 loss = 0.203, (2.633 sec/step)
step 35000 loss = 0.231, (2.556 sec/step)
step 35100 loss = 0.265, (2.732 sec/step)
step 35200 loss = 0.215, (2.555 sec/step)
step 35300 loss = 0.214, (2.569 sec/step)
step 35400 loss = 0.290, (2.565 sec/step)
step 35500 loss = 0.188, (3.461 sec/step)
step 35600 loss = 0.269, (2.582 sec/step)
step 35700 loss = 0.230, (2.554 sec/step)
step 35800 loss = 0.247, (2.562 sec/step)
step 35900 loss = 0.240, (2.558 sec/step)
step 36000 loss = 0.249, (2.562 sec/step)
step 36100 loss = 0.211, (2.636 sec/step)
step 36200 loss = 0.213, (2.554 sec/step)
step 36300 loss = 0.191, (2.554 sec/step)
step 36400 loss = 0.211, (7.054 sec/step)
step 36500 loss = 0.248, (2.561 sec/step)
The evaluation result is:
step 330 mIoU: 0.588765621185
Finish 340/500 step 340 mIoU: 0.589641332626
Finish 350/500 step 350 mIoU: 0.589398264885
Finish 360/500 step 360 mIoU: 0.592406332493
Finish 370/500 step 370 mIoU: 0.591849982738
Finish 380/500 step 380 mIoU: 0.598852872849
Finish 390/500 step 390 mIoU: 0.598673403263
Finish 400/500 step 400 mIoU: 0.599280178547
Finish 410/500 step 410 mIoU: 0.598650693893
Finish 420/500 step 420 mIoU: 0.598635554314
Finish 430/500 step 430 mIoU: 0.597562968731
Finish 440/500 step 440 mIoU: 0.596583664417
Finish 450/500 step 450 mIoU: 0.595161557198
Finish 460/500 step 460 mIoU: 0.594586551189
Finish 470/500 step 470 mIoU: 0.59452599287
Finish 480/500 step 480 mIoU: 0.595708370209
Finish 490/500 step 490 mIoU: 0.594085931778
step 499 mIoU: 0.59619218111
I want to know why the result is not good, and how to reproduce the original training process. Thank you. In addition, I found that your training speed is faster than mine; what is your GPU card? Mine is a P40.
I'm not sure whether PSPNet can be trained from scratch on a different dataset (with a different number of labels). The batch size can only be set to 1 or 2, and the training loss does not decrease after a few steps. I guess the original authors of PSPNet used massive parallel processing to train on these datasets.
Had the same problem. My ground-truth labels had a color per class. I changed this to grayscale: (0, 0, 0) for background, (1, 1, 1) for class 1, etc. I also use the moving average and update beta/gamma.
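That color-to-index conversion can be scripted. A NumPy sketch (the palette below is made up for illustration; substitute your dataset's actual class colors):

```python
import numpy as np

# hypothetical palette: background black, class 1 red, class 2 green
PALETTE = {(0, 0, 0): 0, (255, 0, 0): 1, (0, 255, 0): 2}

def color_to_index(rgb_label):
    """Convert an (H, W, 3) color label image into an (H, W) class-index map."""
    h, w, _ = rgb_label.shape
    index = np.zeros((h, w), dtype=np.uint8)
    for color, cls in PALETTE.items():
        mask = np.all(rgb_label == np.array(color, dtype=np.uint8), axis=-1)
        index[mask] = cls
    return index

rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[0, 1] = (255, 0, 0)
rgb[1, 0] = (0, 255, 0)
print(color_to_index(rgb))
```

If the loss decreases but predictions are constant, it is worth verifying that the loaded labels really contain the values 0..NUM_CLASSES-1 and not raw color intensities.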
BATCH_SIZE = 16
LEARNING_RATE = 1e-3
MOMENTUM = 0.9
NUM_CLASSES = 19
NUM_STEPS = 60001
POWER = 0.9
RANDOM_SEED = 1234
WEIGHT_DECAY = 0.0001
PRETRAINED_MODEL =
SNAPSHOT_DIR = './snapshots/'
SAVE_NUM_IMAGES = 4
SAVE_PRED_EVERY = 50
This code is in train.py, but I cannot train. How should I set PRETRAINED_MODEL =? The model comes as four checkpoint files; how do I load them?
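On the four-file question: a TF1 checkpoint is restored by its common path prefix (e.g. model.ckpt), not by any individual file. A small helper that recovers that prefix from a directory listing (file names below are illustrative):

```python
def checkpoint_prefix(filenames):
    """Return the prefix to pass to Saver.restore, e.g. 'model.ckpt'."""
    for name in filenames:
        if name.endswith(".index"):
            return name[: -len(".index")]
    raise ValueError("no .index file found")

files = [
    "checkpoint",
    "model.ckpt.index",
    "model.ckpt.meta",
    "model.ckpt.data-00000-of-00001",
]
print(checkpoint_prefix(files))  # model.ckpt
```

So PRETRAINED_MODEL should point at that prefix inside the snapshot directory, and the .index/.meta/.data files are found automatically.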
@manutdzou Hello, I'm wondering how to change MomentumOptimizer to AdamOptimizer, could you please give me a hand?
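In a TF1 training script this usually amounts to replacing the tf.train.MomentumOptimizer constructor with tf.train.AdamOptimizer where the train op is built. As a framework-independent illustration of what the two update rules actually do, here is a toy 1-D comparison (hand-rolled updates with made-up hyperparameters, not the repo's code):

```python
import math

def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """Classic momentum (heavy-ball) update."""
    v = mu * v + grad
    return w - lr * v, v

def adam_step(w, m, s, grad, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam update with bias-corrected first/second moments."""
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(s_hat) + eps), m, s

# minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w_mom, v = 10.0, 0.0
w_adam, m, s = 10.0, 0.0, 0.0
for t in range(1, 501):
    w_mom, v = momentum_step(w_mom, v, 2 * (w_mom - 3))
    w_adam, m, s = adam_step(w_adam, m, s, 2 * (w_adam - 3), t)
```

Adam's per-parameter step normalization is one plausible reason the loss drops faster with it, as reported above, though that does not by itself explain the low test mIoU.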
@hellochick Hello, I am trying to fine-tune the ade20k model on the ScanNet dataset. As you said, first I trained with "train.py --update-mean-var"; after 10 epochs the mIoU is about 0.3. Then I used "train.py --update-mean-var --train-beta-gamma". I find that the loss is not really lower than before, and the mIoU is lower than 0.3. I really don't know why. Could you please give me a hand? I'm really looking forward to your reply. Thanks~
@hellochick why did you make the base learning rate 1e-3, not 1e-2 as mentioned in the PSPNet paper?
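For reference, both the paper and the config in this thread use the "poly" schedule, which scales the base rate by (1 - step/max_steps)^power. A minimal sketch, using the POWER and NUM_STEPS values quoted above:

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """'Poly' learning-rate decay as described in the PSPNet paper."""
    return base_lr * (1.0 - float(step) / max_steps) ** power

# with base_lr = 1e-3 the rate decays smoothly from 1e-3 toward 0
print(poly_lr(1e-3, 0, 60001))      # 0.001
print(poly_lr(1e-3, 30000, 60001))  # roughly half-way through the decay
```

So the choice of base rate shifts the whole curve; a 1e-2 base with this schedule would keep the rate roughly 10x higher at every step.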
@anqingjianke hello, were you able to train PSPNet to classify two classes? I trained PSPNet on my own dataset, but the result is always zero; the same dataset can be segmented with SegNet. So I want to know whether PSPNet is suitable for two-class segmentation. Please reply, thank you.