
README instructions not working for training on my own dataset

ogail opened this issue · 46 comments

Hi, I tried to follow the README instructions for training on my own dataset, but it didn't work. Here is what I did:

  • Update DATA_DIR to point to dataset dir
  • Update DATA_LIST_PATH to point to train dataset list file.
  • Update INPUT_SIZE to '1280, 720'
  • Update NUM_CLASSES to 1
  • Update LAMBDA1 and LAMBDA2 to 0.4 and 0.6 respectively.

Then I ran this command:

python train.py --update-mean-var --train-beta-gamma
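For reference, the bullet-point updates above amount to editing the constants near the top of train.py; a minimal sketch (the paths are placeholders, the other values come from the steps above):

```python
# Sketch of the constants changed at the top of train.py.
# The two path values below are placeholders, not real paths.
DATA_DIRECTORY = '/path/to/my_dataset'
DATA_LIST_PATH = '/path/to/my_dataset/train_list.txt'
INPUT_SIZE = '1280,720'   # image size fed to the input pipeline
NUM_CLASSES = 1
LAMBDA1 = 0.4             # weight of one branch loss
LAMBDA2 = 0.6             # weight of another branch loss
```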

Then I got this error (shortened):

ValueError: Dimension 3 in both shapes must be equal, but are 1 and 19. Shapes are [1,1,128,1] and [1,1,128,19]. for 'conv6_cls_1/Assign' (op: 'Assign') with input shapes: [1,1,128,1], [1,1,128,19].

Troubleshooting (none of that worked):

  • Tried to follow the advice from https://github.com/hellochick/ICNet-tensorflow/issues/20 by doing the following:
  • Updating icnet_cityscapes_bnnomerge.prototxt by changing conv6_cls num_output from 19 to 1
  • Then replaced this line in train.py
 restore_var = tf.global_variables()

with

restore_var = [v for v in tf.global_variables() if 'conv6_cls' not in v.name]
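For illustration, the effect of that filter can be shown with stand-in objects in place of real tf.Variables (a sketch; only the .name attribute matters here, and the variable names are examples):

```python
from types import SimpleNamespace

# Stand-ins for tf.global_variables(); only .name matters for the filter
all_vars = [
    SimpleNamespace(name='conv1_1_3x3_s2/weights:0'),
    SimpleNamespace(name='conv6_cls/weights:0'),   # 19-class output layer
    SimpleNamespace(name='conv6_cls/biases:0'),
]

# Same comprehension as in train.py: drop the classification layer so its
# checkpoint shape (..., 19) is never assigned to a (..., 1) variable
restore_var = [v for v in all_vars if 'conv6_cls' not in v.name]
```

Note that in this repo the initial weights come from the net.load(...) call on the .npy file rather than from the Saver's restore_var list, which is likely why filtering restore_var alone does not prevent the shape-mismatch Assign (a fix inside the load function itself appears later in this thread).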

I then got the exact same error mentioned above.

If anyone has been able to train on their own dataset (either using a pretrained model or from scratch), please share the steps you took.

Thanks

ogail avatar Mar 07 '18 00:03 ogail

Hey @ogail, by default the script loads the pre-trained model and keeps fine-tuning from it. However, the pre-trained Cityscapes model has 19 classes, while your dataset has only 1. You can comment out line 191 to solve the problem and train from scratch.

hellochick avatar Mar 07 '18 01:03 hellochick

  • I think you mean line 189 which is: net.load(args.restore_from, sess)

I tried it, and it results in the loss being 'nan'.

  • I also tried to load from a saved checkpoint (instead of the numpy weights); however, the loss stayed fixed at 0.511, and sub4 = 0.000, sub24 = 0.000, sub124 = 0.000 did not change at all.

Any ideas?

ogail avatar Mar 07 '18 01:03 ogail

Before that, I want to know what your dataset looks like, can you show some examples? If there is only one class, there is nothing to train, am I right?

hellochick avatar Mar 07 '18 01:03 hellochick

The dataset has 2 classes, obstacle (0) and non-obstacle (255), in a binary format. Here is an example of a raw image: [img_00002]. This is an example of a label image (similar to the *labelTrainIds* images in Cityscapes): [img_00002].

Think of this as semantic segmentation with two labels (background and foreground). Hope it makes sense. FYI, I set IGNORE_LABEL to 0.
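A label layout like that can be remapped to contiguous train IDs before training; a minimal numpy sketch (the function name is hypothetical), assuming masks store 0 for obstacle and 255 for non-obstacle as described above:

```python
import numpy as np

def to_train_ids(mask):
    """Map a raw {0, 255} binary mask to train IDs {0, 1}.

    0   (obstacle)     -> class 0
    255 (non-obstacle) -> class 1
    Any other value stays outside {0, 1}, so an IGNORE_LABEL chosen
    outside that range (e.g. 100, as used later in this thread) is safe.
    """
    out = mask.copy().astype(np.uint8)
    out[mask == 255] = 1
    return out

raw = np.array([[0, 255], [255, 0]], dtype=np.uint8)
train_ids = to_train_ids(raw)   # [[0, 1], [1, 0]]
```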

ogail avatar Mar 07 '18 01:03 ogail

It makes sense to me. For this case, I think it's difficult to learn to detect obstacles, since the obstacles contain several different kinds of objects. Hence, I think you could restore a model pre-trained on ImageNet or on ADE20k segmentation, and set the learning rate much lower for this task.

Btw, I have tried obstacle detection before; you can refer to Indoor Segmentation. In that project I detect obstacles by training on ADE20k, and I compressed num_classes from 150 to 27, just for your reference.

hellochick avatar Mar 07 '18 02:03 hellochick

I'm trying to do something similar with the LFW dataset http://vis-www.cs.umass.edu/lfw/part_labels/. I've set num_classes to 3 and rearranged the masks so that each mask is a grayscale image where 0 is hair, 1 is face, and 2 is background. I also removed the net.load line from the code. The error I'm getting occurs when the line loss = tf.nn.sparse_softmax_cross_entropy_with_logits is called: ValueError: Rank mismatch: Rank of labels (received 1) should equal rank of logits minus 1 (received 1).

Can you please explain what the function create_loss expects as input? What are the shapes of output and label? When I try it I get a label of shape (16, 250, 250, 1) and an output of shape (16, 15, 15, 3); after reshaping, raw_pred has shape (10800,) but label has shape (3600,). There is a mismatch here, and I suspect it's why the function fails, but I can't work out what to do.
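For context, tf.nn.sparse_softmax_cross_entropy_with_logits expects labels whose rank is one less than the logits' rank and whose spatial size matches the logits. A shape-only numpy sketch of the downsample-and-squeeze step that has to happen first (the helper name is made up; it mirrors what a create_loss-style function must do before flattening):

```python
import numpy as np

def prepare_label_shapes(label, logits_shape):
    """Nearest-neighbour downsample of (N, H, W, 1) integer labels to the
    logits' spatial size (h, w), then drop the channel axis so that
    rank(labels) == rank(logits) - 1, as the loss op requires."""
    n, h, w, num_classes = logits_shape
    lh, lw = label.shape[1], label.shape[2]
    ys = np.arange(h) * lh // h          # source rows to sample
    xs = np.arange(w) * lw // w          # source cols to sample
    return label[:, ys][:, :, xs, 0]     # shape (N, h, w)

label = np.zeros((16, 250, 250, 1), dtype=np.int64)   # shapes from the question
logits = np.zeros((16, 15, 15, 3))
labels_ready = prepare_label_shapes(label, logits.shape)
# labels_ready.shape == (16, 15, 15)
```

With labels at (16, 15, 15), flattening yields 3600 label values against logits reshaped to (3600, 3), which is the pairing the op expects; the (10800,) raw_pred in the question suggests the logits were flattened to one dimension instead of to (-1, 3).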

Danzip avatar Mar 07 '18 11:03 Danzip

@hellochick I finally got it working, here are steps I did:

  • commenting net.load line
  • Setting number of classes to 2
  • Setting IGNORE_LABEL to an arbitrary number other than 0 or 255 (I set it to 100). Then I trained the network and got good prediction results (I had to update inference.py and tools.py to get this working): [img_00197]. Here is the original image: [img_00197].
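The IGNORE_LABEL choice matters because pixels carrying that value are dropped from the loss; a numpy sketch of the filtering (the repo's create_loss does the equivalent with TF ops), using the values from the steps above:

```python
import numpy as np

NUM_CLASSES = 2
IGNORE_LABEL = 100   # outside the real label values {0, 1}

labels = np.array([0, 1, 100, 1, 100, 0])   # flattened ground truth
keep = labels != IGNORE_LABEL               # pixels that count toward the loss
valid_labels = labels[keep]                 # -> [0, 1, 1, 0]

# Had IGNORE_LABEL stayed at 0 (its original setting above), every
# background pixel would have been dropped from the loss as well.
```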

What I did for training is the following:

  • Run python train.py for 8 hrs until loss reached 0.281, then stopped.
  • Run python train.py --update-mean-var --train-beta-gamma (still running); the loss has dropped to 0.27 and is continuing down.

When you trained on other datasets, how did you use train.py versus train.py --update-mean-var --train-beta-gamma (i.e. for how long, and for what purpose)?

ogail avatar Mar 07 '18 14:03 ogail

@ogail Thank you for the information you provided. Any chance you could make your script public? It would help us a lot. Thank you in advance.

bhadresh74 avatar Apr 09 '18 16:04 bhadresh74

@bhadresh74 is there a specific question you have?

ogail avatar Apr 09 '18 16:04 ogail

@ogail Yes, a couple of them actually.

  1. I trained on two classes but my loss seems to be stuck at 0.6 and not going down. Here are my hyperparameters: batch size 64, steps 60000; the others are as given in the repo.

  2. During inference, how can I extract the probability for each class? The given code returns probability 0 for each pixel for some reason. How did you extract the softmax logits?

Thank you

bhadresh74 avatar Apr 09 '18 16:04 bhadresh74

@bhadresh74 Here are some suggestions:

  1. Getting the loss down to 0.6 is a good sign; pushing it lower will require some tinkering, like:

  • increasing the number of training steps
  • increasing the batch size
  • checking whether the ground-truth labels have errors that are consistently hurting training.

  2. I have not tried to extract the probabilities before.

ogail avatar Apr 09 '18 22:04 ogail

Hi, I would like to ask you a question, @ogail, since I had the same problems. I see that you have done the following:

  • commenting net.load line
  • Setting number of classes to 2
  • Setting IGNORE_LABEL to arbitrary number not 0 or 255 (i set it to 100)

But have you also made the changes that you stated at the beginning? Mainly:

  • Updating icnet_cityscapes_bnnomerge.prototxt by changing conv6_cls num_output from 19 to 1
  • Then replaced this line in train.py
    

restore_var = tf.global_variables()

with

restore_var = [v for v in tf.global_variables() if 'conv6_cls' not in v.name]

BCJuan avatar Apr 16 '18 09:04 BCJuan

@BCJuan excuse me for the late reply. Yes, I made both of those changes as well.

ogail avatar Apr 18 '18 14:04 ogail

@hellochick @ogail hello, my question is: if my dataset has 2 classes, can I only use this network by training from scratch? Can't I reuse the earlier layers of a pre-trained model, or train just the last cls layer? In my experiments with Caffe I could train from pre-trained models. I am not familiar with TF, but the DeepLab v3+ TensorFlow implementation also supports training only the last layer.

qmy612 avatar Apr 28 '18 09:04 qmy612

Yes, you will have to train from scratch

ogail avatar Apr 30 '18 13:04 ogail

Hi, in response to @qmy612 (also @ogail): you can indeed use the pretrained model.

I achieved it yesterday doing the following:

  • As in #20: update icnet_cityscapes_bnnomerge.prototxt by changing conv6_cls num_output from 19 to your number of classes (this is from @ogail's initial question).
  • Then go to network.py, to the load function of class Network, and add the line if 'conv6_cls' not in var.name: before the line session.run(var.assign(data)). Also change ignore_missing to True.

The function should look something like:

def load(self, data_path, session, ignore_missing=True):
    # Weights are stored as a nested dict: {op_name: {param_name: ndarray}}
    data_dict = np.load(data_path, encoding='latin1').item()
    for op_name in data_dict:
        with tf.variable_scope(op_name, reuse=True):
            for param_name, data in data_dict[op_name].items():
                try:
                    if 'bn' in op_name:
                        param_name = BN_param_map[param_name]

                    var = tf.get_variable(param_name)
                    # Skip the classification layer so its resized output
                    # never collides with the 19-class checkpoint weights
                    if 'conv6_cls' not in var.name:
                        session.run(var.assign(data))
                except ValueError:
                    if not ignore_missing:
                        raise

Then, you can make the change stated in #20, I mean changing:

restore_var = tf.global_variables()

by

restore_var = [v for v in tf.global_variables() if 'conv6_cls' not in v.name]

or not.

Indeed, it would have the same effect, since you have not loaded conv6_cls, the last (classification) layer of the net, from the pretrained model.

Hope this helps.

BCJuan avatar Apr 30 '18 17:04 BCJuan

@BCJuan did fine-tuning from the pretrained model boost results on your custom task? Have you tried comparing that against training from scratch?

ogail avatar Apr 30 '18 20:04 ogail

Yes, it boosted the results. Indeed, I was not getting any good results without the pretrained model.

I used the icnet_cityscapes_trainval_bnomerge_90k, but I think that any other model can be used.

BCJuan avatar Apr 30 '18 20:04 BCJuan

@BCJuan what's the mIoU before and after using the Cityscapes pretrained model?

ogail avatar Apr 30 '18 22:04 ogail

@BCJuan I did load the pretrained model, but I didn't see much difference between fine-tuning and training from scratch.

ogail avatar May 01 '18 03:05 ogail

@ogail I do not know, since I am only fine-tuning. But in a one-hour run I achieve around 20% mIoU with the pretrained model, versus 6% without it. Maybe I am doing something wrong.

BCJuan avatar May 01 '18 10:05 BCJuan

@BCJuan Thank you very much, I will try tomorrow.

qmy612 avatar May 01 '18 13:05 qmy612

I'll try fine-tuning too and will report the results. In my experience, fine-tuning always gives the model's generalization a boost, so it's worth trying.

seovchinnikov avatar May 23 '18 18:05 seovchinnikov

Hi @ogail, thank you very much for sharing your training steps with us. Recently I needed to solve the same problem as you; I set my network parameters the same as yours, and the loss got down to 0.17 and kept going down. However, when I run inference, the resulting images come out entirely as 0s or 1s, which doesn't seem right. Did you have this problem? Thank you!

VincentGu11 avatar Jun 01 '18 06:06 VincentGu11

@ogail Have you tried training it on multiple classes? What annotation tool can I use for labeling multiple classes? Also, what accuracy and fps are you getting in evaluation?

PratibhaT avatar Jun 18 '18 22:06 PratibhaT

@VincentGu11 are the 0s and/or 1s how the final rendered image looks? There's a function decode_label that converts training indices to RGB colors.
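A minimal sketch of what such an index-to-RGB decode looks like (the palette values here are hypothetical; the repo's actual function uses the dataset's own palette):

```python
import numpy as np

# Hypothetical 2-class palette: class 0 -> black, class 1 -> red
PALETTE = np.array([[0, 0, 0], [255, 0, 0]], dtype=np.uint8)

def decode_label(mask):
    """Turn an (H, W) array of train IDs into an (H, W, 3) RGB image."""
    return PALETTE[mask]   # fancy indexing maps each ID to its color

ids = np.array([[0, 1], [1, 0]], dtype=np.int64)
rgb = decode_label(ids)    # shape (2, 2, 3)
```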

@PratibhaT yes, I tried. You could use the labelme tool. Accuracy and fps depend on the data and the problem, so my numbers won't be relevant in a general sense.

ogail avatar Jun 18 '18 22:06 ogail

@ogail I used the VIA annotation tool, which produces a .json file. But in this code, list.txt refers to .png images for the labels. Is there a way to convert the .json annotation files to .png labels? What is the output of the labelme tool?
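For what it's worth, one way to bridge that gap is to rasterize the VIA polygons into a label PNG yourself; a hedged sketch using Pillow (the function name is made up, and the JSON key names follow VIA's polygon export format, so verify them against your own file):

```python
import json
from PIL import Image, ImageDraw

def via_json_to_png(json_path, out_path, size, fill=1):
    """Rasterize VIA polygon regions into a single-channel label PNG.

    Sketch only: the key names ('regions', 'shape_attributes',
    'all_points_x'/'all_points_y') are assumed from VIA's polygon export;
    check them against your .json before relying on this.
    """
    with open(json_path) as f:
        meta = json.load(f)
    mask = Image.new('L', size, 0)            # 0 = background everywhere
    draw = ImageDraw.Draw(mask)
    for entry in meta.values():               # one entry per annotated image
        for region in entry.get('regions', []):
            shape = region['shape_attributes']
            if shape.get('name') != 'polygon':
                continue                      # skip circles/rects in this sketch
            points = list(zip(shape['all_points_x'], shape['all_points_y']))
            draw.polygon(points, fill=fill)   # burn the class id into the mask
    mask.save(out_path)
```

For multiple classes you would call draw.polygon with a different fill per class, derived from each region's attributes.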

PratibhaT avatar Jun 19 '18 03:06 PratibhaT

@ogail I am training on my own dataset, which has 8 classes. I made all the required changes mentioned above, but I am still getting the following error:

Assign requires shapes of both tensors to match. lhs shape= [8] rhs shape= [19]

Is there some particular change that I missed?

adisrivasa avatar Jun 22 '18 07:06 adisrivasa

@qmy612, can you share some details about your training with the Caffe framework? I am having problems training with the matcaffe I downloaded.

Soulempty avatar Jun 25 '18 01:06 Soulempty

@ogail Thank you for the information you provided. Can I ask you two questions?

  1. Did you use ADE20k or any other pre-trained model to fine-tune when training your own dataset?

  2. What is the basis for setting the IGNORE_LABEL value?

Looking forward to your answer.

yeyuanzheng177 avatar Jun 27 '18 13:06 yeyuanzheng177