tensorflow-deeplab-v3-plus

Training with a different number of classes

UR11EC017 opened this issue 6 years ago • 7 comments

First of all, I would like to thank you for your great work!

I am a beginner and currently trying to use the PASCAL VOC-trained DeepLab V3+ model that you have provided in the repository to train on my own dataset, which has a different number of classes.

Please guide me through the changes required to make it happen.

UR11EC017 avatar May 19 '18 15:05 UR11EC017

https://stackoverflow.com/questions/47867748/transfer-learning-with-tf-estimator-estimator-framework will be helpful, as this implementation uses the TF Estimator API.
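For instance, here is a minimal sketch of warm-starting an Estimator from a checkpoint while skipping the class-dependent final layer (assuming TF >= 1.6; the checkpoint path, model_fn, and the layer-name regex are illustrative placeholders, not code from this repo):

```python
import tensorflow as tf

# Warm-start every variable except the final logits layer, whose shape
# depends on the number of classes and therefore cannot be reused.
ws = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from='./model/model.ckpt',
    vars_to_warm_start='^(?!.*upsampling_logits/conv_1x1).*$')

estimator = tf.estimator.Estimator(
    model_fn=model_fn,           # your model_fn
    model_dir='./my_model_dir',
    warm_start_from=ws)
```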

AshAswin avatar May 23 '18 12:05 AshAswin

Hi @UR11EC017 , thank you for your interest in the repo.

Training with a different number of classes is straightforward. First, change _NUM_CLASSES in the code to the number of classes in your dataset. Then, modify the color map defined here accordingly.
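For reference, a minimal sketch of those two edits for a hypothetical 2-class (background/foreground) dataset; _NUM_CLASSES comes from train.py, while the helper name and colors below are illustrative placeholders:

```python
import numpy as np

# In train.py: set the class count for your dataset
# (e.g. 2 for binary background/foreground segmentation).
_NUM_CLASSES = 2

# Color map: one RGB triple per class index, used when decoding
# predictions into color images. This 2-class map is only an
# example; match it to your own labels.
def create_label_colormap():
    return np.asarray([
        [0, 0, 0],      # class 0: background -> black
        [255, 0, 0],    # class 1: foreground -> red
    ], dtype=np.uint8)
```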

Let me know if you encounter any other problems.

rishizek avatar May 29 '18 14:05 rishizek

Dear @rishizek, I have been trying to do the same, that is, use DeepLab v3+ to train on my own dataset with a different number of classes.

First of all, I created my .record files using create_pascal_tf_record.py. After that, I changed _NUM_CLASSES, _HEIGHT, and _WIDTH in train.py to the values for my own problem (2 classes and 720x720 images). I also changed the color map. When running train.py on the newly created records, I ran into the following error. It seems to happen within a session, but I do not know what I missed...

File "/home/user/Envs/deeplearning/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [2] rhs shape= [21] [[Node: save/Assign_57 = Assign[T=DT_FLOAT, _class=["loc:@decoder/upsampling_logits/conv_1x1/biases"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/upsampling_logits/conv_1x1/biases/Momentum, save/RestoreV2/_1)]] [[Node: save/RestoreV2/_1842 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="

Is there something else that I should avoid restoring? From my understanding, the class information is only used on the decoder side, right? Thanks in advance.

esterglez avatar Jul 02 '18 16:07 esterglez

Hi @esterglez , Thank you for your interest in the repo.

I'm not sure what the exact problem is, but it seems that the number of classes of the PASCAL dataset (21) remains somewhere: either your model architecture has a last layer with 21 classes while the saved checkpoint has 2, or vice versa. This mismatch produces the error when loading the checkpoint.

You can check with TensorBoard whether your architecture's last layer really has 2 classes. If the architecture is correct, then the problem is that you are trying to load a checkpoint with 21 classes. This sometimes happens when your model_dir is not clean: you first trained the model with PASCAL data, so a checkpoint with 21 classes was generated, and afterwards you tried to train the model on your own dataset (2 classes) and failed to load that checkpoint. You may need to clean model_dir in that case.
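You can also inspect the checkpoint directly instead of using TensorBoard; a minimal sketch using the TF 1.x checkpoint reader (the model_dir path is a placeholder, and the scope name is taken from the error message above):

```python
import tensorflow as tf

# Print the shapes of the final logits variables in the latest checkpoint:
# a bias shape of [21] means PASCAL classes, [2] means the new dataset.
ckpt = tf.train.latest_checkpoint('./model_dir')
reader = tf.train.NewCheckpointReader(ckpt)
for name, shape in reader.get_variable_to_shape_map().items():
    if 'upsampling_logits/conv_1x1' in name:
        print(name, shape)
```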

I hope this helps.

rishizek avatar Jul 04 '18 13:07 rishizek

Dear @rishizek ,

"This sometimes happens when your model_dir is not clean. Namely, you first trained model with PASCAL data, then checkpoint is generated with # of classes = 21, and after that you tried to train model with your dataset (# of classes = 2) and failed to load the checkpoint. You may need to clean model_dir in that case."

This was exactly what was happening to me, so thank you very much for your help ;). Now I can continue.

esterglez avatar Jul 04 '18 15:07 esterglez

@esterglez Are you using the pre-trained model? If so, you have to stop the last layer from being initialized from the pre-trained model. You have two options for this: define another last layer with the same structure and initialize it manually, or use stop_restore_last_layer; see the sketch below.
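For the second option, a minimal TF 1.x sketch (the scope name comes from the error above; the checkpoint path is a placeholder, and this illustrates the general technique rather than this repo's exact code):

```python
import tensorflow as tf

# Collect all variables except the final 1x1 logits convolution,
# whose shape depends on the number of classes.
exclude = ['decoder/upsampling_logits/conv_1x1', 'global_step']
variables_to_restore = tf.contrib.slim.get_variables_to_restore(
    exclude=exclude)

# Restore only those variables from the pre-trained checkpoint; the
# excluded layer keeps its fresh initialization and trains from scratch.
tf.train.init_from_checkpoint(
    './model/model.ckpt',
    {v.name.split(':')[0]: v for v in variables_to_restore})
```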

Sam813 avatar Jul 05 '18 08:07 Sam813

@Sam813 Could you please give more details on how to change the code to do this? Thank you very much.

Pandabuaa avatar May 20 '20 03:05 Pandabuaa