robot-surgery-segmentation

Program failed to train, I am using one GPU to run the program

Open SMKamrulHasan opened this issue 5 years ago • 7 comments

num train = 0, num_val = 0
Traceback (most recent call last):
  File "train.py", line 157, in <module>
    main()
  File "train.py", line 152, in main
    num_classes=num_classes
  File "/content/drive/My Drive/surgery/data/utils.py", line 56, in train
    model.load_state_dict(state['model'])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 719, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DataParallel:
    Missing key(s) in state_dict: "module.encoder.0.weight", "module.encoder.0.bias", ...

SMKamrulHasan, Oct 25 '18 01:10

First of all, num train = 0, num_val = 0 looks strange. Are you sure that your DataLoader defined in https://github.com/ternaus/robot-surgery-segmentation/blob/master/dataset.py is correct?

ternaus, Oct 25 '18 01:10

Second, model.load_state_dict(state['model']) is trying to load a saved model, which happens when your runs/debug folder is not empty.

Can you delete it and try again?

ternaus, Oct 25 '18 01:10
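
A missing-key error like the one in the traceback can be narrowed down by comparing the key names the model expects with those stored in the checkpoint. The following is a minimal sketch in plain Python, assuming both are available as lists of key names (for example from model.state_dict().keys() and state['model'].keys()). The helpers diff_keys and strip_module_prefix are hypothetical and not part of the repository, and the key lists are illustrative only.

```python
def diff_keys(expected, found):
    """Return (missing, unexpected) key names, each sorted."""
    expected, found = set(expected), set(found)
    return sorted(expected - found), sorted(found - expected)


def strip_module_prefix(keys, prefix="module."):
    """Drop the prefix that nn.DataParallel adds to parameter names."""
    return [k[len(prefix):] if k.startswith(prefix) else k for k in keys]


# Illustrative key names only; real lists would come from the model and the checkpoint.
expected = ["module.encoder.0.weight", "module.encoder.0.bias"]
found = ["encoder.0.weight", "encoder.0.bias"]

missing, unexpected = diff_keys(expected, found)
print("missing:", missing)
print("unexpected:", unexpected)

# If the lists agree once the prefix is removed, only the DataParallel wrapping differs.
same_after_strip = diff_keys(strip_module_prefix(expected), found) == ([], [])
print("same after stripping prefix:", same_after_strip)
```

If the lists still differ after the prefix is removed, the checkpoint probably belongs to a different model or problem type, and starting from an empty runs/debug folder as suggested above is the simpler route.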

> Second, model.load_state_dict(state['model']) is trying to load a saved model, which happens when your runs/debug folder is not empty.
>
> Can you delete it and try again?

Yes, I deleted the "runs/debug" folder and tried again. That solved the "RuntimeError: Error(s) in loading state_dict for DataParallel" problem, but I still get "num train = 0, num_val = 0".

python prepare_train_val.py
python train.py --device-ids 0 --batch-size 16 --fold $3 --workers 12 --lr 0.00001 --n-epochs 20 --type binary --jaccard-weight 1 --model UNet16

Log:
num train = 0, num_val = 0
Epoch 1, lr 1e-05: : 0it [00:00, ?it/s]
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Valid loss: nan, jaccard: nan
Epoch 2, lr 1e-05: : 0it [00:00, ?it/s]
Valid loss: nan, jaccard: nan
Epoch 3, lr 1e-05: : 0it [00:00, ?it/s]
Valid loss: nan, jaccard: nan

SMKamrulHasan, Oct 25 '18 01:10
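
The nan values in this log most likely follow from the empty splits: with zero batches per epoch, the "Mean of empty slice" warning indicates an average taken over nothing. The following is a minimal sketch of a guard that stops before training starts, assuming the two counts are already known at that point. check_split_sizes is a hypothetical helper, not something train.py provides.

```python
def check_split_sizes(num_train, num_val):
    """Raise early instead of training on empty splits."""
    if num_train == 0 or num_val == 0:
        raise ValueError(
            f"empty split: num_train={num_train}, num_val={num_val}; "
            "check the data paths before training"
        )


try:
    check_split_sizes(0, 0)
except ValueError as err:
    message = str(err)
    print(message)
```

Failing here keeps the real problem (no files found) from being hidden behind nan losses several epochs later.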

> First of all, num train = 0, num_val = 0 looks strange. Are you sure that your DataLoader defined in https://github.com/ternaus/robot-surgery-segmentation/blob/master/dataset.py is correct?

And my folder arrangement is:

surgery/data/models/
surgery/data/train/instrument_dataset_1
surgery/data/test/instrument_dataset_1
surgery/data/cropped_train/instrument_dataset_1
surgery/data/train.py
surgery/data/model.py
surgery/data/prepare_data.py
surgery/data/prepare_train_val.py
surgery/data/dataset.py

SMKamrulHasan, Oct 25 '18 02:10
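
One way to check a layout like this is to count the files the loader would actually see. The following is a minimal sketch, assuming the images sit in an images subfolder of each instrument_dataset_N directory and that the data root is given relative to the working directory. The subfolder name, the data/cropped_train root, and the count_images helper are all assumptions for illustration and are not confirmed by this thread.

```python
from pathlib import Path


def count_images(root, subfolder="images"):
    """Map each instrument_dataset_* directory under root to its file count."""
    root = Path(root)
    if not root.is_dir():
        # A root that does not exist yields no datasets at all.
        return {}
    counts = {}
    for dataset_dir in sorted(root.glob("instrument_dataset_*")):
        image_dir = dataset_dir / subfolder
        # A missing subfolder counts as zero files rather than raising.
        counts[dataset_dir.name] = (
            sum(1 for p in image_dir.iterdir() if p.is_file())
            if image_dir.is_dir() else 0
        )
    return counts


if __name__ == "__main__":
    # The path is resolved relative to the current working directory,
    # so the result depends on where the script is launched from.
    root = Path("data") / "cropped_train"
    print("looking in:", root.resolve())
    print(count_images(root))
```

An empty result, or zeros for every dataset, would suggest that the relative path does not point at the cropped images from the directory where train.py is launched.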

Can you give me the DATASET from the surgery/data/train/instrument_dataset_1 and surgery/data/test/instrument_dataset_1?

kimdinhthaibk, May 27 '19 13:05

For anyone else encountering this error: check whether you changed the problem type, for example:
model = get_model(model_path, model_type='UNet11', problem_type='instruments')

zapaishchykova, Jul 12 '19 09:07

> Can you give me the DATASET from the surgery/data/train/instrument_dataset_1 and surgery/data/test/instrument_dataset_1?

https://github.com/ternaus/robot-surgery-segmentation/issues/3#issuecomment-384948063 you might find this link useful.

Di1113, Jul 30 '19 08:07