Crashing intermittently when training psp_net with VOC dataset.

Open • andreasrobinson opened this issue 7 years ago • 10 comments

I have been getting this error consistently before training manages to complete a single epoch:

[epoch 1], [iter 630 / 8498], [train main loss 1.32866], [train aux loss 1.31173]. [lr 0.0049055503]
Traceback (most recent call last):
  File "train.py", line 252, in <module>
    main()
  File "train.py", line 105, in main
    train(train_loader, net, criterion, optimizer, curr_epoch, args, val_loader, visualize)
  File "train.py", line 113, in train
    for i, data in enumerate(train_loader):
  File "/home/andreas/anaconda2/envs/env/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 201, in __next__
    return self._process_next_batch(batch)
  File "/home/andreas/anaconda2/envs/env/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 221, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/home/andreas/anaconda2/envs/env/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 40, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/andreas/Dropbox/src/pytorch-semantic-segmentation/datasets/voc.py", line 99, in __getitem__
    img, mask = torch.stack(img_slices, 0), torch.stack(mask_slices, 0)
  File "/home/andreas/anaconda2/envs/env/lib/python2.7/site-packages/torch/functional.py", line 59, in stack
    raise ValueError("stack expects a non-empty sequence of tensors")
ValueError: stack expects a non-empty sequence of tensors

andreasrobinson • Jan 08 '18 03:01
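
The traceback in the report above shows the exception being raised inside the dataset's `__getitem__`, so one way to narrow it down is to index the dataset directly, without the DataLoader workers, and record which indices fail. The following is a minimal sketch, assuming only that the dataset supports `len()` and integer indexing; `find_failing_indices` is a hypothetical helper name and is not part of the repository.

```python
def find_failing_indices(dataset, exc_types=(ValueError,)):
    """Index every sample once and collect (index, message) for those that raise.

    `dataset` is assumed to support len() and integer indexing, as a
    map-style PyTorch dataset does. This helper is illustrative only.
    """
    failures = []
    for idx in range(len(dataset)):
        try:
            dataset[idx]  # triggers the same __getitem__ the DataLoader calls
        except exc_types as exc:
            failures.append((idx, str(exc)))
    return failures
```

If the loader applies random augmentation, a sample might fail on some passes and not on others, so a single clean scan does not prove every sample is safe; repeating the scan a few times may be needed.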

I ran into the same problem. It seems that the data augmentation code doesn't work correctly. Can you check it, @ZijunDeng?

lzj322 • Jan 09 '18 09:01

@andreasrobinson @lzj322 Have you fixed this problem? I ran into it as well.

zhijiew • Jan 11 '18 03:01

I found it!

The error is caused by a few training samples. There may have been a mistake when the data was preprocessed, so I simply deleted those samples and my training code now runs.

Just delete these lines in train.txt: 724, 1237, 3572, 3920, 4688, 7031

zhijiew • Jan 12 '18 07:01
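
The workaround above can be scripted so that all listed lines are removed in one pass. This matters because deleting them one at a time in an editor shifts the numbering of every later line. The sketch below assumes the numbers in the comment are 1-indexed positions in the unmodified train.txt, which the thread does not state explicitly; `drop_lines` and `filter_list_file` are hypothetical helper names.

```python
def drop_lines(lines, line_numbers):
    """Return `lines` without the entries at the given 1-indexed positions."""
    skip = set(line_numbers)
    return [line for n, line in enumerate(lines, start=1) if n not in skip]


def filter_list_file(src_path, dst_path, line_numbers):
    """Copy src_path to dst_path, omitting the given 1-indexed lines.

    Writes to a separate file so the original list is kept as a backup.
    Returns the number of lines written.
    """
    with open(src_path) as src:
        kept = drop_lines(src.readlines(), line_numbers)
    with open(dst_path, "w") as dst:
        dst.writelines(kept)
    return len(kept)
```

For example, `filter_list_file("train.txt", "train_filtered.txt", [724, 1237, 3572, 3920, 4688, 7031])` would write the filtered list to a new file (the output filename here is an assumption), leaving the original untouched.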

@littlebelly, did you get a successful training result? Did you change any of the author's code?

lzj322 • Jan 15 '18 07:01

I could train the network after deleting the training data I mentioned above, but there were still errors in the validation step, so I gave up and used the official Caffe code provided by the author instead.

I suspect these errors are caused by the image slicing operation, which is needed for the Cityscapes dataset because of its large images but is unnecessary for the VOC 2012 dataset.

I hope @ZijunDeng can help to solve these errors~

zhijiew • Jan 15 '18 07:01
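
The traceback at the top of the thread shows `torch.stack` receiving an empty list of slices, which fits the suspicion above that the slicing step yields no windows for some samples. The thread does not establish why that happens in the repository's code, so the following is only a sketch of one way to place sliding-window crops along an axis so that the result is never empty. `slice_starts` is a hypothetical function and not the repository's implementation.

```python
def slice_starts(length, crop_size, stride):
    """Start offsets of crop windows along one image axis; never empty.

    Hypothetical helper: an image shorter than the crop gets a single
    window at offset 0 (to be padded to crop_size by the caller).
    """
    if crop_size <= 0 or stride <= 0:
        raise ValueError("crop_size and stride must be positive")
    if length <= crop_size:
        return [0]
    starts = list(range(0, length - crop_size + 1, stride))
    if starts[-1] != length - crop_size:
        # Add a final window flush with the far edge so no pixels are skipped.
        starts.append(length - crop_size)
    return starts
```

With this placement, a VOC-sized image that fits inside one crop produces exactly one slice, so the stacked batch has a leading dimension of 1 instead of failing.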

@littlebelly Thank you for your reply. I changed the code and removed the slice operation, but I still hit an error in the model.

Have you seen this error before?

lzj322 • Jan 15 '18 09:01

Reverting to this commit may help solve all the problems. After that, try to make the code compatible with Python 3.

iliadsouti • Jan 15 '18 09:01

Hi everyone, it seems there are some problems with the VOC dataset loader. I will check the code and fix the bugs later on (I am busy with other things at the moment :-( ).

zijundeng • Jan 15 '18 10:01

@iliadsouti It really works, thank you very much.

lzj322 • Jan 15 '18 15:01

+1, confirmed that @iliadsouti's suggestion of reverting to that commit avoids the issue with the input to batch norm for PSPNet. If I have time, I'll try to submit a PR with a fix against current master.

peteflorence • Jan 31 '18 02:01