pytorch-0.4-yolov3

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. ...: 1333

Open mrkieumy opened this issue 5 years ago • 14 comments

Hi @andy-yun, I get this error (the same as #33):

```
Traceback (most recent call last):
  File "train.py", line 385, in <module>
    main()
  File "train.py", line 160, in main
    nsamples = train(epoch)
  File "train.py", line 229, in train
    for batch_idx, (data, target) in enumerate(train_loader):
  File "/home/kieumy/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 623, in __next__
    return self._process_next_batch(batch)
  File "/home/kieumy/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/home/kieumy/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/kieumy/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/kieumy/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 232, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/kieumy/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 209, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 480 and 416 in dimension 2 at /pytorch/aten/src/TH/generic/THTensorMoreMath.cpp:1333
```

I think the problem is caused by the get_different_scale() method, because when I turn it off by setting shape = (img.width, img.height), the error goes away. I set my image width and height to 544 x 480, because the original size is 640 x 512 and I don't want to scale down too much (to 416 x 416); 544 x 480 is still divisible by 32. Do you have any recommendation to fix this error? Thanks & best regards.

mrkieumy avatar Mar 18 '19 16:03 mrkieumy

@mrkieumy You can refer to the same issue at https://github.com/marvis/pytorch-yolo2/issues/89

Here's the reason. https://medium.com/@yvanscher/pytorch-tip-yielding-image-sizes-6a776eb4115b
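To illustrate the reason, here is a minimal, self-contained sketch (not the repo's code; the dataset class is hypothetical) of why default_collate fails when images inside one batch come out at different scales:

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Toy dataset (hypothetical): samples come back at different spatial
# sizes, which is exactly what happens when the multi-scale switch
# fires in the middle of a batch.
class VaryingSizeDataset(Dataset):
    def __len__(self):
        return 4

    def __getitem__(self, index):
        size = 416 if index < 2 else 480
        return torch.zeros(3, size, size)

loader = DataLoader(VaryingSizeDataset(), batch_size=4)
for batch in loader:      # default_collate calls torch.stack(batch, 0),
    print(batch.shape)    # which raises: "RuntimeError: ... Sizes of
                          # tensors must match except in dimension 0"
```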

The solution is to set batch_size=1, or, in get_different_scale(), change the 64 to self.batch_size (re-download dataset.py).

andy-yun avatar Mar 19 '19 11:03 andy-yun

Thanks @andy-yun. I changed the 64 to self.batch_size (re-downloaded the dataset file), but it still errors. If I set batch_size=1, does that mean the dataloader loads one image at a time and the network trains with batch=1? Is that right? If so, that's not good, because we want to train with the largest possible batch_size. Any help is appreciated. Thanks & best regards.

mrkieumy avatar Mar 25 '19 14:03 mrkieumy

@mrkieumy yup, setting batch_size=1 is recommended for the test environment. How many GPUs do you use? I wonder whether different image sets are being used together.

andy-yun avatar Mar 27 '19 00:03 andy-yun

Hi @andy-yun, I have only 1 GPU. For the test step the batch size is always 2 images; when I set it to 1, it errors. But for training we don't want to set batch_size=1, right? We want to train with as large a batch size as possible, and my GPU (GTX 1080) can train v3 with a batch size of 8 at most. For now I have commented out the call to get_different_scale() and train only with the constant shape (544, 480), but the result will be worse than training with different scales. How can I use different scales without setting batch_size=1? Thanks.

mrkieumy avatar Mar 27 '19 10:03 mrkieumy

Hi @mrkieumy, would you change the hard-coded 64 to be based on self.batch_size? Line 57 of dataset.py:

if index % 64 == 0: --> if index % (self.batch_size * 10) == 0:
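For reference, a minimal sketch of the idea behind this change, under the assumption that dataset.py picks a new scale inside __getitem__ (names follow the thread; the bodies are illustrative, not the repo's exact code):

```python
import random

class ListDataset:  # illustrative stand-in for the repo's dataset class
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.shape = (416, 416)

    def get_different_scale(self):
        # Pick a random square size that is a multiple of 32.
        wh = (random.randint(0, 9) + 10) * 32   # 320, 352, ..., 608
        return (wh, wh)

    def __getitem__(self, index):
        # Rescale only every (batch_size * 10) samples instead of every
        # 64, so the shape never changes mid-batch when batch_size does
        # not divide 64.
        if index % (self.batch_size * 10) == 0:
            self.shape = self.get_different_scale()
        ...  # load the image and resize it to self.shape
```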

After checking the above code, please report back. Thanks.

andy-yun avatar Mar 30 '19 04:03 andy-yun

Hi @andy-yun, I changed everything exactly as you said, but it still errors. I also tried crop=True with those sizes, but it errors the same way. Do you know where the problem is? How can you train with different scales without this error? If I understand correctly, every 10*batch_size samples the shape is randomized in get_different_scale() (with equal width and height), and the loader then loads images at that shape. The shape is supposed to be the same within a batch, but instead it raises the error about different dimensions within the batch. How can I make every batch have the same shape? Thanks.

mrkieumy avatar Apr 02 '19 21:04 mrkieumy

@mrkieumy I don't know what the exact problem is, but in my opinion the code works well for other people, so I suspect your dataset and environment. Cheers.

andy-yun avatar Apr 02 '19 23:04 andy-yun

@andy-yun, thanks for your reply. After printing the index I saw that the dataloader loads images shuffled, so the indices are not in order. I noticed that self.seen increases in order, so I changed:

if index % (self.batch_size*10) == 0: --> if self.seen % (self.batch_size*10) == 0:

It has worked for 20 epochs so far. I hope that was the final piece needed to solve this problem; I don't know whether it is correct or not. I'll let you know if anything else comes up.
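A hedged sketch of that change, continuing the illustrative ListDataset from earlier: with shuffle=True the sampler hands out random indices, so index % N == 0 can fire mid-batch, while self.seen counts samples in serving order.

```python
class ListDataset:  # continuing the illustrative sketch above
    def __getitem__(self, index):
        # `index` is randomized by the shuffling sampler, but `self.seen`
        # increases by one per served sample, so this check fires only on
        # batch-aligned counts (assuming a single-process loader; each
        # worker process otherwise keeps its own copy of the counter).
        if self.seen % (self.batch_size * 10) == 0:
            self.shape = self.get_different_scale()
        self.seen += 1
        ...  # load the image and resize it to self.shape
```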

One other thing: in your repo, you should fix lines 425 and 427 of darknet.py: save_fc(fc, model) --> save_fc(fp, model), because fc was never declared there; it must be fp (the file). Since YOLOv3 has no fully connected layer, nobody had used this code path, but in my case I added some fully connected layers. The remaining problem is that I still cannot save the weights of the fully connected layers, because save_fc() in cfg.py complains that the fc has no bias and weight properties. For now I save the whole model instead. Lastly, can you help me explain #59? Thanks.
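For context, a hypothetical sketch of what a darknet-style save_fc has to do (the real save_fc in cfg.py may differ): it reads .bias and .weight straight off the layer, which is why passing anything but the nn.Linear itself, e.g. an nn.Sequential wrapper, fails with the "no bias/weight" error:

```python
import numpy as np
import torch.nn as nn

def save_fc(fp, fc):
    # fp: an open binary file; fc: an nn.Linear. Darknet-style weight
    # files store biases first, then weights, as raw float32.
    fc.bias.data.cpu().numpy().astype(np.float32).tofile(fp)
    fc.weight.data.cpu().numpy().astype(np.float32).tofile(fp)

fc = nn.Linear(1024, 10)      # works: nn.Linear exposes .weight/.bias
wrapped = nn.Sequential(fc)   # has neither attribute itself
with open('fc.weights', 'wb') as fp:
    save_fc(fp, fc)           # save_fc(fp, wrapped) -> AttributeError
```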

mrkieumy avatar Apr 03 '19 09:04 mrkieumy

Thanks @mrkieumy, I updated the code.

andy-yun avatar Apr 03 '19 13:04 andy-yun

I modified my code, but the problem still exists:

```
Traceback (most recent call last):
  File "train.py", line 379, in <module>
    main()
  File "train.py", line 156, in main
    nsamples = train(epoch)
  File "train.py", line 222, in train
    for batch_idx, (data, target) in enumerate(train_loader):
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 336, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 357, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 106, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 187, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 164, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 416 and 480 in dimension 2 at /pytorch/aten/src/TH/generic/THTensorMath.cpp:3616
```

I train on the VOC dataset; the image size is 416*416, batch_size = 8, and the number of GPUs is 1. Do you have any recommendation to fix this error?

zhangguotai avatar May 09 '19 10:05 zhangguotai

@zhangguotai I updated dataset.py and train.py. Try them. Refer to https://discuss.pytorch.org/t/runtimeerror-invalid-argument-0-sizes-of-tensors-must-match-except-in-dimension-0-got-3-and-2-in-dimension-1/23890/15
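The usual workaround from that discuss thread is a custom collate_fn; a minimal sketch of the pattern (the updated dataset.py apparently ships something similar, quoted later in this thread):

```python
import torch

def custom_collate(batch):
    # Stack images and targets explicitly instead of relying on
    # default_collate; all images in `batch` must share one shape,
    # which the batch-aligned rescale above guarantees.
    data = torch.stack([item[0] for item in batch], 0)
    target = torch.stack([item[1] for item in batch], 0)
    return data, target

# Assumed usage (train_dataset and batch_size as in train.py):
# train_loader = torch.utils.data.DataLoader(
#     train_dataset, batch_size=batch_size,
#     shuffle=True, collate_fn=custom_collate)
```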

andy-yun avatar May 09 '19 14:05 andy-yun

I have the same problem, and it seems I have downloaded the updated source code. Can you help me with this problem? I'm having the problem after epoch 15. [screenshot of the error]

richard0326 avatar Jun 14 '19 05:06 richard0326

I met the same problem after epoch 15 (pytorch 1.0, python 3.6.3, my own data, 4 GPUs). [screenshot of the error]

Through reading the previous problems and solutions, I guess the problem is in dataset.py, line 53:

```python
def get_different_scale(self):
    if self.seen < 4000*self.batch_size:
        wh = 13*32                          # 416
    elif self.seen < 8000*self.batch_size:
        wh = (random.randint(0,3) + 13)*32  # 416, 480
    elif self.seen < 12000*self.batch_size:
        wh = (random.randint(0,5) + 12)*32  # 384, ..., 544
    .....
```

so maybe we get different shapes in the same batch (dataset.py, line 14):

```python
def custom_collate(batch):
    data = torch.stack([item[0] for item in batch], 0)
```

e.g. [X,X,416,X] and [X,X,317,X].

Although the shape change only happens at the self.seen < xx*self.batch_size boundaries, maybe the error is due to multi-GPU? I just have this guess, and I don't know how to solve it. I found that many people have the same question, so maybe this problem is important. Looking forward to your reply~

sgflower66 avatar Aug 29 '19 07:08 sgflower66

In my case, the problem disappeared when I didn't use the savemodel() function. I suppose the problem appears after cur_model.save_weights(). Also, in my case the training set satisfies len(train_dataset) % batch_size == 0.

Ginbor avatar Nov 02 '19 11:11 Ginbor