pytorch-0.4-yolov3
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. ...: 1333
Hi @andy-yun ,
I met this error (the same as #33):
Traceback (most recent call last):
File "train.py", line 385, in
I think the problem is caused by the get_different_scale() method, because when I turn it off by setting shape = (img.width, img.height), the error goes away. I set my image width and height to 544 x 480, because the original size is 640x512 and I don't want to scale down too much (to 416x416), so I used 544 x 480 (it is still divisible by 32). Do you have any recommendation to fix this error? Thanks & best regards.
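As an aside, the size constraint mentioned above can be checked mechanically: YOLO downsamples by a total stride of 32, so both dimensions must be multiples of 32. A tiny sketch with a hypothetical helper (not part of the repo):

```python
def valid_yolo_size(w, h, stride=32):
    """True if both dimensions are multiples of the network stride."""
    return w % stride == 0 and h % stride == 0

print(valid_yolo_size(544, 480))  # True: 544 = 17*32, 480 = 15*32
print(valid_yolo_size(640, 512))  # True: the original size also fits
```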
@mrkieumy You can refer the same issue at https://github.com/marvis/pytorch-yolo2/issues/89
Here's the reason. https://medium.com/@yvanscher/pytorch-tip-yielding-image-sizes-6a776eb4115b
The solution is to set batch_size=1, or, in get_different_scale(), to change the 64 to self.batch_size. (Re-download dataset.py.)
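The point of that change can be sketched as follows (illustrative names, not the repo's exact code): the rescale trigger must fire on batch boundaries, and the hard-coded 64 only does so when batch_size happens to divide 64.

```python
import random

def maybe_new_scale(seen, batch_size, current_wh):
    """Pick a new square input size only at a batch boundary, so that
    every image inside one batch is resized to the same shape."""
    if seen % (batch_size * 10) == 0:            # was: seen % 64 == 0
        return (random.randint(0, 8) + 10) * 32  # 320, 352, ..., 576
    return current_wh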
Thanks @andy-yun. I changed 64 to self.batch_size (re-downloaded the dataset file), but the error persists. If I set batch_size=1, does that mean the dataloader loads one image at a time and the network trains with batch=1? Is that right or not? If so, that is not good, because we want to train with the largest possible batch_size. Any help is appreciated. Thanks & best regards.
@mickolka yup. Setting batch_size=1 is recommended for the test environment. How many GPUs do you use? I wonder whether different image sets are being used together.
Hi @andy-yun, I have only 1 GPU. For the test step the batch size is always 2 images; when I set it to 1, it errors. But for training we don't want to set batch_size=1, right? Because we want to train with as large a batch size as possible. My GPU (GTX 1080) can train V3 with a batch size of 8 at most. For now I commented out the call to get_different_scale() and train only with the constant shape (544, 480). But the result will be worse compared to training with different scales. How can I use different scales without setting batch_size=1? Thanks.
Hi @mrkieumy Would you change the following 64 to self.batch_size? Line 57 of dataset.py: if index % 64 == 0: --> if index % (self.batch_size * 10) == 0:
After checking the above code, please report back. Thanks.
Hi @andy-yun, I changed everything exactly as you said, but it still errors. I also tried crop=True with those sizes, but it errors the same way. Do you know where the problem is? How can you train with different scales without this error? If I understand correctly, every 10*batch_size images the shape is randomized in get_different_scale() (the same width and height), and the data loader loads images at that shape. Every image within a batch is supposed to have the same shape; instead, it raises the error about different dimensions within the batch. How do I make every batch have the same shape? Thanks.
@mrkieumy I don't know what the exact problem is. But in my opinion the code works well for other people, so I suspect your dataset and environment. Cheers.
@andy-yun, thanks for your reply. After printing the index I saw that the dataloader loads images shuffled, so the indices are not in order. I noticed that self.seen increases monotonically, so I changed: if index % (self.batch_size * 10) == 0: --> if self.seen % (self.batch_size * 10) == 0: It has worked for 20 epochs so far. I hope that was the final piece needed to solve this problem; I don't know whether it is fully correct. I'll let you know if anything else comes up.
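The seen-based fix described above can be simulated without PyTorch. The class below is a hypothetical stand-alone sketch: keying the rescale to a monotonically increasing seen counter guarantees each batch gets exactly one shape, even when the sampler shuffles indices.

```python
import random

class ScaleSchedule:
    """Sketch of the fix: rescale on self.seen, not the shuffled index."""
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.seen = 0          # monotonic, unlike the dataset index
        self.wh = 416

    def next_shape(self):
        # rescale only every batch_size * 10 images, as in the thread
        if self.seen % (self.batch_size * 10) == 0:
            self.wh = (random.randint(0, 8) + 10) * 32
        self.seen += 1
        return self.wh

sched = ScaleSchedule(batch_size=8)
shapes = [sched.next_shape() for _ in range(160)]
# every consecutive batch of 8 shares exactly one shape
assert all(len(set(shapes[i:i + 8])) == 1 for i in range(0, 160, 8))
```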
One other thing: in your repo, you should replace lines 425 and 427 in darknet.py: save_fc(fc, model) --> save_fc(fp, model), because fc was not declared; it must be fp (the file handle). Since yolov3 doesn't have a fully connected layer, nobody has used this path, but in my case I added some fully connected layers. The new problem is that I cannot save the weight file of the fully connected layers, because save_fc in cfg.py says fc doesn't have bias and weight properties. For now I save the model first. Lastly, can you help me explain #59? Thanks.
Thanks @mrkieumy I updated codes.
I modified my code, but the problem still exists.
Traceback (most recent call last):
File "train.py", line 379, in
I train on the VOC dataset; the image size is 416*416, batch_size = 8, and I use 1 GPU. Do you have any recommendation to fix this error?
@zhangguotai I updated the code dataset.py and train.py. Try them. Refer to https://discuss.pytorch.org/t/runtimeerror-invalid-argument-0-sizes-of-tensors-must-match-except-in-dimension-0-got-3-and-2-in-dimension-1/23890/15
I have the same problem, and it seems I have downloaded the updated source code.
Can you help me with this problem?
I'm having the problem after epoch 15.
I met the same problem after epoch 15. (pytorch1.0, python 3.6.3, my own data, 4 gpus)
Through reading previous problems and solutions, I guess the problem is in dataset.py line 53:

    def get_different_scale(self):
        if self.seen < 4000*self.batch_size:
            wh = 13*32                          # 416
        elif self.seen < 8000*self.batch_size:
            wh = (random.randint(0,3) + 13)*32  # 416, 480
        elif self.seen < 12000*self.batch_size:
            wh = (random.randint(0,5) + 12)*32  # 384, ..., 544
        ...

so maybe we get different shapes in the same batch (dataset.py line 14):

    def custom_collate(batch):
        data = torch.stack([item[0] for item in batch], 0)

e.g. stacking [X,X,416,X] and [X,X,317,X].
Although the shape change only happens at the self.seen < xx*self.batch_size boundaries, maybe the error is due to multi-GPU? This is just a guess, and I don't know how to solve it. I found that many people have the same question, so maybe the problem is important. Looking forward to your reply~
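The collate failure described above is just torch.stack's shape requirement: every tensor in the batch must have an identical shape beyond dimension 0. A toy reproduction using shape tuples instead of real tensors (hypothetical helper, no PyTorch needed):

```python
def can_stack(shapes):
    """Mimics torch.stack's constraint: all item shapes must be equal."""
    return len(set(shapes)) <= 1

same  = [(3, 416, 416)] * 4
mixed = [(3, 416, 416), (3, 416, 416), (3, 480, 480), (3, 480, 480)]
print(can_stack(same))   # True
print(can_stack(mixed))  # False -> "Sizes of tensors must match ..."
```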
In my case, the problem disappeared when I didn't use the savemodel() function, so I suppose the problem appears after cur_model.save_weights(). Also, in my case, the training set satisfies len(train_dataset) % batch_size == 0.