
Sporadic message : input tensors must be on the same device. Received cpu and cuda:0

Open · wvalcke opened this issue on Nov 01 '20 · 8 comments

During training I get a sporadic message: input tensors must be on the same device. Received cpu and cuda:0. In the end it does not seem to have any influence on the training, which works correctly; I'm just wondering if somebody else has seen the same kind of warning. I checked the code and I cannot directly see anything wrong. If somebody has the same issue, or some ideas, let me know.
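For context, this is PyTorch's standard device-mismatch error. A minimal sketch that triggers the same class of error (assuming a CUDA device is available; the exact wording varies across PyTorch versions):

    import torch

    cpu_loss = torch.tensor(0.0)           # created on the CPU
    gpu_loss = torch.tensor(1.0).cuda()    # lives on cuda:0

    # Combining tensors from different devices raises a RuntimeError
    # complaining about mismatched devices (wording depends on the version)
    torch.stack([cpu_loss, gpu_loss])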

Kind regards

wvalcke avatar Nov 01 '20 15:11 wvalcke

I have the same issue, haven't found a solution yet.

xvyaward avatar Nov 06 '20 12:11 xvyaward

I think I just found a solution to this problem; see #174.

I'm at iteration 200 now and it's working well so far. I hope this helps :)

xvyaward avatar Nov 06 '20 12:11 xvyaward

Hi @xvyaward

Indeed, that was an error in the code. Thanks for pointing this out. The reason it only occurs sporadically is that it happens when the training set contains a 'hard negative': an image on which no object should be detected. When such an image passes through the network, you get the error. A tensor value of 0 is added to the regression loss, which is logical since no detection is expected, but it was not moved to the CUDA device when training on CUDA. I trained again and did not see the error appear anymore. Can you confirm that you also have at least one hard negative sample in your dataset? Then this issue can be closed.
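In code, the failure mode looks roughly like this (a simplified sketch, not the literal lines from losses.py):

    import torch

    regression_losses = []
    # An image with annotations produces a loss on the GPU:
    regression_losses.append(torch.tensor(1.23).cuda())
    # A hard negative appended a constant zero -- created on the CPU:
    regression_losses.append(torch.tensor(0).float())

    # Reducing the collected losses then mixes devices and raises the error
    torch.stack(regression_losses).mean()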

wvalcke avatar Nov 06 '20 17:11 wvalcke

I used train2017 of the COCO dataset; I'm not sure whether it contains hard negative samples. I guess all the images in train2017 have at least one annotation.
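If you want to check, a short pycocotools snippet (assuming the standard instances_train2017.json annotation file) counts the images without any annotations:

    from pycocotools.coco import COCO

    coco = COCO('annotations/instances_train2017.json')
    img_ids = coco.getImgIds()
    # Images with no annotation ids are the 'hard negatives' discussed above
    empty = [i for i in img_ids if len(coco.getAnnIds(imgIds=i)) == 0]
    print(f'{len(empty)} of {len(img_ids)} images have no annotations')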

xvyaward avatar Nov 07 '20 05:11 xvyaward

Thank you @wvalcke. I removed all hard negative samples, and it worked!

But can you help make the hard negative samples work too? I want to train with these hard negative samples included.

ccl-private avatar Mar 17 '21 12:03 ccl-private

You need to check losses.py:

                if torch.cuda.is_available():
                    # This branch handles images with no annotations (hard negatives):
                    # every anchor is treated as background.
                    alpha_factor = torch.ones(classification.shape).cuda() * alpha

                    alpha_factor = 1. - alpha_factor
                    focal_weight = classification
                    focal_weight = alpha_factor * torch.pow(focal_weight, gamma)

                    # Binary cross-entropy against the all-background target
                    bce = -(torch.log(1.0 - classification))

                    # cls_loss = focal_weight * torch.pow(bce, gamma)
                    cls_loss = focal_weight * bce
                    classification_losses.append(cls_loss.sum())
                    # The zero regression loss must live on the same device as the rest
                    regression_losses.append(torch.tensor(0).float().cuda())

In your copy, the last line (around line 62) is probably missing .cuda() at the end; add it and you can train with hard negative samples. I did this and it works correctly.
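An alternative sketch (not what the repo does) that avoids the explicit .cuda() call is to create the zero on whatever device the classification output already lives on, so the same line works for both CPU and CUDA training:

    # Device-agnostic variant of the fix
    regression_losses.append(torch.tensor(0.0, device=classification.device))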

wvalcke avatar Mar 17 '21 18:03 wvalcke

To be precise, it's line 65 where the fix is needed.

wvalcke avatar Mar 17 '21 19:03 wvalcke

Thank you~

ccl-private avatar Mar 18 '21 01:03 ccl-private