pytorch-ssd

Training on two GPUs

Open AlanStark opened this issue 6 years ago • 5 comments

Hello @qfgaohao,

I am trying to set

DEVICE = torch.device('cuda:0' if torch.cuda.is_available() and args.use_cuda else 'cpu')

for one run and

DEVICE = torch.device('cuda:1' if torch.cuda.is_available() and args.use_cuda else 'cpu')

for another, and to run the two experiments simultaneously. The first one works fine and occupies a reasonable amount of GPU memory. The second does not work, no matter how small the batch size is, even though GPU 1 has enough free memory for another run. Do you have any idea what causes this?

AlanStark, Dec 19 '18 21:12
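The thread does not establish why the second run fails. One common way to run two independent experiments, one per GPU, is to restrict each process to a single GPU with the CUDA_VISIBLE_DEVICES environment variable, so that any code using the default CUDA device cannot touch the other card. The sketch below is only an illustration of that approach; the helper names `env_for_gpu` and `launch_on_gpu` are hypothetical and not part of pytorch-ssd.

```python
import os
import subprocess


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one physical GPU.

    Hypothetical helper, not part of pytorch-ssd.
    """
    env = dict(os.environ if base_env is None else base_env)
    # CUDA applications started with this environment see a single device.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_on_gpu(command, gpu_index):
    """Start `command` as a child process that can only see `gpu_index`."""
    return subprocess.Popen(command, env=env_for_gpu(gpu_index))
```

Each run would be started with something like `launch_on_gpu(["python", "train_ssd.py"], 1)`, with the script's usual arguments appended to the list. Inside each child process the single visible GPU is addressed as 'cuda:0', so DEVICE can stay at 'cuda:0' in both runs.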

@AlanStark I haven't tested the code in a multi-GPU environment. https://github.com/pytorch/examples/tree/master/imagenet may be useful as a reference. Good luck!

qfgaohao, Dec 19 '18 21:12

Hello @AlanStark, I changed a few lines in train_ssd.py to make all GPUs available, and it worked. You can modify the start of the train and test functions like this:

def train(loader, net, criterion, optimizer, device, debug_steps=100, epoch=-1):
    net = nn.DataParallel(net)
    net.train(True)

def test(loader, net, criterion, device):
    net = nn.DataParallel(net)
    net.eval()

CoskunGorkem, Mar 09 '20 10:03

Hi @gorkem7, didn't you get the error below?

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)

I ran into it when I tried your solution for vgg16-ssd training. If you have already solved it, I would like to know how to fix it.

AiueoABC, Apr 20 '20 09:04

Hi @gorkem7, @AiueoABC,

I got the same error in vgg16-ssd training when using net = nn.DataParallel(net):

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)

Did you find a way around it?

donbonjenbi, Sep 22 '20 01:09

> Hello @AlanStark, I have changed some lines in train_ssd.py to make all GPUs available and it worked. You can manipulate the train and testing functions as:
>
> def train(loader, net, criterion, optimizer, device, debug_steps=100, epoch=-1):
>     net = nn.DataParallel(net)
>     net.train(True)
>
> def test(loader, net, criterion, device):
>     net = nn.DataParallel(net)
>     net.eval()

Hi, I followed your instructions and got the same error as above:

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)

Did you encounter this problem?

shiyuetianqiang, Jan 07 '21 12:01
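The thread ends without a confirmed fix. A possible cause of the cudnn_batch_norm device mismatch, not verified here against the pytorch-ssd code, is a layer that forward() reaches through a plain Python list, tuple, or other unregistered reference: nn.DataParallel copies registered submodules to each GPU, but such a reference still points at the original layer on GPU 0. The sketch below shows the general pattern under that assumption, using a hypothetical `TinyNet` as a stand-in for the SSD network and a hypothetical helper `make_parallel`. The model is wrapped once, outside the training function, and every layer is held in an nn.ModuleList.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for the SSD network; not the pytorch-ssd model."""

    def __init__(self):
        super().__init__()
        # nn.ModuleList registers the layers, so DataParallel replicates them.
        # A layer reached only through a plain Python list or tuple is not
        # replaced by its per-GPU copy.
        self.layers = nn.ModuleList([
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.BatchNorm2d(8),
        ])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


def make_parallel(net):
    """Wrap `net` once for multi-GPU use; fall back to one device otherwise."""
    if torch.cuda.device_count() > 1:
        # DataParallel expects the parameters on its first device.
        device = torch.device("cuda:0")
        return nn.DataParallel(net.to(device)), device
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    return net.to(device), device


net, device = make_parallel(TinyNet())
net.train(True)
images = torch.randn(4, 3, 16, 16).to(device)  # inputs go to the first device
out = net(images)
```

With more than one GPU, DataParallel splits the batch across the devices and gathers the outputs on the first one; on a single GPU or on CPU the sketch runs the model unwrapped.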