pytorch-ssd
Training on two GPUs
Hello @qfgaohao,
I am trying to run two experiments simultaneously, one per GPU. For the first run I set
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() and args.use_cuda else 'cpu')
and for the second
DEVICE = torch.device('cuda:1' if torch.cuda.is_available() and args.use_cuda else 'cpu')
The first run works fine and uses a reasonable amount of GPU memory. The second one does not work, no matter how small the batch size is, even though GPU 1 has enough free memory for another run. Do you have any idea what could cause this?
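For reference, one common way to keep two simultaneous runs on separate GPUs is to restrict each process with the CUDA_VISIBLE_DEVICES environment variable instead of editing DEVICE. This is only a sketch and is not confirmed to resolve the problem described above. The helper name env_for_gpu is hypothetical, and the launch command in the comment leaves the script arguments out on purpose.

```python
import os


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment in which only one GPU is visible."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA reads this variable when it initializes. The process then sees the
    # selected physical GPU as device 0, so 'cuda:0' works unchanged in the script.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


# Hypothetical launch, one process per GPU (script arguments omitted):
#   subprocess.Popen([sys.executable, "train_ssd.py", ...], env=env_for_gpu(0))
#   subprocess.Popen([sys.executable, "train_ssd.py", ...], env=env_for_gpu(1))
```

With this approach both runs can keep 'cuda:0' in the code, because each process only sees the single GPU it was given.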
@AlanStark I haven't tested the code in a multi-GPU environment. https://github.com/pytorch/examples/tree/master/imagenet may be useful as a reference. Good luck!
Hello @AlanStark, I changed a few lines in train_ssd.py to make all GPUs available, and it worked. You can modify the train and test functions as follows:
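def train(loader, net, criterion, optimizer, device, debug_steps=100, epoch=-1):
    net = nn.DataParallel(net)  # added: replicate the model across the visible GPUs
    net.train(True)
    ...  # rest of the function unchanged

def test(loader, net, criterion, device):
    net = nn.DataParallel(net)  # added: same wrapping for evaluation
    net.eval()
    ...  # rest of the function unchanged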
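As a variation on the change above, the model can be wrapped once before the training loop instead of inside every call to train and test. The following is a minimal sketch under that assumption, not the repository's code: wrap_for_multi_gpu, unwrap, and the toy model are hypothetical names introduced for illustration. nn.DataParallel expects the parameters to live on the first device it uses, so the model is moved to cuda:0 before wrapping.

```python
import torch
from torch import nn


def wrap_for_multi_gpu(net):
    """Move net to the first GPU and wrap it in DataParallel if several GPUs exist."""
    if torch.cuda.is_available():
        # DataParallel expects parameters and buffers on its first device.
        net = net.to(torch.device("cuda:0"))
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    return net


def unwrap(net):
    """Return the underlying model, e.g. before saving a checkpoint."""
    return net.module if isinstance(net, nn.DataParallel) else net


# Toy stand-in for the detector; on a CPU-only machine nothing is wrapped.
toy = wrap_for_multi_gpu(nn.Sequential(nn.Conv2d(3, 4, 3, padding=1), nn.BatchNorm2d(4)))
device = next(toy.parameters()).device
out = toy(torch.randn(2, 3, 8, 8).to(device))
```

Saving unwrap(net).state_dict() rather than net.state_dict() keeps the checkpoint keys free of the "module." prefix that the wrapper adds.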
Hi @gorkem7, did you not get the error below?
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)
I ran into it when I tried your solution for vgg16-ssd training. If you have already solved it, I would like to know how to fix it.
Hi @gorkem7, @AiueoABC,
I got this same error, in vgg16-ssd training, using net = nn.DataParallel(net):
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)
Did you find a way around it?
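One known way to hit this kind of device mismatch with nn.DataParallel is a layer that forward reaches through a plain Python list or tuple instead of through a registered attribute. Replicas get their own copies of registered submodules, but a plain list is copied by reference, so it still points at the original layer on device 0. Whether this is what happens in vgg16-ssd is an assumption that has not been verified here. The two toy classes below are hypothetical and only illustrate the pattern and a possible workaround.

```python
import torch
from torch import nn


class Risky(nn.Module):
    """forward() reaches the layer through a plain list of (index, module)."""

    def __init__(self):
        super().__init__()
        bn = nn.BatchNorm2d(4)
        self.add_ons = nn.ModuleList([bn])  # registered, so it is replicated per GPU
        self.by_index = [(0, bn)]           # plain list, still the original object

    def forward(self, x):
        for _, layer in self.by_index:      # in a replica this is the device-0 layer
            x = layer(x)
        return x


class Safer(nn.Module):
    """forward() looks the layer up in the registered ModuleList."""

    def __init__(self):
        super().__init__()
        self.add_ons = nn.ModuleList([nn.BatchNorm2d(4)])
        self.positions = [0]                # plain data only: indices into add_ons

    def forward(self, x):
        for pos in self.positions:
            x = self.add_ons[pos](x)        # resolved per replica
        return x


x = torch.randn(2, 4, 8, 8)
risky_out = Risky()(x)  # fine on a single device
safer_out = Safer()(x)
```

On a single device both classes behave the same. The difference would only show up once the module is replicated across GPUs.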
Hi @gorkem7, I followed your instructions and got the same error as above:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)
Did you encounter this problem too?