pytorch-distributed
[Discussion] mp: duplicate of torch.cuda.set_device(local_rank) and images = images.cuda(local_rank, non_blocking=True)
Hi there,
Great repo! I'm studying this topic and found that the official ImageNet classification example also uses multiprocessing.
I noticed one place where they not only use
torch.cuda.set_device(local_rank) # L144
but also pass the specific GPU id everywhere (their args.gpu refers to the local rank):
model.cuda(args.gpu) # L145
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu]) # L151
criterion = nn.CrossEntropyLoss().cuda(args.gpu) # L169
loc = 'cuda:{}'.format(args.gpu) # L183
checkpoint = torch.load(args.resume, map_location=loc)
if args.gpu is not None: # L282
    images = images.cuda(args.gpu, non_blocking=True)
    target = target.cuda(args.gpu, non_blocking=True)
This seems redundant to me. Do you have any idea why they do it this way?
And the documentation for torch.cuda.set_device says:
"Usage of this function is discouraged in favor of device. In most cases it’s better to use CUDA_VISIBLE_DEVICES environmental variable."
Also, I noticed that even when using mp, sometimes I cannot kill all the processes with Ctrl+D and have to kill them individually by PID. Have you ever run into this problem?
Thank you!
Of course. If you call torch.cuda.set_device(local_rank), it is okay to use model.cuda() instead of model.cuda(args.gpu). The official repo simply keeps the old style of best practice from previous versions.
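For instance, a condensed sketch of that pattern on a single node might look like this (it assumes MASTER_ADDR/MASTER_PORT are already exported by the launcher and uses a placeholder model and batch):
import torch
import torch.distributed as dist
import torch.nn as nn

def main_worker(local_rank, world_size):
    dist.init_process_group("nccl", rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)           # every later .cuda() defaults to this GPU
    model = nn.Linear(10, 10).cuda()            # no explicit device id needed
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    criterion = nn.CrossEntropyLoss().cuda()
    images = torch.randn(8, 10).cuda(non_blocking=True)
    target = torch.randint(0, 10, (8,)).cuda(non_blocking=True)
    loss = criterion(model(images), target)
    loss.backward()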
And what do you mean by "cannot kill all the processes by Ctrl+D"? Everything works fine for me.
Thank you! I mean that when I want to terminate training, I press Ctrl+D in the terminal. Sometimes one process is still left occupying the GPU, and then I have to kill that process by its PID.
Me too... Do you have any advice?
I finally ran into a situation today where the GPU memory was not cleared after using Ctrl+C to kill the processes. It happened when I was using a custom CUDA module. If you run into the same situation, you can refer to https://github.com/pytorch/fairseq/issues/487, which works for me (even though it does not look elegant).
I will try to find the cause of this problem, but it does not seem easy to track down...
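For what it's worth, a minimal sketch of one way to have the parent explicitly clean up its workers on Ctrl+C when launching with torch.multiprocessing.spawn (the worker body and GPU count are placeholders):
import os
import signal
import torch.multiprocessing as mp

def main_worker(local_rank, world_size):
    ...  # placeholder training loop

if __name__ == "__main__":
    world_size = 4  # placeholder number of GPUs
    # join=False returns a ProcessContext, which exposes the child PIDs
    ctx = mp.spawn(main_worker, args=(world_size,), nprocs=world_size, join=False)
    try:
        ctx.join()
    except KeyboardInterrupt:
        # signal every child explicitly so none is left holding GPU memory
        for pid in ctx.pids():
            os.kill(pid, signal.SIGTERM)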