
[Discussion] mp: duplicate of torch.cuda.set_device(local_rank) and images = images.cuda(local_rank, non_blocking=True)

Open laoreja opened this issue 4 years ago • 4 comments

Hi there,

Great repo! I'm studying this topic and found that the official ImageNet classification example also uses multiprocessing.

I noticed that they not only call torch.cuda.set_device(local_rank) (L144) but also pass the specific GPU id everywhere (their args.gpu refers to the local rank):

model.cuda(args.gpu)  # L145
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])  # L151
criterion = nn.CrossEntropyLoss().cuda(args.gpu)  # L169

loc = 'cuda:{}'.format(args.gpu)  # L183
checkpoint = torch.load(args.resume, map_location=loc)

if args.gpu is not None:  # L282
    images = images.cuda(args.gpu, non_blocking=True)
target = target.cuda(args.gpu, non_blocking=True)

This seems a bit redundant. I'm wondering if you have any idea why they do both?

Also, the documentation for torch.cuda.set_device says: "Usage of this function is discouraged in favor of device. In most cases it's better to use CUDA_VISIBLE_DEVICES environmental variable."
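If I understand that correctly, the environment-variable approach in a spawned worker would look roughly like this (just my sketch, the worker signature is made up; CUDA_VISIBLE_DEVICES has to be set before the process touches CUDA at all):

import os
import torch

def worker(local_rank, world_size):
    # With the spawn start method the child has not initialized CUDA yet,
    # so restricting visibility here makes the chosen GPU appear as cuda:0.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(local_rank)
    model = torch.nn.Linear(10, 10).to('cuda:0')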

Also, I noticed that even when using mp, sometimes I cannot kill all the processes with Ctrl+D and have to kill them individually by their PIDs. Not sure if you have ever run into this problem?

Thank you!

laoreja avatar Jun 26 '20 12:06 laoreja

Of course, if you call torch.cuda.set_device(local_rank), it is fine to use model.cuda() instead of model.cuda(args.gpu). The official repo just keeps the older style of best practice from previous versions.
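So the whole setup can be reduced to something like this (just a sketch; main_worker, the init_method address, and the toy model are placeholders, assuming a single node with one process per GPU and local_rank indexing the GPU):

import torch
import torch.distributed as dist
import torch.nn as nn

def main_worker(local_rank, world_size):
    dist.init_process_group('nccl', init_method='tcp://127.0.0.1:23456',
                            world_size=world_size, rank=local_rank)
    # Bind this process to its GPU once...
    torch.cuda.set_device(local_rank)
    # ...after that, plain .cuda() (and tensor.cuda(non_blocking=True) in the
    # training loop) already targets the right device, no explicit index needed.
    model = nn.Linear(10, 10).cuda()
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    criterion = nn.CrossEntropyLoss().cuda()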

And what do you mean by 'cannot kill all the processes by Ctrl+D'? Everything goes well for me.

tczhangzhi avatar Jun 27 '20 10:06 tczhangzhi

Thank you! I mean that when I want to terminate the training, I use Ctrl+D in the terminal. Sometimes one process is still left occupying the GPU, and then I have to kill it by its PID.

laoreja avatar Jun 28 '20 00:06 laoreja

> Thank you! I mean that when I want to terminate the training, I use Ctrl+D in the terminal. Sometimes one process is still left occupying the GPU, and then I have to kill it by its PID.

Me too... Do you have any advice?

lartpang avatar Jul 03 '20 03:07 lartpang

Today I finally ran into a situation where the GPU memory was not cleared after using Ctrl+C to kill the process. It happened when I used a custom CUDA module. If you encounter the same situation, maybe you can refer to https://github.com/pytorch/fairseq/issues/487, which works for me (even though it does not look elegant).

I will try to find the cause of this problem, but it does not seem easy to locate...
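In the meantime, a blunt workaround is to keep handles on the spawned workers and terminate them explicitly when the launcher is interrupted. A rough sketch (my own, not the exact fix from the fairseq issue; launch and worker_fn are placeholders):

import torch.multiprocessing as mp

def launch(worker_fn, world_size):
    ctx = mp.spawn(worker_fn, args=(world_size,), nprocs=world_size, join=False)
    try:
        # join(timeout) returns False while children are still alive.
        while not ctx.join(timeout=5):
            pass
    except KeyboardInterrupt:
        # Send SIGTERM to every child so each one releases its GPU memory,
        # then wait for them to exit.
        for p in ctx.processes:
            if p.is_alive():
                p.terminate()
        for p in ctx.processes:
            p.join()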

tczhangzhi avatar Jul 11 '20 08:07 tczhangzhi