Update subprocess.py

Open Constannnnnt opened this issue 6 years ago • 1 comments

Description: I trained the network on 3 GPUs and interrupted training before the final step, so only the checkpoint was saved, not the config. When I then tried to test the model, it failed unexpectedly at start = subinds[i][0] with the error "list index out of range".

Issue: At line 64, instead of gpu_inds = range(cfg.NUM_GPUS), I think it is more reasonable to write gpu_inds = range(NUM_GPUS). Let me explain.

After importing the yaml and config file in subprocess.py, cfg.NUM_GPUS is 8 instead of 3. (In train_net_step there is a statement that assigns cfg.NUM_GPUS = torch.cuda.device_count(), so training does not crash.) In subprocess.py, NUM_GPUS = torch.cuda.device_count() = 3 in my case, so at line 56 the size of subinds is 3.

I let CUDA see all my GPUs. Later, at line 64, if gpu_inds = range(cfg.NUM_GPUS) is used, the size of gpu_inds is 8, so the loop runs past the 3 entries of subinds and crashes at line 68. Therefore, at line 64, gpu_inds = range(NUM_GPUS) is much more reasonable.
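
To make the mismatch concrete, here is a minimal sketch that reproduces it without torch or the repository code. The helpers split_range and assign_ranges are hypothetical stand-ins written for this illustration (split_range only imitates an even split of the data range), and the values 3 and 8 are hard-coded in place of torch.cuda.device_count() and cfg.NUM_GPUS. It is not the actual content of subprocess.py.

```python
def split_range(total, num_chunks):
    """Split range(total) into num_chunks nearly equal contiguous chunks.

    Hypothetical stand-in for the even split of the data range.
    """
    base, extra = divmod(total, num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def assign_ranges(total, device_count, gpu_inds):
    """Pair each GPU index with a (start, end) slice of the data range."""
    subinds = split_range(total, device_count)  # one chunk per visible GPU
    ranges = []
    for i, gpu_ind in enumerate(gpu_inds):
        start = subinds[i][0]  # raises IndexError once i >= device_count
        end = subinds[i][-1] + 1
        ranges.append((gpu_ind, start, end))
    return ranges


device_count = 3    # assumed value of torch.cuda.device_count()
stale_cfg_gpus = 8  # assumed stale value of cfg.NUM_GPUS

# Iterating over 8 GPU indices while only 3 chunks exist fails.
try:
    assign_ranges(100, device_count, range(stale_cfg_gpus))
    mismatch_failed = False
except IndexError:
    mismatch_failed = True

# Deriving the GPU indices from the same count as the split works.
fixed = assign_ranges(100, device_count, range(device_count))
```

In this sketch the failure goes away as soon as the GPU index list and the number of chunks come from the same value, which is the point of the proposed change.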

Please check and see if my solution is correct or not. Thanks.

Constannnnnt · Jul 06 '18 04:07

:+1:

ternaus · Sep 30 '18 03:09