Detectron.pytorch
Update subprocess.py
Description: I trained the network on 3 GPUs and interrupted training before the final step, so only the checkpoint was saved, not the config. When I then tried to test the model, it failed unexpectedly with the error start = subinds[i][0], list index out of range.
Issue: At line 64, instead of gpu_inds = range(cfg.NUM_GPUS), I think it is more reasonable to write gpu_inds = range(NUM_GPUS). Let me explain.
After subprocess.py imports the yaml and config file, cfg.NUM_GPUS is 8 instead of 3. (In train_net_step there is a statement that assigns cfg.NUM_GPUS = torch.cuda.device_count(), so training does not crash.) In my case NUM_GPUS = torch.cuda.device_count() = 3, so at line 56 the size of subinds is 3.
I let CUDA see all my GPUs. Later, at line 64, if gpu_inds = range(cfg.NUM_GPUS) is used, the size of gpu_inds is 8, which then crashes at line 68 because subinds has only 3 entries. Therefore, gpu_inds = range(NUM_GPUS) at line 64 is more reasonable.
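To make the mismatch concrete, here is a minimal, self-contained sketch of the failure. It is not the code from subprocess.py: split_range and assign_ranges are hypothetical helpers that only mimic the behaviour described above (3 chunks in subinds, GPU indices taken from either the config value or the detected count), and the total of 100 items is an arbitrary assumption.

```python
def split_range(total, num_chunks):
    """Split range(total) into num_chunks nearly equal contiguous chunks.

    Hypothetical stand-in for the splitting step that produces subinds.
    """
    base, extra = divmod(total, num_chunks)
    chunks, start = [], 0
    for k in range(num_chunks):
        size = base + (1 if k < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def assign_ranges(total, num_visible_gpus, gpu_inds):
    """Pair each GPU index with a (start, end) range of work items."""
    subinds = split_range(total, num_visible_gpus)
    ranges = []
    for i, gpu_ind in enumerate(gpu_inds):
        start = subinds[i][0]  # raises IndexError once i >= len(subinds)
        end = subinds[i][-1] + 1
        ranges.append((gpu_ind, start, end))
    return ranges


visible_gpus = 3  # what torch.cuda.device_count() returned in my case
cfg_num_gpus = 8  # value loaded from the config (assumed stale)

# Proposed behaviour: GPU indices come from the detected GPU count.
fixed = assign_ranges(100, visible_gpus, range(visible_gpus))

# Current behaviour: GPU indices come from the config value.
try:
    assign_ranges(100, visible_gpus, range(cfg_num_gpus))
    failed = False
except IndexError:
    failed = True
```

With the detected count, every GPU index has a matching chunk. With the config value, the fourth index has no chunk to read from, which matches the list index out of range error in the description.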
Please check whether my solution is correct. Thanks.