SSH-pytorch
About multi-GPU training
Hello, I tried to add data parallelism in train.py (because train_dist.py does not really work on my cluster) by wrapping the net in DataParallel and moving the optimizer construction after it:
net.to(device)
net = torch.nn.DataParallel(net, device_ids=[0, 1, 2, 3, 4, 6, 7, 8])
optimizer = optim.SGD(net.parameters(), lr=cfg.TRAIN.LEARNING_RATE,
                      momentum=cfg.TRAIN.MOMENTUM, weight_decay=cfg.TRAIN.WEIGHT_DECAY)
net.train()
train(net, optimizer, imdb, roidb, arg)
and commented out:
#assert len(str(arg.gpu_ids)) == 1, "only single gpu is supported, " \
# "use train_dist.py for multiple gpu support"
# os.environ['CUDA_VISIBLE_DEVICES'] = str(arg.gpu_ids)
but it still runs only on device 0.
What is wrong with my code? Thanks!
Hi yaoing, DataParallel in PyTorch works as follows: it splits the input across the specified devices by chunking along the batch dimension (other objects are copied once per device).
The batch size should therefore be larger than the number of GPUs used.
Currently my code does not support any batch size other than one, so you will not be able to use DataParallel. If you want to use DataParallel, you need to modify the data loader and the anchor layer to support multiple batches. If you have a cluster and want to train on multiple GPUs, try distributed training instead; look at the example in train_dist.sh.
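For illustration, here is a minimal, self-contained sketch of how DataParallel splits a batch across GPUs; TinyNet and the device list are made-up placeholders, not the SSH-pytorch network:

import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical stand-in for the SSH network
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)

device = torch.device("cuda:0")
net = TinyNet().to(device)
# DataParallel replicates the module on each listed GPU and chunks the input
# along the batch dimension, so batch_size should be >= len(device_ids).
net = nn.DataParallel(net, device_ids=[0, 1])  # assumes at least 2 visible GPUs

x = torch.randn(8, 3, 224, 224, device=device)  # batch of 8 -> 4 images per GPU
out = net(x)  # outputs are gathered back onto cuda:0

With a batch size of one, as in this repo's data loader, there is only a single chunk to hand out, which is why the work still ends up on device 0.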
OK, I see. Thanks!
Hello dechunwang, sorry to bother you again.
Have you provided any saved model for us to test?
I also hit a bug when saving checkpoints on my CentOS cluster:
THCudaCheck FAIL file=/pytorch/torch/csrc/generic/serialization.cpp line=15 error=30 : unknown error
Traceback (most recent call last):
  File "train.py", line 222, in <module>
    train(net, optimizer, imdb, roidb, arg)
  File "train.py", line 183, in train
    print("check point saved")
  File "/home/yao/apps/SSH-pytorch/model/network.py", line 116, in save_check_point
    }, path)
  File "/home/yao/anaconda3/envs/yao/lib/python3.6/site-packages/torch/serialization.py", line 218, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/yao/anaconda3/envs/yao/lib/python3.6/site-packages/torch/serialization.py", line 143, in _with_file_like
    return body(f)
  File "/home/yao/anaconda3/envs/yao/lib/python3.6/site-packages/torch/serialization.py", line 218, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/yao/anaconda3/envs/yao/lib/python3.6/site-packages/torch/serialization.py", line 297, in _save
    serialized_storages[key]._write_file(f, _should_read_directly(f))
RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/csrc/generic/serialization.cpp:15
At the same time, I can save other models that I wrote myself without any problem.
Please check whether the model save directory exists. I am traveling right now; I will upload the model as soon as I get back. But you should be able to get the same results with 4 GPUs at 22000 iterations. When you evaluate, set the threshold to 0.05.
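For reference, here is a minimal sketch of creating the checkpoint directory before saving; the function name and path below are hypothetical and not necessarily what model/network.py actually uses:

import os
import torch

def save_check_point(state, path):
    # torch.save fails if the parent directory does not exist, so create it first.
    dir_name = os.path.dirname(path)
    if dir_name:
        os.makedirs(dir_name, exist_ok=True)
    torch.save(state, path)

# Hypothetical usage:
# save_check_point({"model": net.state_dict(), "optimizer": optimizer.state_dict()},
#                  "output/ssh/check_point_22000.pth")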
Bless you, may a good person like you have a lifetime of peace!
https://drive.google.com/file/d/19bmuol6CbSqL3pj9SBzUL6UhrxC3XYbC/view
Sorry for the late reply.
Hi, if you set the threshold to 0.05, there will be many wrong bounding boxes which cannot be used in practice.
This is a very common threshold setting for the WIDER FACE benchmark; it boosts recall. The official SSH repo also uses this setting.
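As a rough illustration of what that threshold does at evaluation time, here is a minimal sketch with made-up arrays (not the repo's evaluation code): detections scoring below 0.05 are dropped, and everything else is kept so that recall on WIDER FACE stays high.

import numpy as np

def filter_detections(boxes, scores, thresh=0.05):
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidence scores.
    keep = scores >= thresh
    return boxes[keep], scores[keep]

boxes = np.array([[10, 10, 50, 50], [5, 5, 20, 20]], dtype=np.float32)
scores = np.array([0.80, 0.03], dtype=np.float32)
kept_boxes, kept_scores = filter_detections(boxes, scores)  # only the 0.80 box survives

The low threshold keeps many low-confidence boxes so the precision-recall curve used by the WIDER FACE evaluation covers the full score range; for a practical deployment you would raise the threshold.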
OK, I understand, thank you. So when testing on WIDER FACE, what we care about is recall. When I test MTCNN, can the threshold also be set very low, like 0.05?