SSH-pytorch
About multi-GPU training
Hello, I tried to add data parallelism in train.py (because train_dist.py does not really work on my cluster) by wrapping the net in DataParallel and moving the optimizer construction after it:
net.to(device)
net = torch.nn.DataParallel(net, device_ids=[0, 1, 2, 3, 4, 6, 7, 8])
optimizer = optim.SGD(net.parameters(), lr=cfg.TRAIN.LEARNING_RATE,
                      momentum=cfg.TRAIN.MOMENTUM, weight_decay=cfg.TRAIN.WEIGHT_DECAY)
net.train()
train(net, optimizer, imdb, roidb, arg)
and commented out:
#assert len(str(arg.gpu_ids)) == 1, "only single gpu is supported, " \
# "use train_dist.py for multiple gpu support"
# os.environ['CUDA_VISIBLE_DEVICES'] = str(arg.gpu_ids)
but it still runs only on device 0.
What is wrong with my code? Thanks!
Hi yaoing, DataParallel in PyTorch works as follows: it splits the input across the specified devices by chunking along the batch dimension (other objects are copied once per device).
The batch size should therefore be larger than the number of GPUs used.
Currently my code does not support any batch size other than one, so you will not be able to use DataParallel. If you want to use DataParallel, you need to modify the data loader and the anchor layer to support multiple batches. If you have a cluster and want to train on multiple GPUs, try distributed training instead; look at the example in train_dist.sh.
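For illustration, here is a minimal, self-contained sketch of how DataParallel splits a batch across GPUs; TinyNet and the device list are made-up placeholders, not the SSH-pytorch network:

import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical stand-in for the SSH network
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)

device = torch.device("cuda:0")
net = TinyNet().to(device)
# DataParallel replicates the module on each listed GPU and chunks the input
# along the batch dimension, so batch_size should be >= len(device_ids).
net = nn.DataParallel(net, device_ids=[0, 1])  # assumes at least 2 visible GPUs

x = torch.randn(8, 3, 224, 224, device=device)  # batch of 8 -> 4 images per GPU
out = net(x)  # outputs are gathered back onto cuda:0

With a batch size of one, as in this repo's data loader, there is only a single chunk to hand out, which is why the work still ends up on device 0.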
OK, I see. Thanks!
Hello dechunwang, sorry to bother you again.
Have you provided any saved model for us to test?
I also hit a bug when saving checkpoints on my CentOS cluster:
THCudaCheck FAIL file=/pytorch/torch/csrc/generic/serialization.cpp line=15 error=30 : unknown error
Traceback (most recent call last):
  File "train.py", line 222, in <module>
    train(net, optimizer, imdb, roidb, arg)
  File "train.py", line 183, in train
    print("check point saved")
  File "/home/yao/apps/SSH-pytorch/model/network.py", line 116, in save_check_point
    }, path)
  File "/home/yao/anaconda3/envs/yao/lib/python3.6/site-packages/torch/serialization.py", line 218, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/yao/anaconda3/envs/yao/lib/python3.6/site-packages/torch/serialization.py", line 143, in _with_file_like
    return body(f)
  File "/home/yao/anaconda3/envs/yao/lib/python3.6/site-packages/torch/serialization.py", line 218, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/yao/anaconda3/envs/yao/lib/python3.6/site-packages/torch/serialization.py", line 297, in _save
    serialized_storages[key]._write_file(f, _should_read_directly(f))
RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/csrc/generic/serialization.cpp:15
At the same time, I can save other models that I wrote myself without any problem.
Please check whether the model save directory exists. I am traveling right now; I will upload the model as soon as I get back. But you should be able to get the same results with 4 GPUs at 22000 iterations. When you evaluate, set the threshold to 0.05.
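For reference, here is a minimal sketch of creating the checkpoint directory before saving; the function name and path below are hypothetical and not necessarily what model/network.py actually uses:

import os
import torch

def save_check_point(state, path):
    # torch.save fails if the parent directory does not exist, so create it first.
    dir_name = os.path.dirname(path)
    if dir_name:
        os.makedirs(dir_name, exist_ok=True)
    torch.save(state, path)

# Hypothetical usage:
# save_check_point({"model": net.state_dict(), "optimizer": optimizer.state_dict()},
#                  "output/ssh/check_point_22000.pth")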
Bless you, may a good person like you have a lifetime of peace!
https://drive.google.com/file/d/19bmuol6CbSqL3pj9SBzUL6UhrxC3XYbC/view
Sorry for the late reply.
Hi, if you set the threshold to 0.05, there will be many wrong bounding boxes which cannot be used in practice.
This is a very common threshold setting for the WIDER FACE benchmark; it boosts recall. The official SSH repo also uses this setting.
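As a rough illustration of what that threshold does at evaluation time, here is a minimal sketch with made-up arrays (not the repo's evaluation code): detections scoring below 0.05 are dropped, and everything else is kept so that recall on WIDER FACE stays high.

import numpy as np

def filter_detections(boxes, scores, thresh=0.05):
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidence scores.
    keep = scores >= thresh
    return boxes[keep], scores[keep]

boxes = np.array([[10, 10, 50, 50], [5, 5, 20, 20]], dtype=np.float32)
scores = np.array([0.80, 0.03], dtype=np.float32)
kept_boxes, kept_scores = filter_detections(boxes, scores)  # only the 0.80 box survives

The low threshold keeps many low-confidence boxes so the precision-recall curve used by the WIDER FACE evaluation covers the full score range; for a practical deployment you would raise the threshold.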
OK, I understand, thank you. So when testing on WIDER FACE, what we care about is recall. When I test MTCNN, can the threshold also be set very low, like 0.05?