STM-Training
How to implement multi-GPU training
I've tried to implement data parallelism for multi-GPU training, but it doesn't work: the model only runs on my first GPU. When I replace the model with a plain ResNet that has no extra tensor operations, training works normally, so perhaps those tensor operations are breaking data parallelism. Could you tell me how to make the model parallel with DataParallel or DistributedDataParallel?
Hello, sorry to disturb you. I ran into the same problem. Have you solved it yet?
nope
@zzc000930 Hi, if you use nn.DataParallel for multi-GPU training, please ensure that (1) the batch size is divisible by the number of GPUs, (2) the second dimension of the data (the number of objects) is the same for every sample in the batch, which can be achieved by randomly sampling objects and padding the unused object slots with zeros, and (3) the variable num_objects is wrapped in a tensor before being passed to the model. An example of nn.DataParallel usage can be found in our basic baseline branch; a small sketch illustrating these three points is shown below. Hope this helps.
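For illustration only, here is a minimal sketch of the three points above, using a toy stand-in model (not the repository's actual model or API): the batch size is a multiple of the GPU count, every sample is padded to the same object dimension, and num_objects is a tensor so DataParallel can scatter it per sample.

```python
import torch
import torch.nn as nn

class ToyVOSModel(nn.Module):
    """Stand-in for the real model: takes frames, masks, and a per-sample object count."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, frames, masks, num_objects):
        # num_objects arrives as a tensor slice, one entry per sample on this replica
        feat = self.conv(frames)
        return feat.mean() + masks.sum() * 0 + num_objects.float().mean() * 0

num_gpus = max(torch.cuda.device_count(), 1)
batch_size = 2 * num_gpus                         # (1) batch size divisible by the GPU count
max_objects = 3                                   # (2) same object dimension for every sample

frames = torch.randn(batch_size, 3, 64, 64)
masks = torch.zeros(batch_size, max_objects, 64, 64)   # pad unused object slots with zeros
num_objects = torch.LongTensor([2] * batch_size)        # (3) wrap the object count in a tensor

model = ToyVOSModel()
if torch.cuda.is_available():
    model = nn.DataParallel(model).cuda()          # replicas split the batch dimension across GPUs
    frames, masks, num_objects = frames.cuda(), masks.cuda(), num_objects.cuda()

out = model(frames, masks, num_objects)
```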
As for distributed training, we are working on it to make multi-GPU training more flexible; we will commit it once it is finished.
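In the meantime, a generic DistributedDataParallel setup looks roughly like the sketch below. This is a general-purpose example with a toy model, not the repository's planned implementation; it assumes launching via `torchrun --nproc_per_node=<num_gpus> train.py`, which sets the RANK/LOCAL_RANK environment variables.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # reads rank/world size from torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda(local_rank)  # toy placeholder model
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    data = torch.randn(4, 3, 64, 64).cuda(local_rank)   # each process trains on its own shard

    loss = model(data).mean()
    loss.backward()                                      # gradients are all-reduced across GPUs
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```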