
How to implement multi-GPU training

nku-zhichengzhang opened this issue 3 years ago · 3 comments

I've tried to implement data-parallel training on multiple GPUs, but it doesn't work: the model only runs on my first GPU. When I replace the model with a plain ResNet that has no extra tensor operations, it trains normally, so maybe those tensor operations break data parallelism. Could you tell me how to make the model run in parallel with DataParallel or DistributedDataParallel?

nku-zhichengzhang avatar May 25 '21 01:05 nku-zhichengzhang

Hello, sorry to disturb you. I've run into the same problem. Have you solved it yet?

upxinxin avatar Jun 17 '21 01:06 upxinxin

Hello, sorry to disturb you. I've run into the same problem. Have you solved it yet?

nope

nku-zhichengzhang avatar Jun 17 '21 01:06 nku-zhichengzhang

@zzc000930 Hi, if you use nn.DataParallel for multi-GPU training, please ensure that (1) the batch size is divisible by the number of GPUs, (2) the second dimension of the data (the number of objects) is the same across the batch, which can be achieved by randomly sampling objects and zero-padding the missing ones, and (3) the variable num_objects is wrapped in a tensor before being passed to the model. An example of nn.DataParallel usage can be found in our basical baseline branch; hope this helps.
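
For reference, here is a minimal sketch of those three conditions with nn.DataParallel. The model and variable names below are illustrative placeholders, not the actual STM-Training code:

```python
import torch
import torch.nn as nn

# Toy stand-in for an STM-like model (illustrative only, not the repo's API).
class ToyMaskModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, frames, masks, num_objects):
        # frames:      (B, 3, H, W)
        # masks:       (B, K, H, W)  -- K is fixed (padded) across the batch
        # num_objects: (B,) tensor   -- condition (3): a tensor, so DataParallel
        #                               can scatter it along the batch dimension
        feat = self.backbone(frames)
        return feat.mean(dim=(1, 2, 3)) + num_objects.float()

assert torch.cuda.is_available()
num_gpus = torch.cuda.device_count()
model = nn.DataParallel(ToyMaskModel()).cuda()

# Condition (1): batch size divisible by the number of GPUs.
batch_size = 4 * num_gpus

# Condition (2): pad the object dimension to a fixed K so every sample in the
# batch has the same shape (zero masks for the "missing" objects).
K, H, W = 3, 64, 64
frames = torch.randn(batch_size, 3, H, W)
masks = torch.zeros(batch_size, K, H, W)
counts = [2, 1, 3, 2] * num_gpus                # true object count per sample
for i, n in enumerate(counts):
    masks[i, :n] = torch.rand(n, H, W)

# Condition (3): pass num_objects as a tensor; a plain Python int or list is
# not split by DataParallel, so every replica would see the full value.
num_objects = torch.tensor(counts)

out = model(frames.cuda(), masks.cuda(), num_objects.cuda())
print(out.shape)   # torch.Size([batch_size]); work is scattered across GPUs
```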

As for distributed training, we are working on supporting it to make multi-GPU training more flexible, and we will commit it once it is finished.

lyxok1 avatar Jun 22 '21 15:06 lyxok1