Multiple gpus training

Open alessiapacca opened this issue 4 years ago • 9 comments

One question: whenever I run the training I use a single GPU, because when I tried to use more than one, the GPU usage was always at 0%. Does the code work when all the GPUs are set to "EXCLUSIVE" mode?

alessiapacca avatar Nov 30 '20 14:11 alessiapacca

Have you specified device_ids?

AliaksandrSiarohin avatar Nov 30 '20 14:11 AliaksandrSiarohin
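
A note on `device_ids` for anyone debugging this: the sketch below assumes the training script is launched with a `CUDA_VISIBLE_DEVICES` environment variable plus a comma-separated `--device_ids` flag. Under that assumption, the ids passed to the script are logical indices into the visible list, not physical GPU numbers. The helper names `parse_device_ids` and `resolve_physical_gpus` are hypothetical and not part of the repository.

```python
from typing import List, Optional


def parse_device_ids(arg: str) -> List[int]:
    """Turn a string such as "0,1,2,3" into [0, 1, 2, 3]."""
    return [int(part) for part in arg.split(",") if part.strip()]


def resolve_physical_gpus(device_ids: List[int], visible: Optional[str]) -> List[str]:
    """Map logical device ids to the entries of CUDA_VISIBLE_DEVICES.

    visible=None means the variable is unset, so logical ids equal physical ids.
    """
    if visible is None:
        return [str(i) for i in device_ids]
    visible_list = [v.strip() for v in visible.split(",") if v.strip()]
    resolved = []
    for i in device_ids:
        if not 0 <= i < len(visible_list):
            raise ValueError(
                f"device id {i} is out of range: only {len(visible_list)} GPU(s) are visible"
            )
        resolved.append(visible_list[i])
    return resolved


if __name__ == "__main__":
    # With CUDA_VISIBLE_DEVICES=2,3 the logical ids 0,1 refer to physical GPUs 2 and 3.
    print(resolve_physical_gpus(parse_device_ids("0,1"), "2,3"))
```

If the ids requested exceed the number of visible GPUs, the helper raises instead of silently falling back, which makes a mismatch between the two settings easier to spot.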

@AliaksandrSiarohin yes, I have always used the command from the README. It may be a problem with the server I run it on, but I just wanted to understand whether the code works when the GPUs are in exclusive mode, as that may be the issue with the server.

alessiapacca avatar Nov 30 '20 14:11 alessiapacca

Sorry, I have no idea what that exclusive mode means. You could check whether a simple multi-GPU CIFAR example works for you, and then whether a simple CIFAR example with synchronized BN works.

AliaksandrSiarohin avatar Nov 30 '20 17:11 AliaksandrSiarohin
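
For context on the "exclusive" mode mentioned above: NVIDIA GPUs have a compute mode setting, and in `Exclusive_Process` mode only one process at a time can hold a context on a given GPU. A minimal sketch for checking the current mode of each GPU is shown below. It assumes `nvidia-smi` is on the `PATH` and that its CSV output looks like `0, Default`; the function names are hypothetical.

```python
import subprocess
from typing import Dict

# Query the index and compute mode of every GPU as header-less CSV.
QUERY = ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"]


def parse_compute_modes(csv_text: str) -> Dict[int, str]:
    """Parse lines such as "0, Default" into {0: "Default"}."""
    modes = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def query_compute_modes() -> Dict[int, str]:
    """Run nvidia-smi and return the compute mode per GPU index."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_modes(result.stdout)


if __name__ == "__main__":
    try:
        for index, mode in sorted(query_compute_modes().items()):
            print(f"GPU {index}: {mode}")
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print(f"Could not query nvidia-smi: {err}")
```

If a GPU reports an exclusive mode and another process already holds it, a second process trying to use that GPU would be expected to fail, so it is worth checking what else is running on the server.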

@alessiapacca could you please share which dataset you managed to train the network on? And was it with Python 3.7.5?

Mathilda88 avatar Dec 01 '20 13:12 Mathilda88

@Eliot04 hey, I trained on the Vox dataset and used Python 3.6.4.

alessiapacca avatar Dec 01 '20 13:12 alessiapacca

@alessiapacca Super helpful. Thanks.

Mathilda88 avatar Dec 01 '20 15:12 Mathilda88

@alessiapacca I tried using distributed data parallel to accelerate the training, and it seems to be working. Maybe you can try this too. (Synchronized BatchNorm may have problems when distributed data parallel is used, though; I did not test it.)

SystemErrorWang avatar Jan 13 '21 06:01 SystemErrorWang

@SystemErrorWang How do I use distributed data parallel to accelerate the training? I wrapped the model and datasets for DDP, but it does not seem to be working: the GPU usage is always at 0%.

Qia98 avatar Aug 23 '23 02:08 Qia98

> @SystemErrorWang How do I use distributed data parallel to accelerate the training? I wrapped the model and datasets for DDP, but it does not seem to be working: the GPU usage is always at 0%.

I modified the code based on this repo: https://github.com/rosinality/stylegan2-pytorch. I adopted the DDP part of the StyleGAN2 code and combined it with the First Order Motion Model training code. It takes some time to read through the code, and unfortunately my previous code is gone because I have changed jobs since then. I believe it's practical and not difficult. Good luck!

SystemErrorWang avatar Aug 23 '23 05:08 SystemErrorWang
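
Since the original DDP code is no longer available, here is a minimal, generic sketch of the pattern described above. It is not the modified First Order Motion Model code: the toy model, the random dataset, and the function name `run_worker` are placeholders chosen for illustration. It assumes a launch via `torchrun`, which sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`, and falls back to a single CPU process when those variables are absent. `torch.nn.SyncBatchNorm` is PyTorch's built-in synchronized batch norm; whether it can stand in for the synchronized BN used in the repository was not tested in this thread.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def run_worker(steps: int = 2) -> float:
    """Train a toy model under DDP and return the last loss value."""
    # torchrun sets these variables; the defaults allow a single-process run.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    use_cuda = torch.cuda.is_available()
    dist.init_process_group("nccl" if use_cuda else "gloo", rank=rank, world_size=world_size)
    try:
        if use_cuda:
            # Each process drives exactly one GPU.
            torch.cuda.set_device(local_rank)
            device = torch.device("cuda", local_rank)
        else:
            device = torch.device("cpu")

        # Toy stand-in for the real networks (placeholder, not the repo's model).
        model = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 1))
        if use_cuda:
            # SyncBatchNorm only supports GPU tensors, so convert only on CUDA.
            model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
        model = DDP(model.to(device), device_ids=[local_rank] if use_cuda else None)

        # The sampler gives each process a distinct shard of the dataset.
        dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
        loader = DataLoader(dataset, batch_size=16, sampler=sampler)

        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_value = float("nan")
        for epoch in range(steps):
            sampler.set_epoch(epoch)  # reshuffle differently each epoch
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                loss = nn.functional.mse_loss(model(x), y)
                optimizer.zero_grad()
                loss.backward()  # DDP averages gradients across processes here
                optimizer.step()
                loss_value = loss.item()
        return loss_value
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(run_worker())
```

Saved as, say, `ddp_sketch.py` (a hypothetical file name), this could be launched on two GPUs with `torchrun --nproc_per_node=2 ddp_sketch.py`. If GPU usage stays at 0% with a setup like this, it is worth confirming that each process actually moves its model and batches to its own `cuda:LOCAL_RANK` device.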