PConv-Keras

Questions about multi-GPU training

Open xhh232018 opened this issue 6 years ago • 13 comments

Hi, since training takes quite a long time, I would like to know how I can use keras.utils.multi_gpu_model.

xhh232018 avatar Jul 05 '18 11:07 xhh232018

Hello xhh232018, have you successfully trained the network?

TrinhQuocNguyen avatar Jul 10 '18 01:07 TrinhQuocNguyen

Hello TrinhQuoc, I have emailed you about my latest training results.

xhh232018 avatar Jul 11 '18 02:07 xhh232018

Hi xhh232018, thank you. I am currently testing it and modifying the source for my own masks 😄. It's running, but it is going to take some time to retrain the networks and validate them.

TrinhQuocNguyen avatar Jul 11 '18 02:07 TrinhQuocNguyen

@TrinhQuocNguyen how did you train your own?

NerminSalem avatar Jul 31 '18 13:07 NerminSalem

@xhh232018 I am trying to use multiple GPUs for training too. However, it seems to consume a lot of CPU. Did that occur in your training procedure? Also, I find it consumes much more memory on the first GPU, which makes it hard to fully use all GPUs. Could you give me any advice on that? Thanks a lot.

Mendel1 avatar Sep 14 '18 08:09 Mendel1

@xhh232018 I am trying to use multiple GPUs for training too. However, it seems to consume a lot of CPU. Did that occur in your training procedure? Also, I find it consumes much more memory on the first GPU, which makes it hard to fully use all GPUs. Could you give me any advice on that? Thanks a lot.

I ran into the same problem. Have you found a way to deal with it?

Mistariano avatar Feb 02 '19 08:02 Mistariano

@xhh232018 I am trying to use multiple GPUs for training. Could you give me any advice on that? Also, when I trained on my own dataset, a ZeroDivisionError appeared together with the message "Found 0 images belonging to 0 classes." How can I solve this problem? I need your help, thank you!

ZDD2009 avatar Feb 15 '19 01:02 ZDD2009

@Mistariano Have you solved your problem? I have the same problem, thank you!

ZDD2009 avatar Feb 15 '19 06:02 ZDD2009

@ZDD2009 I built the model on my CPU first and then used multi_gpu_model to create parallel copies on the GPUs, and it worked.

My code looks like this:

import tensorflow as tf
from keras.utils import multi_gpu_model

# Original version:
# model = build_model()
# model.compile(...)
# model.fit(...)

# Multi-GPU version:
# build the template model on the CPU
with tf.device('/cpu:0'):
    model = build_model()

# create the parallel model that replicates the template on the GPUs
parallel_model = multi_gpu_model(model)

# compile and fit the parallel model, so it can be trained on multiple GPUs
parallel_model.compile(...)
parallel_model.fit(...)

# and save the template model, not the parallel one
model.save(...)

You can apply this trick to both pconv_model and vgg. It does speed up training.

However, the first GPU still used much more memory than the others after I did that. I set log_device_placement=True and analyzed the log, and then I found that model.compile works on just one GPU, so all of the losses were computed on /gpu:0.

I have no idea how to deal with the problem.

Mistariano avatar Feb 15 '19 07:02 Mistariano
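
To make the imbalance described above a little more concrete, here is a rough, framework-free sketch of the data-parallel scheme that keras.utils.multi_gpu_model is documented to follow: the input batch is cut into one sub-batch per GPU, each replica processes its slice, and the outputs are concatenated back into one batch. The names split_batch and merge_outputs are hypothetical helpers written for illustration, not Keras API, and the sketch only mirrors the slicing arithmetic, not actual device placement. Because the loss is defined on the merged output, it is plausible (and consistent with the log analysis above) that the loss ops end up on a single device.

```python
def split_batch(batch_size, n_gpus):
    """Return a (start, size) pair for the sub-batch each replica receives.

    Illustrative only: every replica gets batch_size // n_gpus samples and
    the last replica also takes whatever remainder is left over.
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    step = batch_size // n_gpus
    slices = []
    for i in range(n_gpus):
        start = i * step
        # The last replica absorbs the remainder of the batch.
        size = batch_size - start if i == n_gpus - 1 else step
        slices.append((start, size))
    return slices


def merge_outputs(per_replica_outputs):
    """Concatenate the per-replica outputs back into one full batch."""
    merged = []
    for outputs in per_replica_outputs:
        merged.extend(outputs)
    return merged


if __name__ == "__main__":
    batch = list(range(10))
    parts = [batch[s:s + n] for s, n in split_batch(len(batch), 4)]
    print(parts)
    print(merge_outputs(parts) == batch)
```

One practical consequence of this scheme is that the batch size passed to fit is the global one, so each GPU only sees roughly batch_size divided by the number of GPUs per step.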

@Mistariano Thank you very much! Did you get the message "Found 0 images belonging to 0 classes." when you trained on your own datasets? Many thanks, I need your help!

ZDD2009 avatar Feb 15 '19 08:02 ZDD2009

I've also been playing with multi-GPU implementation, but I've not been able to see any successful speedups. Seems like the VGG loss evaluations always happen on the first GPU, and so it doesn't scale well. If anyone figures out a solution for this, it'd be awesome.

MathiasGruber avatar Mar 01 '19 12:03 MathiasGruber

I've also been playing with multi-GPU implementation, but I've not been able to see any successful speedups. Seems like the VGG loss evaluations always happen on the first GPU, and so it doesn't scale well. If anyone figures out a solution for this, it'd be awesome.

In the pconv_model.py file, in line 22, you must set gpus=8:

def __init__(self, img_rows=512, img_cols=512, vgg_weights="imagenet", inference_only=False, net_name='default', gpus=8, vgg_device=None):

jiguanglu avatar Jun 27 '19 12:06 jiguanglu

@xhh232018 I am trying to use multiple GPUs for training. Could you give me any advice on that? Also, when I trained on my own dataset, a ZeroDivisionError appeared together with the message "Found 0 images belonging to 0 classes." How can I solve this problem? I need your help, thank you!

Here is the solution. The ZeroDivisionError is caused by a wrong path, which is why the generator reports something like "Found 0 images belonging to 1 classes." Whether you train on your own dataset or on the ImageNet dataset, make sure the generator actually picks up data from the directory. For example code, check notebooks/Step4/, at the "# Pick out an example" code line, which calls next() on the train / val / test generator.

Best regards,

ghost avatar Aug 20 '19 02:08 ghost
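
As a quick way to check the path issue described above before starting a training run, the following standard-library sketch counts image files per class subdirectory, which is the layout Keras directory generators expect (one subfolder per class, images inside it). The function name count_images_per_class and the extension list are assumptions made for this illustration; they are not part of PConv-Keras or Keras, and the extensions your Keras version accepts may differ.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your data and Keras version.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp"}


def count_images_per_class(root):
    """Map each class subdirectory of `root` to the number of images in it.

    Files lying directly in `root` are ignored, because a directory-based
    generator only looks inside class subfolders. A missing or non-directory
    `root` yields an empty dict.
    """
    root = Path(root)
    if not root.is_dir():
        return {}
    counts = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python check_dataset.py <path-to-train-dir>
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(count_images_per_class(target))
```

An empty result corresponds to the "0 classes" case (the path is wrong or has no subfolders), while a class mapped to 0 corresponds to "0 images belonging to 1 classes" (the subfolder exists but contains no recognized images).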