PConv-Keras
Questions about multi-GPU training
Hi, due to the quite long training time, I want to know how I can use keras.utils.multi_gpu_model?
Hello xhh232018, have you successfully trained the network?
Hello TrinhQuoc, I have emailed you my latest training results.
Hi xhh232018, thank you. I am currently testing it and modifying the source for my own masks 😄. It's running, but it is going to take some time to retrain the networks and validate them.
@TrinhQuocNguyen how did you train your own?
@xhh232018 I am trying to use multi GPUs for training too. However, it seems to consume a lot of CPU. Did that occur in your training procedure? Also, I find it consumes much more space on the first GPU, which makes it hard to fully use all GPU. Could you give me any advice on that? Thank you a lot.
I met the same problem. Have you guys found the way to deal with it?
@xhh232018 I am trying to use multiple GPUs for training; could you give me any advice on that? Also, when I trained on my own dataset, a ZeroDivisionError appeared along with the message "Found 0 images belonging to 0 classes." How can I solve this problem? I need your help, thank you!
@Mistariano Have you solved your problem? I have the same problem, thank you!
@ZDD2009 I built the model on my CPU first and then used multi_gpu_model to create a parallel model on the GPUs, and it worked.
My code looks like this:
import tensorflow as tf
from keras.utils import multi_gpu_model

# original single-GPU version:
# model = build_model()
# model.compile(...)
# model.fit(...)
########################
# multi-GPU version: build the template model on the CPU so its weights
# live in host memory, then replicate it across the GPUs
with tf.device('/cpu:0'):
    model = build_model()
parallel_model = multi_gpu_model(model)
parallel_model.compile(...)
parallel_model.fit(...)  # compile & fit the parallel model, so it trains on multiple GPUs
model.save(...)  # and save the template model, not the parallel one
You can perform this trick on both the pconv model and the VGG model, and it does speed up training.
However, the first GPU still used much more memory than the others after I did that.
I set log_device_placement=True and analyzed the log; I found that model.compile works on just one GPU, so all of the losses were computed on /gpu:0.
I have no idea how to deal with the problem.
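One direction that might help (my own assumption, not something confirmed in this thread): since the repo's constructor exposes a vgg_device argument, building the VGG loss network inside an explicit device scope should keep its ops off /gpu:0. A minimal sketch of the idea, using a freshly built VGG16 (weights=None is only to avoid a download here; the actual repo loads "imagenet" weights):

```python
import tensorflow as tf

# Build the loss network under an explicit device scope so its ops are
# placed on the chosen device instead of all landing on /gpu:0.
with tf.device('/cpu:0'):
    vgg = tf.keras.applications.VGG16(include_top=False,
                                      weights=None,
                                      input_shape=(512, 512, 3))

print(len(vgg.layers))
```

The same pattern with '/gpu:1' etc. could be used to spread the loss computation across devices, at the cost of extra weight copies.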
@Mistariano thank you very much! Did you get the error "Found 0 images belonging to 0 classes." when you trained on your own dataset? Thanks very much! I need your help!
I've also been playing with a multi-GPU implementation, but I haven't been able to see any successful speedup. It seems the VGG loss evaluations always happen on the first GPU, so it doesn't scale well. If anyone figures out a solution for this, it'd be awesome.
In the pconv_model.py file, at line 22, you must change gpus=8:
def __init__(self, img_rows=512, img_cols=512, vgg_weights="imagenet", inference_only=False, net_name='default', gpus=8, vgg_device=None):
Here is the solution. The ZeroDivisionError is caused by a wrong dataset path, which makes the generator report "Found 0 images belonging to 0 classes." If you train on your own dataset or on ImageNet, make sure the generator actually picks up images from the directory you point it at. For example code, check notebooks/Step4 ("#Pick out an example" code line) with next(train / val / test_generator).
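To add some context on why this happens (my understanding of Keras's flow_from_directory, not something stated explicitly above): the generator treats each immediate subdirectory of the path you pass it as one class, so images placed directly in the root directory are counted as 0 images in 0 classes, and the steps-per-epoch calculation then divides by zero. A stdlib-only sketch of the wrong and correct layouts:

```python
import os
import tempfile

def count_classes(root):
    """Mimic how flow_from_directory counts classes: one per subdirectory."""
    return sum(os.path.isdir(os.path.join(root, d)) for d in os.listdir(root))

root = tempfile.mkdtemp()

# Wrong layout: images sit directly under the path given to the generator.
open(os.path.join(root, "img1.jpg"), "w").close()
print(count_classes(root))  # 0 -> "Found 0 images belonging to 0 classes."

# Correct layout: <path>/<class_name>/<image>. For inpainting any single
# dummy class name works, e.g. "data".
os.makedirs(os.path.join(root, "data"))
os.rename(os.path.join(root, "img1.jpg"),
          os.path.join(root, "data", "img1.jpg"))
print(count_classes(root))  # 1
```

So the fix is usually just to move the images one level down into a dummy class folder and point the generator at the parent directory.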
Best regards,