ImageAI
Multi-GPU training does not work when the hardware architecture is multi-node
Hi,
I believe that your multi_gpu_model.py file does work when the hardware is a single node (1 CPU and >= 1 GPU); however, if I have a VM with multiple nodes set up in the cloud (Azure), then your code does not work. I had to modify your source code logic by doing the following:
1- I basically ignored the setGPUUsage method in the DetectionModelTrainer class and hardcoded the number of GPUs, like so (you don't need to do this, but I did it anyway):
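A minimal sketch of what I mean (NUM_GPUS is my own placeholder name, not an actual ImageAI attribute, and the real internals of DetectionModelTrainer may differ):

```python
import tensorflow as tf

# Hardcoded GPU count instead of deriving it from setGPUUsage;
# NUM_GPUS is my own placeholder, not part of ImageAI.
NUM_GPUS = 4

# Sanity check that TensorFlow actually sees the devices on this node.
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```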
2- In the _create_model function, where you calculate the length of multi_gpu, I commented out some code to skip the multi_gpu_model function that you import, so it just returns the regular model:
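Roughly, the change looked like this (the names `model` and `multi_gpu` follow the description above and may not match the actual ImageAI source exactly):

```python
def _create_model(model, multi_gpu):
    # Original branch (commented out): wrap the model for multiple GPUs
    # whenever more than one GPU is configured.
    # if len(multi_gpu) > 1:
    #     model = multi_gpu_model(model, gpus=len(multi_gpu))
    return model  # just return the regular single-GPU model
```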
3- Then, when training the model, I added the tf.distribute.MultiWorkerMirroredStrategy() class. A scope is created in which all the variables are copied to all available GPUs (model replicas), and finally the model is trained outside of the scope, like so:
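Schematically, the pattern was the following (the tiny Sequential model is only a stand-in for ImageAI's YOLOv3 model, and TF_CONFIG has to be set on each node before launching):

```python
import numpy as np
import tensorflow as tf

# Each worker/node needs the TF_CONFIG environment variable set so the
# strategy knows about the cluster before this line runs.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are mirrored onto every replica
    # (GPU) on every worker. This model is just a placeholder for YOLOv3.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="adam", loss="mse")

# Training itself is invoked outside the scope, as usual.
x = np.random.rand(64, 4).astype("float32")
y = np.random.rand(64, 1).astype("float32")
model.fit(x, y, epochs=1, batch_size=16)
```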
4- This actually works; however, as observed in the pictures below, some of the GPUs work in parallel but never at 100% capacity, and sometimes only 2 out of the 4 GPUs are working. As a result, the YOLOv3 architecture trains faster with 1 CPU and 1 GPU than with multiple GPUs (multi-worker). Would you guys happen to know why this is happening? Or maybe your multi_gpu_model function is able to work with a multi-worker setup but it just does not work for me?
Any tips would be appreciated,
Best Regards,
Edwin Aguirre