ImageAI
Multi-GPU training does not work when the hardware architecture is multi-node
Hi,
I believe that your multi_gpu_model.py file does work when the hardware is a single node (1 CPU and >= 1 GPU); however, if I have a VM with multiple nodes set up in the cloud (Azure), then your code does not work. I had to modify your source code logic by doing the following:
1- I basically ignored the setGPUUsage method in the DetectionModelTrainer class and hardcoded the number of GPUs, like so (you don't need to do this, but I did it anyway):
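A minimal sketch of what I mean (NUM_GPUS is my own placeholder name, not an actual ImageAI attribute, and the real internals of DetectionModelTrainer may differ):

```python
import tensorflow as tf

# Hardcoded GPU count instead of deriving it from setGPUUsage;
# NUM_GPUS is my own placeholder, not part of ImageAI.
NUM_GPUS = 4

# Sanity check that TensorFlow actually sees the devices on this node.
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```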
2- In the _create_model function, where you calculate the length of multi_gpu, I commented out some code to skip the multi_gpu_model function that you import, so it just returns the regular model:
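Roughly, the change looked like this (the names `model` and `multi_gpu` follow the description above and may not match the actual ImageAI source exactly):

```python
def _create_model(model, multi_gpu):
    # Original branch (commented out): wrap the model for multiple GPUs
    # whenever more than one GPU is configured.
    # if len(multi_gpu) > 1:
    #     model = multi_gpu_model(model, gpus=len(multi_gpu))
    return model  # just return the regular single-GPU model
```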
3- Then, when training the model, I added the tf.distribute.MultiWorkerMirroredStrategy() class. A scope is created in which all the variables are copied to all available GPUs (model replicas), and finally the model is trained outside of the scope, like so:
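Schematically, the pattern was the following (the tiny Sequential model is only a stand-in for ImageAI's YOLOv3 model, and TF_CONFIG has to be set on each node before launching):

```python
import numpy as np
import tensorflow as tf

# Each worker/node needs the TF_CONFIG environment variable set so the
# strategy knows about the cluster before this line runs.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are mirrored onto every replica
    # (GPU) on every worker. This model is just a placeholder for YOLOv3.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="adam", loss="mse")

# Training itself is invoked outside the scope, as usual.
x = np.random.rand(64, 4).astype("float32")
y = np.random.rand(64, 1).astype("float32")
model.fit(x, y, epochs=1, batch_size=16)
```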
4- This actually works; however, as observed in the pictures below, some of the GPUs work in parallel but never at 100% capacity, and sometimes only 2 out of the 4 GPUs are working. As a result, the YOLOv3 architecture trains faster with 1 CPU and 1 GPU than with multiple GPUs (multi-worker). Would you guys happen to know why this is happening? Or maybe your multi_gpu_model function is able to work with a multi-worker setup but it just does not work for me?
Any tips would be appreciated,
Best Regards,
Edwin Aguirre