recommenders

multiple GPUs are not working properly in distribution tutorial

Open canonrock16 opened this issue 4 years ago • 4 comments

I ran this tutorial on Google Cloud Compute Engine. My instance has 2 GPUs (A100), and TensorFlow recognizes both of them.

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 7266000849052307782,
 name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 40332623872
 locality {
   bus_id: 1
   links {
     link {
       device_id: 1
       type: "StreamExecutor"
       strength: 1
     }
   }
 }
 incarnation: 12716067537431769392
 physical_device_desc: "device: 0, name: A100-SXM4-40GB, pci bus id: 0000:00:04.0, compute capability: 8.0",
 name: "/device:GPU:1"
 device_type: "GPU"
 memory_limit: 40332623872
 locality {
   bus_id: 1
   links {
     link {
       type: "StreamExecutor"
       strength: 1
     }
   }
 }
 incarnation: 6366335206738999217
 physical_device_desc: "device: 1, name: A100-SXM4-40GB, pci bus id: 0000:00:05.0, compute capability: 8.0"]

But during training, my 2nd GPU's utilization is only around 1-2%, while the 1st GPU's utilization is around 40%. (Screenshot: Oct-06-2021 17-53-19)

How can I get the second GPU to work properly?

canonrock16, Oct 06 '21 09:10
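
For anyone trying to reproduce the per-GPU utilization numbers reported above, here is a minimal sketch that polls `nvidia-smi` from Python. It assumes `nvidia-smi` is on the PATH; the helper names (`parse_gpu_stats`, `read_gpu_stats`) are made up for illustration and are not part of TensorFlow or TFRS.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "index,utilization.gpu,memory.used"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append(
            {"index": int(index), "util_pct": int(util), "mem_mib": int(mem)}
        )
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed per-GPU stats."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling `read_gpu_stats()` in a loop while `model.fit` is running shows whether the second GPU does any real work. Note that memory usage alone can be misleading, since TensorFlow by default reserves most of the memory on every visible GPU.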

Did you get any warnings or errors? I got a warning saying it would reduce to 1 GPU/CPU.

xiaoyaoyang, Nov 11 '21 18:11

The fact that both GPUs' memory is saturated means it is using both GPUs, just not efficiently.

Did you get a warning like "WARNING:tensorflow:Efficient allreduce is not supported for 2 IndexedSlices"? This seems to be a known performance issue with distribution strategies (https://github.com/tensorflow/tensorflow/issues/41898). I think the workaround is to use MultiWorkerMirroredStrategy.

windmaple, Nov 22 '21 10:11
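
The workaround suggested above could look roughly like the sketch below. It is not taken from the tutorial: the function name `build_model_under_strategy` and the toy `Sequential` model are hypothetical stand-ins for the tutorial's tfrs model, and the assumption is that `MultiWorkerMirroredStrategy` falls back to a single worker using the local GPUs when `TF_CONFIG` is not set.

```python
def build_model_under_strategy(use_multi_worker=True):
    """Create a toy Keras model inside a distribution strategy scope.

    TensorFlow is imported lazily so this module can be loaded on
    machines where it is not installed.
    """
    import tensorflow as tf

    if use_multi_worker:
        # Workaround suggested in this thread. Create the strategy early,
        # before any other TensorFlow ops run.
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
    else:
        # Single-machine default, for comparison.
        strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
        # Variables must be created inside the scope to be mirrored.
        # Hypothetical stand-in model; the tutorial's model would go here.
        model = tf.keras.Sequential(
            [
                tf.keras.layers.Dense(8, activation="relu"),
                tf.keras.layers.Dense(1),
            ]
        )
        model.compile(optimizer="adam", loss="mse")
    return strategy, model
```

After `strategy, model = build_model_under_strategy()`, `strategy.num_replicas_in_sync` reports how many replicas the strategy will use, which is a quick check that both GPUs were picked up before training starts.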

It seems like tfrs.Model does not use multiple GPUs, so I changed tfrs.Model to tf.keras.Model by following the official docs, and it worked well. All my GPUs now show high utilization.

kim-sardine, Mar 07 '22 07:03

@kim-sardine thanks! tfrs.Model is built on tf.keras.Model, though... I will give it a try.

It seems to be a known issue, and replacing the strategy with tf.distribute.MultiWorkerMirroredStrategy seems to be a workaround.

xiaoyaoyang, Mar 08 '22 21:03