Multiple GPUs are not working properly in the distribution tutorial
I ran this tutorial on Google Cloud Compute Engine. My instance has 2 GPUs (A100), and TensorFlow recognizes both of them:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7266000849052307782,
name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 40332623872
locality {
bus_id: 1
links {
link {
device_id: 1
type: "StreamExecutor"
strength: 1
}
}
}
incarnation: 12716067537431769392
physical_device_desc: "device: 0, name: A100-SXM4-40GB, pci bus id: 0000:00:04.0, compute capability: 8.0",
name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 40332623872
locality {
bus_id: 1
links {
link {
type: "StreamExecutor"
strength: 1
}
}
}
incarnation: 6366335206738999217
physical_device_desc: "device: 1, name: A100-SXM4-40GB, pci bus id: 0000:00:05.0, compute capability: 8.0"]
But during training, my second GPU's utilization sits around 1~2%, while the first GPU's is around 40%.

How can I get the second GPU to work properly?
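
For reference, here is roughly what I'm running. This is a minimal sketch of the tutorial's distributed setup; the towers, vocabularies, and dataset below are toy stand-ins I made up for illustration, while the real code follows the tutorial's Movielens example:

import numpy as np
import tensorflow as tf
import tensorflow_recommenders as tfrs

# MirroredStrategy should split each batch across both visible GPUs.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)  # expect 2

# Toy vocabularies standing in for the Movielens user/movie ids.
user_ids = np.array([str(i) for i in range(1000)])
item_ids = np.array([str(i) for i in range(1000)])

with strategy.scope():
    # Two-tower model built inside the strategy scope, as in the tutorial.
    user_model = tf.keras.Sequential([
        tf.keras.layers.StringLookup(vocabulary=user_ids),
        tf.keras.layers.Embedding(len(user_ids) + 1, 32),
    ])
    item_model = tf.keras.Sequential([
        tf.keras.layers.StringLookup(vocabulary=item_ids),
        tf.keras.layers.Embedding(len(item_ids) + 1, 32),
    ])
    task = tfrs.tasks.Retrieval()

    class TwoTower(tfrs.Model):
        def compute_loss(self, features, training=False):
            return task(user_model(features["user_id"]),
                        item_model(features["item_id"]))

    model = TwoTower()
    model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

ds = tf.data.Dataset.from_tensor_slices(
    {"user_id": user_ids, "item_id": item_ids}).batch(256)
model.fit(ds, epochs=1)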
Did you get any warning or error? I got a warning saying it would reduce to 1 GPU/CPU.
The fact that both GPUs' memory is saturated means it is using both GPUs, just not efficiently.
Did you get a warning like this "WARNING:tensorflow:Efficient allreduce is not supported for 2 IndexedSlices"? This seems to be a known performance issue w/ diststrat (https://github.com/tensorflow/tensorflow/issues/41898). I think the workaround is to use MultiWorkerMirroredStrategy.
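
A minimal sketch of that workaround, assuming a single machine with two GPUs; only the strategy line changes relative to the tutorial:

import tensorflow as tf

# Note: create this strategy at the start of the program, before other
# TensorFlow ops. With no TF_CONFIG set in the environment, it runs as a
# single worker and uses all local GPUs. Unlike MirroredStrategy it is
# based on collective ops, which reportedly avoids the inefficient
# IndexedSlices allreduce fallback from the warning above.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Everything else stays the same: build and compile the model inside
# strategy.scope(), then call model.fit as in the tutorial.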
It seems like tfrs.Model does not use multiple GPUs.
So I changed tfrs.Model to tf.keras.Model by referring to the official docs, and it just worked: all my GPUs show high utilization.
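
In case it helps, this is roughly what I ended up with. It's a sketch following the "Customizing what happens in fit" pattern from the Keras docs, not the exact tutorial code; the feature names and task wiring come from the retrieval tutorial, so adjust them to your model:

import tensorflow as tf
import tensorflow_recommenders as tfrs

class MovielensModel(tf.keras.Model):
    """Two-tower model subclassing tf.keras.Model directly instead of tfrs.Model."""

    def __init__(self, user_model, movie_model, task):
        super().__init__()
        self.user_model = user_model    # user tower from the tutorial
        self.movie_model = movie_model  # movie tower from the tutorial
        self.task = task                # e.g. tfrs.tasks.Retrieval(...)

    def train_step(self, features):
        with tf.GradientTape() as tape:
            user_embeddings = self.user_model(features["user_id"])
            movie_embeddings = self.movie_model(features["movie_title"])
            loss = self.task(user_embeddings, movie_embeddings)
        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        return {"loss": loss}

    def test_step(self, features):
        user_embeddings = self.user_model(features["user_id"])
        movie_embeddings = self.movie_model(features["movie_title"])
        loss = self.task(user_embeddings, movie_embeddings)
        return {"loss": loss}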
@kim-sardine thanks! tfrs.Model subclasses tf.keras.Model though... will give it a try.
Seems it is a known issue; replacing MirroredStrategy with tf.distribute.MultiWorkerMirroredStrategy seems to be a workaround.