recommenders
Significant drop in the model's performance metric (Top-K accuracy) when going from 1 GPU to 2 or 4 GPUs
Hi, as the title says, the model's performance drops when I train on a cluster of GPUs. The (custom) training job runs on the Vertex AI training service.

This is the image I am using: us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-9:latest

These are the machine types: a2-ultragpu-1g, a2-ultragpu-2g and a2-ultragpu-4g, with 1, 2 and 4 GPUs respectively.

I'm following this tutorial: https://www.tensorflow.org/recommenders/examples/diststrat_retrieval

This is my implementation of the strategy:

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice(reduce_to_device="cpu:0"))
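For context, this is roughly how the strategy is wired into the rest of the training code, following the linked tutorial (the model class, optimizer and dataset variables below come from that tutorial and stand in for my own code):

import tensorflow as tf
import tensorflow_recommenders as tfrs

# All gradient reductions are sent to the host CPU, as configured above.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice(reduce_to_device="cpu:0"))

with strategy.scope():
    # Two-tower retrieval model from the tutorial; the model (embeddings,
    # retrieval task, metrics) must be built inside the strategy scope.
    model = MovielensModel(layer_sizes=[64, 32])
    model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

model.fit(cached_train, epochs=3, validation_data=cached_test)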
For the 2- and 4-GPU runs I scaled the global batch size accordingly:

batch_size_2GPUs = batch_size_1GPU * 2
batch_size_4GPUs = batch_size_1GPU * 4
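Concretely, the batching part looks roughly like this (the base batch size and the dataset variable names are illustrative, not the exact values from my job):

# Global batch size scaled with the number of GPUs on the machine.
base_batch_size = 4096                                # batch size used for the 1-GPU run
global_batch_size = base_batch_size * strategy.num_replicas_in_sync

cached_train = train.shuffle(100_000).batch(global_batch_size).cache()
cached_test = test.batch(global_batch_size // 2).cache()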
Is there anything else I need to do at the code level to get the same metric values in each case? Thanks in advance.