recommenders
Significant drop in the model's performance metric (Top-K accuracy) when going from 1 GPU to 2 or 4 GPUs
Hi, as the title says, the model's performance drops when I train on a cluster of GPUs. The (custom) training job runs on the Vertex AI training service.

This is the image I am using: us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-9:latest

These are the machine types: a2-ultragpu-1g, a2-ultragpu-2g and a2-ultragpu-4g, with 1, 2 and 4 GPUs respectively.

I'm following this tutorial: https://www.tensorflow.org/recommenders/examples/diststrat_retrieval

This is my implementation of the strategy:

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice(reduce_to_device="cpu:0"))
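For context, this is roughly how the strategy is wired into the rest of the training code, following the linked tutorial (the model class, optimizer and dataset variables below come from that tutorial and stand in for my own code):

import tensorflow as tf
import tensorflow_recommenders as tfrs

# All gradient reductions are sent to the host CPU, as configured above.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice(reduce_to_device="cpu:0"))

with strategy.scope():
    # Two-tower retrieval model from the tutorial; the model (embeddings,
    # retrieval task, metrics) must be built inside the strategy scope.
    model = MovielensModel(layer_sizes=[64, 32])
    model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

model.fit(cached_train, epochs=3, validation_data=cached_test)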
For the 2- and 4-GPU runs I scaled the global batch size accordingly:

batch_size_2GPUs = batch_size_1GPU * 2
batch_size_4GPUs = batch_size_1GPU * 4
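Concretely, the batching part looks roughly like this (the base batch size and the dataset variable names are illustrative, not the exact values from my job):

# Global batch size scaled with the number of GPUs on the machine.
base_batch_size = 4096                                # batch size used for the 1-GPU run
global_batch_size = base_batch_size * strategy.num_replicas_in_sync

cached_train = train.shuffle(100_000).batch(global_batch_size).cache()
cached_test = test.batch(global_batch_size // 2).cache()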
Is there anything else I need to do at the code level to get the same metric values in each case? Thanks in advance.