
How to evaluate worker performance independently in distributed training

Open delucca opened this issue 3 years ago • 0 comments

Hi

I'm trying to evaluate the performance of each worker independently in a cluster with multiple machines while they train the same model. My goal is to record each worker's training time.

With every setup and config I try, I always get the same time for all workers (probably because of gradient synchronization). So even if one of my workers is a machine that is 4x faster, it still records the same time as the slowest machine in the cluster.
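That behavior is consistent with synchronous data-parallel training: every step ends with a gradient synchronization (e.g. an allreduce), so each worker's step time includes waiting for the slowest machine. One way around it is to time only the local compute (forward/backward), before the synchronization point. A minimal stdlib-Python sketch of the idea, with a thread barrier standing in for the allreduce and made-up per-worker speeds:

```python
import threading
import time

NUM_WORKERS = 3
# Hypothetical per-worker compute costs per step: worker 0 is 4x faster than worker 2.
COMPUTE_SECONDS = [0.01, 0.02, 0.04]

barrier = threading.Barrier(NUM_WORKERS)  # stands in for the gradient allreduce
compute_times = [0.0] * NUM_WORKERS       # local compute only
step_times = [0.0] * NUM_WORKERS          # compute + wait for slowest worker

def worker(rank):
    step_start = time.perf_counter()
    # --- local compute: time this part to see each worker's true speed ---
    t0 = time.perf_counter()
    time.sleep(COMPUTE_SECONDS[rank])  # simulated forward/backward pass
    compute_times[rank] = time.perf_counter() - t0
    # --- synchronization point: everyone waits for the slowest worker ---
    barrier.wait()
    step_times[rank] = time.perf_counter() - step_start

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# step_times are roughly equal (dominated by the slowest worker),
# while compute_times still reflect each worker's individual speed.
print("compute:", compute_times)
print("step:   ", step_times)
```

In a real framework the same idea applies: record the wall-clock time of the local forward/backward on each rank before the synchronization call, rather than the full step, and log it per rank.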

Does anyone have any idea how I can do that?

delucca avatar Feb 15 '22 03:02 delucca