Horovod converges slowly for ResNet

Open jiajinyu opened this issue 6 years ago • 21 comments

When using real ImageNet datasets instead of synthetic ones, we found that Horovod converges much slower than replicated with NCCL, but only on ResNet.

We are aware of the fix in #190 by @alsrgv. We tested some other networks, such as VGG11 and AlexNet, as mentioned in issue #189. Both NCCL and Horovod converge at a similar speed for those networks.

We suspected that the shortcut structure in ResNet might be causing problems, so we also tested MobileNet in the benchmark. Again, NCCL and Horovod converge at the same rate.

Not sure what went wrong; any ideas would be appreciated. (Working with @lcytzk.)

cc @alsrgv

jiajinyu avatar Jun 01 '18 05:06 jiajinyu

@jiajinyu, can you share the exact commands you're using for NCCL and Horovod, your TensorFlow and Horovod versions, and the git SHA of the benchmarks checkout you're using?

alsrgv avatar Jun 01 '18 05:06 alsrgv

@alsrgv, thanks for the prompt reply.

For NCCL, we use

  python /tensorflow_benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --data_format=NCHW --batch_size=64 --model=resnet50_v2 --optimizer=momentum \
  --variable_update=replicated --nodistortions --allow_growth=True --all_reduce_spec=nccl\
  --print_training_accuracy=True --num_epochs=1 --weight_decay=1e-4 \
  --num_gpus=4 \
  --data_dir=/datasets/imagenet/ --display_every=100  --train_dir=/xyz/train/nccl/res_4_5_16

For Horovod,

  mpirun --allow-run-as-root -np 4 -H host1:1,host2:1,host3:1,host4:1 \
  -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=bond0  -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib  \
  -mca btl_tcp_if_include bond0  -mca plm_rsh_args "-p 9091 -i /root/.ssh/id_rsa_horovod" \
  python /tensorflow_benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --data_format=NCHW --batch_size=64 --model=resnet50 --optimizer=momentum \
  --variable_update=horovod --nodistortions --allow_growth=True \
  --print_training_accuracy=True --num_epochs=1 --weight_decay=1e-4 \
  --num_gpus=1 \
  --data_dir=/datasets/imagenet/ --display_every=100  --train_dir=/yjj/train/nccl/res_4_5_16

We are using the Horovod Docker image from GitHub (link).

Sorry, I don't understand what a git SHA is.

jiajinyu avatar Jun 01 '18 05:06 jiajinyu

Thanks! I notice you use resnet50_v2 in one case, and resnet50 in another - is that intentional?

To get the git sha, please use git rev-parse HEAD in your benchmarks git repo. This will pinpoint the exact commit you're at.

alsrgv avatar Jun 01 '18 05:06 alsrgv

The commit is d7b68b146c82ee9b936bd196c9f1ed6d54f4a1c7 (fixed).

As for v2: that was not intentional. We tested both resnet50 and resnet50_v2, and neither converges the same under Horovod as under replicated/NCCL; I just pasted commands from two different runs by mistake.

It's likely the commit is not in master because it contains…

jiajinyu avatar Jun 01 '18 05:06 jiajinyu

I am running a repro now. The first issue I encountered is that with the exact flags you specified, the benchmark fails with: ValueError: Could not identify name of dataset. Please specify with --data_name option.

I have added --data_name imagenet to my repro. However, can you double-check the commands that you used?

alsrgv avatar Jun 01 '18 07:06 alsrgv

Sure. Let me rerun with the additional flag.

jiajinyu avatar Jun 01 '18 07:06 jiajinyu

I am checking the code. I think you may be using a different data_dir? The code looks like this:

  # Infer dataset name from data_dir if data_name is not provided.
  if data_name is None:
    for supported_name in _SUPPORTED_DATASETS:
      if supported_name in data_dir:
        data_name = supported_name
        break
    else:  # Failed to identify dataset name from data dir.
      raise ValueError('Could not identify name of dataset. '
                       'Please specify with --data_name option.')

In my case it passes because 'imagenet' in '/datasets/imagenet' evaluates to True.
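
For anyone hitting the same ValueError, here is a quick sketch of that substring check (the _SUPPORTED_DATASETS contents below are an illustrative assumption, not the repo's actual list):

  _SUPPORTED_DATASETS = ('imagenet', 'cifar10')  # illustrative subset for this sketch

  def infer_data_name(data_dir):
    # Mirrors the loop above: return the first supported name that appears in the path.
    for name in _SUPPORTED_DATASETS:
      if name in data_dir:
        return name
    return None  # caller must then pass --data_name explicitly

  print(infer_data_name('/datasets/imagenet/'))  # 'imagenet' -> inferred automatically
  print(infer_data_name('/data/ilsvrc2012/'))    # None -> --data_name is required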

jiajinyu avatar Jun 01 '18 07:06 jiajinyu

@jiajinyu, I'm looking more into it. It's possible that https://github.com/tensorflow/benchmarks/commit/d7b68b146c82ee9b936bd196c9f1ed6d54f4a1c7#diff-eae1728a56f07ec0458d8cc14f288807R2370 should be reverted and, instead, learning rate should be increased. I did a quick test and got much better convergence, but I need to look more to fully understand the problem.
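
For reference, the usual rule of thumb when the effective batch size grows with the number of replicas is to scale the base learning rate linearly (the linear scaling rule from Goyal et al., "Accurate, Large Minibatch SGD"). A minimal sketch of that rule; the base values and names are illustrative, not taken from tf_cnn_benchmarks:

  # Linear LR scaling: if the effective batch is k times larger, multiply the base LR by k.
  BASE_LR = 0.1          # reference LR assumed to be tuned for a single-replica batch of 256
  BASE_BATCH_SIZE = 256

  def scaled_learning_rate(per_replica_batch, num_replicas):
    effective_batch = per_replica_batch * num_replicas
    return BASE_LR * effective_batch / BASE_BATCH_SIZE

  # Example: 4 replicas x batch 64 -> effective batch 256 -> LR stays at 0.1.
  print(scaled_learning_rate(64, 4))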

alsrgv avatar Jun 01 '18 21:06 alsrgv

@alsrgv, thanks a lot for working on this so quickly. Looking forward to your solution.

jiajinyu avatar Jun 02 '18 01:06 jiajinyu

Submitted PR #200, which fixes the learning rate adjustment for models that have a custom learning rate schedule, e.g. ResNet-50.

[image attachment in the original comment]

cc @reedwm

alsrgv avatar Jun 02 '18 08:06 alsrgv

At a glance, it seems to me that the problem is on the --variable_update=replicated side rather than the Horovod side. LocalFetchFromPS and the other variable managers use a mean-allreduce, but LocalReplicated seems to use a sum-allreduce. Given that the learning rate is computed from self.batch_size = param.batch_size * num_gpus, it should use a mean-allreduce instead. Horovod computes the learning rate from self.batch_size = param.batch_size * 1, and with a sum-allreduce that seems just right. But I may be very wrong about the above; this codebase grows very fast and perhaps there is another scaling factor somewhere that I missed.

Oh, but ResnetModel.get_learning_rate() has a num_batches_per_epoch term which does require the "total" batch size across workers and GPUs... a lot of mess.
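
To make the sum-vs-mean point concrete, here is a minimal NumPy sketch (illustrative only) showing that a sum-allreduce with learning rate lr produces exactly the same update as a mean-allreduce with N * lr, so the aggregation mode and the learning-rate convention have to be chosen together:

  import numpy as np

  def sgd_step(weights, per_replica_grads, lr, reduce='mean'):
    # Aggregate the per-replica gradients by sum or mean, then apply one SGD update.
    grads = np.stack(per_replica_grads)
    agg = grads.sum(axis=0) if reduce == 'sum' else grads.mean(axis=0)
    return weights - lr * agg

  w = np.array([1.0, 2.0])
  g = [np.array([0.1, 0.2]), np.array([0.3, 0.4]),
       np.array([0.5, 0.6]), np.array([0.7, 0.8])]  # gradients from 4 replicas

  # sum-allreduce with lr gives the identical update to mean-allreduce with 4 * lr
  assert np.allclose(sgd_step(w, g, 0.01, 'sum'), sgd_step(w, g, 0.04, 'mean'))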

ppwwyyxx avatar Jun 02 '18 09:06 ppwwyyxx

I think the right way is to compute the learning rate based on (# examples / total batch size), and #200 fixes that. That said, having to sum gradients instead of averaging them does seem strange - but that's what --variable_update=replicated is apparently doing in both the default and NCCL modes.

alsrgv avatar Jun 02 '18 09:06 alsrgv

Yeah, I agree, but it looks like VariableMgrLocalFetchFromPS is doing averaging rather than summing? The three modes have to agree with each other at the very least, so perhaps there is more for the team to fix.

As long as they agree with each other, they are all arguably not-bad choices. But currently it seems that replicated is different from the other two.

ppwwyyxx avatar Jun 02 '18 10:06 ppwwyyxx

Indeed. @reedwm, what do you think?

alsrgv avatar Jun 02 '18 10:06 alsrgv

Thanks for your work! The learning rate may be the issue, but it is odd that VGG, AlexNet, and MobileNet work fine in our testing. The learning rate is very important for training and should affect all of them, so why only ResNet? By the way, we did our testing with 4 devices; maybe 4 is too small? Hmm, I don't think so.

lcytzk avatar Jun 02 '18 13:06 lcytzk

@lcytzk, I think it's because they all define different learning rate schedules that may depend on the batch size. ResNet defines learning rate warmup, which is affected by the batch size immediately, whereas AlexNet only uses the batch size for decay. VGG & MobileNet do not define a learning rate adjustment schedule at all.
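
A minimal sketch of why a warmup schedule reacts to a wrong batch size from step 0, while a decay-only schedule shows no difference until the first decay boundary (the constants and epoch boundaries below are illustrative, not the exact ResNet-50 schedule in tf_cnn_benchmarks):

  IMAGENET_TRAIN_EXAMPLES = 1281167  # standard ImageNet-1k training set size

  def resnet_style_lr(global_step, total_batch_size, base_lr=0.1,
                      warmup_epochs=5, decay_epochs=(30, 60, 80), decay_factor=0.1):
    # batches_per_epoch depends directly on the total batch size, so if that value
    # is wrong by a factor of N, the warmup slope is wrong from the very first step
    # and every decay boundary shifts by the same factor.
    batches_per_epoch = IMAGENET_TRAIN_EXAMPLES / total_batch_size
    warmup_steps = warmup_epochs * batches_per_epoch
    if global_step < warmup_steps:
      return base_lr * global_step / warmup_steps  # linear warmup, active immediately
    lr = base_lr
    for epoch in decay_epochs:                     # step decay afterwards
      if global_step >= epoch * batches_per_epoch:
        lr *= decay_factor
    return lr

  # e.g. resnet_style_lr(100, total_batch_size=256) is still deep in warmup.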

alsrgv avatar Jun 02 '18 19:06 alsrgv

@alsrgv, thanks a lot for the fix. @lcytzk and I tested it in our setup, and ResNet, VGG, etc. all work fine now.

jiajinyu avatar Jun 03 '18 07:06 jiajinyu

As @ppwwyyxx stated, the learning rate computations are a mess. I don't think anyone at Google has ever run distributed tf_cnn_benchmarks to convergence, at least not recently (@tfboyd, can you confirm?). We have verified convergence on ResNet-50 in the single-machine case.

@ppwwyyxx is correct that in replicated and distributed_replicated modes we do a sum-allreduce, while in the other modes we do a mean-allreduce. The idea behind doing a sum is that it increases performance by avoiding a division, but we really should be consistent. The current workaround is that, when using replicated mode, you decrease the learning rate on the command line by a factor of (num GPUs per worker). I believe the default learning rates assume the all-reduce will be a sum, not a mean (@bignamehyp, can you confirm the default learning rate assumptions?).

Also, get_learning_rate takes a batch_size parameter that is currently (batch size per GPU) * (num GPUs); it does not take the number of workers into account. The reason it is not multiplied by the number of workers is that each worker independently increments the global_step. In non-Horovod mode, all workers share the global_step variable, so each step increases global_step by num_workers, making global_step num_workers times larger than the true number of steps run so far. The fact that the batch_size parameter to get_learning_rate is num_workers times smaller is cancelled out by the fact that global_step is num_workers times larger.
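
A small worked check of that cancellation, with illustrative numbers (not from any particular run):

  per_gpu_batch, num_gpus, num_workers = 64, 4, 8
  true_steps = 1000

  examples_seen = true_steps * per_gpu_batch * num_gpus * num_workers

  # Shared global_step: each of the num_workers workers increments it, so it reads
  # num_workers times the true step count, while get_learning_rate's batch_size
  # argument is num_workers times too small; the two factors cancel exactly.
  shared_global_step = true_steps * num_workers
  per_worker_batch = per_gpu_batch * num_gpus
  assert shared_global_step * per_worker_batch == examples_seen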

Unfortunately, I believe that in Horovod (although I haven't tested it), each worker has its own global_step. #200 fixes this by passing the total batch_size, not the per-worker batch_size, to get_learning_rate. This should only be done when using Horovod.
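
A hedged sketch of that logic (not the actual #200 diff; the function and parameter names are made up for illustration):

  def batch_size_for_lr_schedule(per_gpu_batch, num_gpus, num_workers, use_horovod):
    if use_horovod:
      # Each worker keeps its own global_step, so the worker factor no longer
      # cancels; the schedule must see the total batch size across all workers.
      return per_gpu_batch * num_gpus * num_workers
    # Shared global_step: keep the per-worker batch size (the worker factor cancels).
    return per_gpu_batch * num_gpus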

I apologize for the complete mess the learning rate code is. Eventually I hope to simplify it.

reedwm avatar Jun 04 '18 17:06 reedwm

@reedwm, sorry for the delay. Makes sense, I will update the PR.

alsrgv avatar Jun 10 '18 22:06 alsrgv

@reedwm, there is another PR about Horovod-based Inception-v3 convergence. Could you please help review it? https://github.com/tensorflow/models/pull/4276

wei-v-wang avatar Jun 12 '18 19:06 wei-v-wang

That's a big mess; the learning rate is very important! #205

anpark avatar Jun 13 '18 07:06 anpark