Horovod converges slowly for ResNet
When using a real ImageNet dataset instead of synthetic data, we found that Horovod converges much slower than replicated with NCCL, but only for ResNet.
We are aware of the fix #190 by @alsrgv. We tested some other networks, such as VGG11 and AlexNet, as mentioned in issue #189. Both NCCL and Horovod converge at a similar speed for these networks.
We suspected the shortcut structure in ResNet might be causing problems, so we also tested MobileNet in the benchmark. NCCL and Horovod still converge at the same speed.
Not sure what went wrong. Any ideas would be appreciated. (With @lcytzk.)
cc @alsrgv
@jiajinyu, can you share the exact commands you're using for NCCL and Horovod, the versions of TensorFlow and Horovod, and the git sha of the benchmarks repo you're using?
@alsrgv , thanks for the prompt reply
For NCCL, we use
python /tensorflow_benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--data_format=NCHW --batch_size=64 --model=resnet50_v2 --optimizer=momentum \
--variable_update=replicated --nodistortions --allow_growth=True --all_reduce_spec=nccl \
--print_training_accuracy=True --num_epochs=1 --weight_decay=1e-4 \
--num_gpus=4 \
--data_dir=/datasets/imagenet/ --display_every=100 --train_dir=/xyz/train/nccl/res_4_5_16
For Horovod,
mpirun --allow-run-as-root -np 4 -H host1:1,host2:1,host3:1,host4:1 \
-x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=bond0 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib \
-mca btl_tcp_if_include bond0 -mca plm_rsh_args "-p 9091 -i /root/.ssh/id_rsa_horovod" \
python /tensorflow_benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--data_format=NCHW --batch_size=64 --model=resnet50 --optimizer=momentum \
--variable_update=horovod --nodistortions --allow_growth=True \
--print_training_accuracy=True --num_epochs=1 --weight_decay=1e-4 \
--num_gpus=1 \
--data_dir=/datasets/imagenet/ --display_every=100 --train_dir=/yjj/train/nccl/res_4_5_16
We are using the Horovod Docker image from GitHub (link). Sorry, I don't understand what the git sha is.
Thanks! I notice you use resnet50_v2 in one case, and resnet50 in another - is that intentional?
To get the git sha, please use git rev-parse HEAD in your benchmarks git repo. This will pinpoint the exact commit you're at.
The commit is d7b68b146c82ee9b936bd196c9f1ed6d54f4a1c7 (fixed).
For v2 etc., this is not intentional; we tested both and neither of them converges the same. I just pasted two different versions by mistake.
It's likely the commit is not in master because it contains some of our local changes.
I am running a repro now. The first issue I encountered is that with the exact flags you specified, the benchmark fails with: ValueError: Could not identify name of dataset. Please specify with --data_name option.
I have added --data_name imagenet in my repro. However, can you double-check the commands that you used?
Sure. Let me rerun with the additional flag.
I am checking the code. I think you are using a different data_dir? The code is like this:
  # Infere dataset name from data_dir if data_name is not provided.
  if data_name is None:
    for supported_name in _SUPPORTED_DATASETS:
      if supported_name in data_dir:
        data_name = supported_name
        break
    else:  # Failed to identify dataset name from data dir.
      raise ValueError('Could not identify name of dataset. '
                       'Please specify with --data_name option.')
In my case, it passes because 'imagenet' in '/datasets/imagenet' == True.
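To illustrate (the second path is made up, purely for contrast): the inference above is just a substring test on data_dir, so it succeeds or fails depending on the directory name.

'imagenet' in '/datasets/imagenet/'    # True  -> data_name is inferred as 'imagenet'
'imagenet' in '/data/train_records/'   # False -> the ValueError above is raised unless --data_name is passed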
@jiajinyu, I'm looking more into it. It's possible that https://github.com/tensorflow/benchmarks/commit/d7b68b146c82ee9b936bd196c9f1ed6d54f4a1c7#diff-eae1728a56f07ec0458d8cc14f288807R2370 should be reverted and, instead, learning rate should be increased. I did a quick test and got much better convergence, but I need to look more to fully understand the problem.
@alsrgv , thanks a lot for working on this so quickly. Looking forward to your solution.
Submitted PR #200 which fixes the learning rate adjustment for models that have custom learning rate schedule, e.g. ResNet-50.
cc @reedwm
At a glance it seems to me that the problem is on the --variable_update=replicated side rather than the Horovod side. LocalFetchFromPS and the other variable managers use a mean-allreduce; however, LocalReplicated seems to be using a sum-allreduce. Given that the learning rate is computed from self.batch_size = param.batch_size * num_gpus, it should use a mean-allreduce instead.
Horovod computes the learning rate from self.batch_size = param.batch_size * 1, and with a sum-allreduce this seems just right.
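For concreteness, here is a small numeric sketch of the pairing I mean (illustrative numbers and names only, not code from either project; it assumes the usual linear LR-vs-batch-size scaling):

# Two GPUs, per-GPU gradients for one variable; names are hypothetical.
g = [0.4, 0.6]
num_gpus = len(g)
lr_per_gpu_batch = 0.1                          # LR derived from the per-GPU batch size
lr_total_batch = lr_per_gpu_batch * num_gpus    # linear scaling to the aggregate batch size

update_sum = lr_per_gpu_batch * sum(g)              # sum-allreduce paired with per-GPU-batch LR
update_mean = lr_total_batch * (sum(g) / num_gpus)  # mean-allreduce paired with total-batch LR
assert abs(update_sum - update_mean) < 1e-12        # the two conventions give the same update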
But I may be very wrong on my comments above. This codebase grows very fast and perhaps there is another scaling factor somewhere that I missed.
Oh, but ResnetModel.get_learning_rate() has a num_batches_per_epoch term which does require the "total" batch size across workers and GPUs... a lot of mess.
I think the right way is to compute the learning rate based on # examples / total batch size, and #200 fixes that. That said, having to sum gradients instead of averaging seems strange indeed - but that's what --variable_update=replicated is apparently doing in both default and NCCL modes.
Yeah, I agree, but it looks like VariableMgrLocalFetchFromPS is doing averaging rather than a sum? The three modes have to agree with each other at the very least, so perhaps there is more to be fixed by the team.
As long as they agree with each other, they are all arguably not-bad choices. But currently it seems that replicated is different from the other two.
Indeed. @reedwm, what do you think?
Thanks for your work! The learning rate may be the point, but it is odd that VGG, AlexNet and MobileNet work fine in our testing. The learning rate is very important for training; it should affect all training, so why only ResNet? By the way, we did our testing with 4 devices - maybe 4 is too small? Hmm, I don't think so.
@lcytzk, I think it's because they all define different learning rate schedules that may depend on the batch size. ResNet defines a learning rate warmup which is affected by the batch size immediately, whereas AlexNet just uses it for decay. VGG & MobileNet do not define a learning rate adjustment schedule at all.
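To illustrate what I mean (a simplified sketch, not the exact tf_cnn_benchmarks code; the numbers and helper names are mine), a ResNet-style schedule divides by batch_size from step 0 through num_batches_per_epoch and the warmup, while a flat schedule never looks at batch_size:

IMAGENET_NUM_TRAIN_IMAGES = 1281167  # standard ImageNet train set size

def resnet_like_lr(global_step, batch_size, base_lr=0.1,
                   warmup_epochs=5, decay_epochs=(30, 60, 80)):
    num_batches_per_epoch = IMAGENET_NUM_TRAIN_IMAGES / batch_size
    epoch = global_step / num_batches_per_epoch
    if epoch < warmup_epochs:                   # warmup: a wrong batch_size distorts
        return base_lr * epoch / warmup_epochs  # the learning rate immediately
    decays = sum(1 for e in decay_epochs if epoch >= e)
    return base_lr * (0.1 ** decays)            # step decay only matters much later

def flat_lr(global_step, batch_size, base_lr=0.01):
    return base_lr                              # VGG/MobileNet-style: batch_size is never used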
@alsrgv, thanks a lot for the fix. @lcytzk and I tested it in our case, and both ResNet and VGG, etc. now work fine.
As @ppwwyyxx stated, the learning rate computations are a mess. I don't think anyone at Google has ever run distributed tf_cnn_benchmarks to convergence, at least not recently (@tfboyd, can you confirm?). We have verified convergence on ResNet-50 in the single-machine case.
@ppwwyyxx is correct that in replicated and distributed_replicated modes, we do a sum-allreduce. In other modes, we do a mean-allreduce. The idea behind doing a sum is that it increases performance by avoiding a division, but we really should be consistent. The current workaround is that when using replicated mode, you should decrease the learning rate on the command line by a factor of (num GPUs per worker). I believe the default learning rates assume the all-reduce will be a sum, not a mean (@bignamehyp, can you confirm the default learning rate assumptions?).
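For example (purely illustrative numbers; check the exact learning-rate flag name against your benchmarks commit):

default_lr = 0.1            # learning rate the model's schedule was tuned for
num_gpus_per_worker = 4
adjusted_lr = default_lr / num_gpus_per_worker   # 0.025, passed via the benchmark's learning-rate flag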
Also, get_learning_rate takes a batch_size parameter that currently is (batch size per GPU) * (num GPUs). It does not currently take into account the number of workers. The reason it is not multiplied by the number of workers is that each worker independently increments the global_step. In non-Horovod mode, each worker shares the global_step variable, so that every step, global_step is increased by num_workers, making global_step num_workers times larger than the true number of steps run so far. The fact that the batch_size parameter to get_learning_rate is num_workers times smaller is cancelled out by the fact that global_step is num_workers times larger.
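A worked example of that cancellation, with made-up numbers:

per_gpu_batch = 64
num_gpus = 4                 # GPUs per worker
num_workers = 8
true_steps = 1000            # steps each worker has actually run

batch_size_arg = per_gpu_batch * num_gpus      # what get_learning_rate currently receives: 256
global_step = true_steps * num_workers         # shared counter, inflated num_workers times: 8000
examples_seen = batch_size_arg * global_step   # 2,048,000

total_batch = per_gpu_batch * num_gpus * num_workers   # true job-wide batch size: 2048
assert examples_seen == total_batch * true_steps       # the two num_workers factors cancel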
Unfortunately, I believe in Horovod (although I haven't tested), each worker has its own global_step. #200 fixes this by passing the total batch_size, not the per-worker batch_size, to get_learning_rate. This should only be done when using Horovod.
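Roughly the shape of that adjustment, as I understand #200 (a sketch with placeholder names, not the actual diff):

import horovod.tensorflow as hvd

hvd.init()
per_gpu_batch = 64
num_gpus_per_worker = 1      # each Horovod rank runs with --num_gpus=1
using_horovod = True         # stand-in for checking --variable_update=horovod

if using_horovod:
    # each rank has its own global_step, so feed the schedule the job-wide batch size
    schedule_batch_size = per_gpu_batch * num_gpus_per_worker * hvd.size()
else:
    # the shared global_step already compensates, so keep the per-worker batch size
    schedule_batch_size = per_gpu_batch * num_gpus_per_worker

# schedule_batch_size is what would then be passed to the model's get_learning_rate()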
I apologize for the complete mess the learning rate code is. Eventually I hope to simplify it.
@reedwm, sorry for the delay. Makes sense, I will update the PR.
@reedwm There is another PR that is about horovod-based Inception-v3 convergence. Can you please help review? https://github.com/tensorflow/models/pull/4276
That's a big mess; the learning rate is very important! #205