
Using batch_size * num_gpus as batch_size in exponential_decay calculation?

Open h8907283 opened this issue 7 years ago • 5 comments

Hi guys,

I've been reading benchmark_cnn.py and trying to figure out what batch_size really is in a multi-GPU setup when using it to calculate num_batches_per_epoch and decay_steps.

num_batches_per_epoch = (float(num_examples_per_epoch) / batch_size)

The batch_size value here is the command-line batch size multiplied by the command-line number of GPUs, so if FLAGS.batch_size = 64 and FLAGS.num_gpus = 8, then batch_size = 512.
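For concreteness, a minimal sketch of that calculation (the variable names are illustrative, not necessarily the exact ones in benchmark_cnn.py):

num_gpus = 8
batch_size = 64 * num_gpus                    # per-GPU FLAGS.batch_size times the number of GPUs = 512
num_examples_per_epoch = 1271167              # ImageNet training set size used later in this thread
num_batches_per_epoch = float(num_examples_per_epoch) / batch_size   # ~2483 steps per epoch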

However, in other code, e.g. the CIFAR-10 multi-GPU tutorial (https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py), num_batches_per_epoch is calculated using FLAGS.batch_size alone, without multiplying by the number of GPUs.

Which one is correct?

This affects the learning rate schedule. If a research paper suggests batch size = 64, learning rate = 0.045, number of epochs per decay = 2.5, and number of images per epoch = 1271167, what should the value of decay steps be?

h8907283 avatar Mar 23 '18 20:03 h8907283

tf_cnn_benchmarks is correct here. The effective batch size of a model is the batch size per GPU, times the number of GPUs.

@nealwu, that other model seems to have an incorrect calculation. Can you comment?

As for the research paper, does it specify the batch size being 64, or 64 per GPU? Perhaps they only ran with 1 GPU, in which case it doesn't matter and you should run with 1 GPU as well. If it's 64 per GPU, then you should set --num_epochs_per_decay=2.5
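To put numbers on that, using the figures from the question above (plain arithmetic, not output from tf_cnn_benchmarks):

num_examples_per_epoch = 1271167
num_epochs_per_decay = 2.5

# "batch size = 64" meaning 64 total, e.g. a single-GPU run:
decay_steps_1gpu = round(num_examples_per_epoch / 64 * num_epochs_per_decay)        # ~49655

# "batch size = 64" meaning 64 per GPU on 8 GPUs, i.e. an effective batch size of 512:
decay_steps_8gpu = round(num_examples_per_epoch / (64 * 8) * num_epochs_per_decay)  # ~6207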

reedwm avatar Mar 26 '18 23:03 reedwm

Thanks @reedwm for your answer. The training in question is MobileNet v1. The MobileNet v1 paper didn't mention much about training parameters. However, you can find them in mobilenet_v1_train.py (https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1_train.py) and mobilenet_v1.py (https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.py).

batch size = 64
initial learning rate = 0.045
learning rate decay factor = 0.94
number of epochs per decay = 2.5
imagenet size = 1271167
optimizer = tf.train.GradientDescentOptimizer
weight decay = 0.00004
batch norm decay = 0.9997
batch norm epsilon = 0.001

With that, the exponential-decay learning rate schedule starts at 0.045 and is multiplied by 0.94 (a 6% reduction) every 1271167 / 64 * 2.5 ≈ 49655 steps.
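A minimal sketch of that schedule, assuming tf.train.exponential_decay with staircase behavior (the staircase choice is my assumption; the hyperparameter values are the ones listed above):

import tensorflow as tf

num_examples_per_epoch = 1271167
batch_size = 64
num_epochs_per_decay = 2.5
decay_steps = round(num_examples_per_epoch / batch_size * num_epochs_per_decay)  # ~49655

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    0.045,            # initial learning rate
    global_step,
    decay_steps,
    0.94,             # learning rate decay factor
    staircase=True)   # assumed: decay in discrete steps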

The script supports 1 GPU. But if I want to use tf_cnn_benchmarks and 8 GPUs on an AWS p2.8xlarge, I'll have to add the mobilenet_v1 model, which is straightforward. Then the tf_cnn_benchmarks call would look like:

python tf_cnn_benchmarks.py --model=mobilenet_v1 --batch_size=64 --data_format=NCHW --device=gpu --num_gpus=8 --variable_update=replicated --local_parameter_device=gpu --all_reduce_spec='nccl' --data_dir=/home/ubuntu/datasets/imagenet --data_name=imagenet --print_training_accuracy=True --num_epochs=400 --learning_rate=0.045 --learning_rate_decay_factor=0.94 --num_epochs_per_decay=2.5 --train_dir=/tmp/train --summary_verbosity=1 --save_summaries_steps=60 --save_model_secs=600

With that, the exponential-decay learning rate schedule starts at 0.045 and is multiplied by 0.94 (a 6% reduction) every 1271167 / (64 * 8) * 2.5 ≈ 6207 steps. The rate seems to drop too fast for a long training run (400 epochs): the training loss doesn't drop enough at the beginning and stays very flat in the middle and at the end of training because the learning rate falls too low.

What do you think the tf_cnn_benchmarks command-line parameters should be for successfully training MobileNet v1 on ImageNet using this benchmark?

Thanks!

h8907283 avatar Mar 27 '18 18:03 h8907283

For multi-GPU CIFAR-10 we recommend looking at https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator instead. The other implementation you linked is likely out of date.

nealwu avatar Mar 27 '18 18:03 nealwu

Thanks @nealwu.

The following is in https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/cifar10_main.py:

num_batches_per_epoch = cifar10.Cifar10DataSet.num_examples_per_epoch('train') // (params.train_batch_size * num_workers)

It looks like you're in agreement with @reedwm.

Now back to the questions in my previous post: the number of steps performed before decaying the learning rate will be much smaller with the multi-GPU setup, so should I simply bump up num_epochs_per_decay? I guess your answer would be 'yes'.
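For reference, the arithmetic behind that option, assuming "bump up" means keeping roughly the same number of steps per decay as the single-GPU schedule (rather than keeping epochs per decay fixed):

num_examples_per_epoch = 1271167
num_gpus = 8
effective_batch_size = 64 * num_gpus   # 512

# num_epochs_per_decay = 2.5 gives ~6207 steps per decay on 8 GPUs;
# scaling it by num_gpus restores roughly the single-GPU step count:
num_epochs_per_decay = 2.5 * num_gpus  # 20
decay_steps = round(num_examples_per_epoch / effective_batch_size * num_epochs_per_decay)  # ~49655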

The issue I can see is that, after some experimentation, even when the learning rate is held flat at 0.045, the training loss curves (flat 0.045 vs. exponential decay) look pretty much the same, i.e. the loss decreases too slowly and stays high after long training.

Should I be mucking with the learning rate more, or switch to a different optimizer? What's your opinion?

h8907283 avatar Mar 27 '18 20:03 h8907283

I'm not an ML expert so it's hard for me to say. For some models, to train with double the batch size, one should double the learning rate. In such cases, num_epochs_per_decay remains the same, so the number of steps per decay halves when the batch size doubles.

Using 8 GPUs effectively makes the batch size 8x, so if one were to follow that advice, one would 8x the learning rate. However, with --variable_update=replicated, the learning rate is already effectively 8x-ed (I will fix that inconsistency at some point), so it seems like you are following the above advice correctly.

Perhaps the above advice won't apply here and you do need to modify --learning_rate_decay_factor. I am not sure.
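For what it's worth, a toy sketch of how an "effective 8x learning rate" can arise, assuming the per-GPU gradients are summed rather than averaged before the update (that assumption is mine, not a statement about what replicated mode does internally):

import numpy as np

num_gpus = 8
lr = 0.045
# Toy per-GPU gradients for one variable; each GPU processed its own batch of 64.
per_gpu_grads = [np.array([0.10, -0.20]) for _ in range(num_gpus)]

summed = np.sum(per_gpu_grads, axis=0)     # the update lr * summed ...
averaged = np.mean(per_gpu_grads, axis=0)  # ... equals the update (lr * num_gpus) * averaged
assert np.allclose(lr * summed, (lr * num_gpus) * averaged)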

@suharshs @bignamehyp do either of you have any comments?

reedwm avatar Mar 27 '18 20:03 reedwm