dlbench
[BUG] Learning rate is not passed to network scripts
From benchmark.py and configs/*.config, we know dlbench provides the capability of changing the learning rate.
However, only Caffe, Torch, and MXNet accept the learning rate argument, while CNTK and TensorFlow ignore it.
# tools/cntk/cntkbm.py has no lr argument defined.
# tools/tensorflow/tensorflow.py has no lr argument defined.
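For illustration, here is a minimal sketch (not the actual dlbench code; build_train_command and the flag names are hypothetical) of how the two wrapper scripts could accept and forward the flag:

# Hypothetical sketch of exposing the missing flag in cntkbm.py / tensorflow.py.
import argparse

def build_train_command(args):
    # Forward the learning rate to the underlying network script.
    return ['python', 'train.py', '--batch-size', str(args.batch_size), '--lr', str(args.lr)]

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--lr', type=float, default=0.01, help='learning rate forwarded to the network script')
    args = parser.parse_args()
    print(' '.join(build_train_command(args)))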
Furthermore, the effective learning rate is not the same across tools when running the benchmark. For example, TensorFlow uses a constant value, while MXNet's learning rate changes during training.
# From tools/mxnet/common/fit.py
steps = [epoch_size * (x - begin_epoch) for x in step_epochs if x - begin_epoch > 0]  # Default value of step_epochs is '200,250' from tools/mxnet/train_cifar10.py
return (lr, mx.lr_scheduler.MultiFactorScheduler(step=steps, factor=args.lr_factor))
......
optimizer_params = {'learning_rate': lr,
                    'momentum': args.mom,
                    'wd': args.wd,
                    'lr_scheduler': lr_scheduler}  # This scheduler will change the learning rate during training
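To make the difference concrete, here is a standalone sketch (no MXNet required; epoch_size is illustrative and the milestone boundary handling is simplified) of what this scheduler does to the learning rate:

# Simplified model of MultiFactorScheduler: multiply the learning rate by
# `factor` once each time the update count passes a milestone in `steps`.
def scheduled_lr(base_lr, steps, factor, num_update):
    lr = base_lr
    for s in steps:
        if num_update > s:
            lr *= factor
    return lr

epoch_size = 390                                  # illustrative, e.g. 50000 images / batch size 128
steps = [epoch_size * e for e in (200, 250)]      # step_epochs = '200,250'
for update in (1, steps[0] + 1, steps[1] + 1):
    print(update, scheduled_lr(0.1, steps, factor=0.1, num_update=update))
# Prints 0.1, then 0.01, then 0.001: the learning rate is not constant.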
Please make all tools support the learning rate parameter, or remove the learning rate from the config.
The learning rate schedule of MXNet is not used. Please check the code: https://github.com/hclhkbu/dlbench/blob/master/tools/mxnet/common/fit.py#L8. The lr_factor parameter is set to None. For CNTK and TF, we set the learning rate to be fixed.
@shyhuai But you set the default value of lr_factor to 0.1 at https://github.com/hclhkbu/dlbench/blob/master/tools/mxnet/common/fit.py#L63.
train.add_argument('--lr-factor', type=float, default=0.1, help='the ratio to reduce lr on each step')
So, whether or not we explicitly set lr_factor in the command-line arguments, argparse.ArgumentParser will always set it. Check the log of the MXNet MNIST run:
INFO:root:start with arguments Namespace(batch_size=1024, data_dir='/home/shaocs/dlbench/dataset/mxnet/mnist', disp_batches=100, gpus='0', kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=2, num_examples=60000, num_layers=None, num_nodes=1, optimizer='sgd', test_io=0, top_k=0, wd=1e-05)
......
INFO:root:Update[586]: Change learning rate to 5.00000e-03 # This is printed after 8 epochs
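To demonstrate the point outside dlbench, here is a minimal snippet showing that argparse applies the default even when --lr-factor is never passed:

# argparse fills in the default value when the flag is absent; this is why
# lr_factor=0.1 shows up in the Namespace printed above.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--lr-factor', type=float, default=0.1, help='the ratio to reduce lr on each step')
args = parser.parse_args([])   # no --lr-factor given on the command line
print(args.lr_factor)          # prints 0.1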
@shishaochen Thanks for your feedback. Since we set lr_factor=1 in the script mxnetbm.py, the learning rate will not be changed during training. If you use the mxnetbm.py script, there should be no problem. Here is a log for your reference: http://dlbench.comp.hkbu.edu.hk/logs/?f=mxnet-fc-fcn5-gpu0-K80-b512-Tue_Mar__7_10:52:06_2017-gpu20.log. To avoid misunderstanding, I have revised the code to set the default value to None. Thanks again for your report.
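For illustration, a minimal sketch (the guard below is illustrative, not necessarily the exact fit.py logic) of how a None default can keep the learning rate constant unless decay is explicitly requested:

# With default=None the wrapper can skip building any scheduler, so the
# learning rate from the config stays constant during training.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--lr-factor', type=float, default=None, help='the ratio to reduce lr on each step (None = constant lr)')
args = parser.parse_args([])   # flag not given

if args.lr_factor is None or args.lr_factor >= 1:
    lr_scheduler = None                         # keep the configured learning rate fixed
else:
    lr_scheduler = 'build MultiFactorScheduler here'
print(lr_scheduler)                             # None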
@shyhuai Sorry, I cannot find "factor" being set in https://github.com/hclhkbu/dlbench/blob/master/tools/mxnet/mxnetbm.py. Maybe you set it locally but the change has not been committed yet.