Advanced optimizer with Gradient Centralization
Please refer to the paper "Gradient Centralization: A New Optimization Technique for Deep Neural Networks" (see References).
Introduction
We embed GC into several advanced DNN optimizers, including SGD (SGD.py), Adam (Adam.py), AdamW, RAdam, Lookahead+SGD, Lookahead+Adam, and Ranger.
There are three hyper-parameters: use_gc, gc_conv_only and gc_loc.
- use_gc=True adds the GC operation to the optimizer; use_gc=False disables it.
- gc_conv_only=True applies GC only to convolutional layers; gc_conv_only=False applies it to both convolutional and fully connected layers.
- gc_loc controls where the GC operation is placed in adaptive learning rate algorithms such as Adam, RAdam and Ranger. GC can be applied either to the original gradient or to the generalized gradient, i.e. the variable that is directly used to update the weights. For adaptive learning rate algorithms we suggest gc_loc=False. For SGD the two locations are equivalent, so SGD does not take a gc_loc argument.
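To make the use_gc and gc_conv_only flags concrete, the GC operation itself simply subtracts from each weight gradient its mean taken over all dimensions except the output-channel dimension. The sketch below illustrates this idea in PyTorch; the function name centralize_gradient and the dimension handling are illustrative assumptions, not this package's internal API.

```python
import torch

def centralize_gradient(grad: torch.Tensor, gc_conv_only: bool = False) -> torch.Tensor:
    """Illustrative sketch of GC: remove the per-output-channel mean from the gradient."""
    if gc_conv_only:
        apply_gc = grad.dim() > 3   # Conv weights only: shape (out, in, kH, kW)
    else:
        apply_gc = grad.dim() > 1   # Conv and FC weights; 1-D biases are left untouched
    if apply_gc:
        # Mean over every dimension except dim 0 (the output-channel / output-unit dimension).
        grad = grad - grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)
    return grad
```

With gc_loc=True, a transformation like the one above would be applied to the raw gradient before the moment estimates are formed; with gc_loc=False it is instead applied to the generalized gradient, i.e. the final update direction.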
We also give an example of how to use these optimizers on CIFAR. For example:
# SGD
optimizer = SGD(net.parameters(), lr=args.lr, momentum=0.9, weight_decay=args.weight_decay, use_gc=True, gc_conv_only=False)
# Adam
optimizer = Adam(net.parameters(), lr=args.lr, weight_decay=args.weight_decay, use_gc=True, gc_conv_only=False, gc_loc=False)
# RAdam
optimizer = RAdam(net.parameters(), lr=args.lr, weight_decay=args.weight_decay, use_gc=True, gc_conv_only=False, gc_loc=False)
# Lookahead + SGD
base_opt = SGD(net.parameters(), lr=args.lr, momentum=0.9, weight_decay=args.weight_decay, use_gc=False, gc_conv_only=False)
optimizer = Lookahead(base_opt, k=5, alpha=0.5)
# Ranger
optimizer = Ranger(net.parameters(), lr=args.lr, weight_decay=args.weight_decay, use_gc=True, gc_conv_only=False, gc_loc=False)
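As a concrete illustration, the snippet below sketches how one of these optimizers might plug into a standard CIFAR-10 training loop. The model, data loading, loss, and hyper-parameter values are ordinary PyTorch boilerplate added here for illustration, and the import path for SGD is an assumption; only the optimizer construction mirrors the examples above.

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

from SGD import SGD  # assumed import path; adjust to where SGD.py lives in your project

# Minimal CIFAR-10 setup (illustrative; any model and dataset work the same way).
transform = transforms.Compose([transforms.ToTensor()])
trainset = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True)

net = torchvision.models.resnet18(num_classes=10)
criterion = nn.CrossEntropyLoss()

# GC-enabled SGD, matching the SGD example above.
optimizer = SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4,
                use_gc=True, gc_conv_only=False)

for epoch in range(2):  # short run for illustration
    for inputs, targets in trainloader:
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()  # GC is applied inside the optimizer's step
```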
References:
- Adam: https://arxiv.org/abs/1412.6980
- AdamW: https://arxiv.org/abs/1711.05101
- Lookahead: https://arxiv.org/abs/1907.08610
- RAdam: https://arxiv.org/abs/1908.03265, https://github.com/LiyuanLucasLiu/RAdam
- Ranger: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer
- Gradient Centralization: https://arxiv.org/abs/2004.01461v2