hypergradient-descent
Ideas for Extension
Hello Atilim,
I would like to share a few ideas for extending the method.
- Warm Restarts: It would be great to use the method in a cyclic learning-rate fashion. I have tried resetting the learning rate externally whenever it drops below a threshold, and decaying the initial value it resets to according to the epoch (a rough sketch follows below). But I am sure that you can come up with a mathematically more robust way of doing this. https://arxiv.org/abs/1608.03983
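A rough sketch of what I mean, assuming the hypergradient optimizer exposes its online-adapted learning rate through `param_groups[0]['lr']`; `min_lr`, `restart_lr` and the per-restart `decay` here are hypothetical tuning knobs:

```python
def maybe_warm_restart(optimizer, restart_lr, min_lr=1e-6, decay=0.5):
    """If the hypergradient-adapted learning rate has collapsed below `min_lr`,
    reset it to `restart_lr` and shrink the restart value for the next cycle."""
    group = optimizer.param_groups[0]
    if group['lr'] < min_lr:
        group['lr'] = restart_lr   # warm restart: jump the learning rate back up
        restart_lr *= decay        # next restart starts from a lower peak
    return restart_lr

# in the training loop, e.g. once per epoch:
# restart_lr = maybe_warm_restart(optimizer, restart_lr)
```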
- Sparsification: The method offers a good way of detecting convergence in order to sparsify the smallest weights of the network, as this has been shown to be useful in dense-sparse-dense training (https://arxiv.org/abs/1607.04381). Below is code to perform such sparsification:
```python
import math

import torch
import torch.nn as nn

def sparsify(module, sparsity=0.25):
    """Zero out the smallest-magnitude `sparsity` fraction of weights in Conv1d/Linear layers."""
    for m in module.modules():
        if isinstance(m, (nn.Conv1d, nn.Linear)):
            wv = m.weight.data.view(-1)
            mask = torch.zeros_like(m.weight, dtype=torch.bool)  # same device as the weights
            k = int(math.floor(sparsity * wv.numel()))
            # indices of the k smallest-magnitude weights
            smallest_idx = wv.abs().topk(k, dim=0, largest=False)[1]
            mask.view(-1)[smallest_idx] = True
            m.weight.data.masked_fill_(mask, 0.)
```
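One way this could be wired in is a small per-epoch check, assuming convergence is read off the hypergradient-adapted learning rate in `param_groups[0]['lr']` (the `tol` plateau threshold and the prune-once behaviour are hypothetical choices):

```python
def maybe_sparsify(optimizer, model, lr_prev, sparsified, tol=1e-6):
    """Call once per epoch: prune when the adapted learning rate has plateaued."""
    lr_now = optimizer.param_groups[0]['lr']
    if not sparsified and abs(lr_now - lr_prev) < tol:
        sparsify(model, sparsity=0.25)   # prune the smallest 25% of weights
        sparsified = True                # later re-densify, as in dense-sparse-dense
    return lr_now, sparsified
```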
- Partially Adaptive Momentum Estimation: Given that there is research supporting the idea of switching from Adam to SGD at later epochs for better generalization, I have implemented this by starting with a fully adaptive setting of the partial parameter and decaying it to a hypertuned lower value (between 0.0 and 1.0), as in the update and the schedule sketch below. I am curious whether the proposed method here can also provide a better, dynamic way of achieving this.
```python
p.data.addcdiv_(-step_size, exp_avg, denom**partial)
```
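For reference, a minimal sketch of the decay schedule I mean, where `start`, `end`, and the linear annealing are hypothetical hypertuned choices; the update line above would then use the scheduled `partial` at each epoch:

```python
def partial_schedule(epoch, total_epochs, start=1.0, end=0.125):
    """Linearly anneal the `partial` exponent from an Adam-like setting (start)
    toward an SGD-like one (end) over the course of training."""
    t = min(epoch / float(total_epochs), 1.0)
    return start + t * (end - start)

# each epoch: partial = partial_schedule(epoch, total_epochs)
# then, inside the optimizer step, as in the line above:
# p.data.addcdiv_(-step_size, exp_avg, denom**partial)
```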
I would also investigate how it can help enable "Super Convergence": https://arxiv.org/abs/1708.07120
Another related direction: Adaptive Gradient Methods with Dynamic Bound of Learning Rate (AdaBound): https://arxiv.org/abs/1902.09843