
Scaling the hessian by learning rate

zhichul opened this issue on Jun 05, 2023 · 7 comments

Thank you for releasing the code! I'm trying to apply this method to training language models: I partition the parameters and set up a bilevel problem where the outer parameters are tuned for generalization on a dev set after training the inner parameters (with the outer fixed) on a training set.

In your paper you note that it is crucial to scale the Hessian by the learning rate (of the inner problem) for convergence of the Neumann series. I wonder if there's any specific reason you chose the learning rate as the scaling, rather than, for example, using some other small number and treating it as a separate hyperparameter?

In practice, in your RNN experiments, how much did you notice the choice of learning rate affecting convergence? From my understanding of Algorithm 3, an extremely small scaling factor would lead to very slow convergence, while an accidentally large one may lead to divergence (e.g., when the Hessian has very large eigenvalues). I would expect the Hessian (and its largest eigenvalue) to change quite a lot depending on the value of the inner parameters, which would make this scaling factor a tricky parameter to tune.
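
For concreteness, here is a tiny toy sketch of the failure mode I'm worried about (my own example with a fixed quadratic, so the Hessian is constant; not based on your code):

```python
# Toy check (my own example, not from the repo): how the K-term Neumann
# approximation of H^{-1} v behaves for different scaling factors alpha.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
H = A @ A.T + np.eye(10)                 # a fixed SPD "Hessian"
v = rng.standard_normal(10)
exact = np.linalg.solve(H, v)            # exact H^{-1} v

def neumann_inverse_hvp(H, v, alpha, K):
    """alpha * sum_{j=0}^{K} (I - alpha H)^j v, an approximation of H^{-1} v."""
    p = v.copy()
    cur = v.copy()
    for _ in range(K):
        cur = cur - alpha * (H @ cur)
        p = p + cur
    return alpha * p

lam_max = np.linalg.eigvalsh(H).max()
for alpha in [1e-4, 1.0 / lam_max, 2.5 / lam_max]:   # too small, safe, too large
    approx = neumann_inverse_hvp(H, v, alpha, K=200)
    rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(f"alpha = {alpha:.2e}  relative error = {rel_err:.3e}")
# tiny alpha -> very slow convergence; alpha > 2 / lam_max -> divergence
```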

Thanks in advance for any advice on this!

Best, Brian

zhichul avatar Jun 05 '23 01:06 zhichul

Hello,

Thanks for your interest in the work! With regards to your question:

You ask about the motivation for using the learning rate as the scale in the Neumann series. Here is an explanation: the K-step Neumann series is intimately related to differentiating through K steps of optimization, another gradient-based HO technique. In fact, when you write them out, they are the same on a quadratic objective. The only difference is that the Neumann series from Eq. 4, $\sum_{j=0}^{K} \big(I - \alpha H\big|_{w^*}\big)^j$, changes to $\sum_{j=0}^{K} \prod_{k \le j} \big(I - \text{lr} \cdot H\big|_{w_k}\big)$, where $w^*$ is the optimal $w$ and $w_k$ is the weight from the $k$-th step of optimization. K-step differentiation is prohibitively memory-expensive because it requires storing all of these intermediate weights. However, the correspondence provides a heuristic: re-use the learning rate from our inner optimization as the scaling on the Hessian in the Neumann series. In fact, you can show that for a quadratic, the learning rate which converges the quickest is exactly the same as the scaling factor for the Neumann series which converges the quickest.
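
If it helps, here is a quick sketch of that equivalence on a toy quadratic (an illustrative example written for this reply, not code from the repo):

```python
# Sketch: on a quadratic inner objective, backprop through K steps of SGD
# gives the same vector-Jacobian product as the K-term Neumann series
# scaled by the inner learning rate.
import torch

torch.manual_seed(0)
d = 5
A = torch.randn(d, d)
H = A @ A.T + torch.eye(d)                    # constant inner Hessian (quadratic objective)
b = torch.randn(d, requires_grad=True)        # "outer" parameter entering the inner loss linearly
lr = 1.0 / torch.linalg.eigvalsh(H).max()     # inner learning rate
K = 100
v = torch.randn(d)                            # vector for the vector-Jacobian product

# (a) differentiate through K steps of SGD on f(w) = 0.5 w^T H w - w^T b, from w_0 = 0
w = torch.zeros(d)
for _ in range(K):
    w = w - lr * (H @ w - b)                  # keep the graph so we can backprop through the unroll
(unrolled,) = torch.autograd.grad(w, b, grad_outputs=v)

# (b) K-term Neumann approximation of v^T (dw*/db) = v^T H^{-1},
#     using the inner learning rate as the scaling on the Hessian
p = v.clone()
cur = v.clone()
for _ in range(K - 1):
    cur = cur - lr * (H @ cur)
    p = p + cur
neumann = lr * p

print((unrolled - neumann).abs().max())       # ~0, up to float error
```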

However, this is only a heuristic to save a search. Maybe an alpha that is smaller than the LR will be more stable? Or maybe a larger alpha will converge more quickly? It could be worth a search, and the learning rate is just a potentially reasonable starting point.

You note that the Hessian's spectrum could change over optimization, so we may want a different alpha. This is definitely true. We have the same problem with choosing a (single) learning rate :) Most bounds I've seen assume strongly convex and Lipschitz functions and aim for the fastest rate in the worst case. I suspect we could do better by dynamically choosing a learning rate / alpha, but being able to select this in a smart way could be difficult.

I hope this helps!

Best, Jon


lorraine2 avatar Jun 05 '23 03:06 lorraine2

Hi Jon,

Thanks for the detailed explanation - makes a lot of sense now!

In fact, when you write them out, they are the same on a quadratic objective. The only difference is that the Neumann series from Eq. 4, $\sum_{j=0}^{K} \big(I - \alpha H\big|_{w^*}\big)^j$, changes to $\sum_{j=0}^{K} \prod_{k \le j} \big(I - \text{lr} \cdot H\big|_{w_k}\big)$, where $w^*$ is the optimal $w$ and $w_k$ is the weight from the $k$-th step of optimization.

Thanks for pointing out the precise connection! I had missed the learning-rate factor in the diff-through-opt gradient. The derivations / forms in the main text for diff-through-opt assume $\alpha = 1.0$ (which I believe is not required by the proofs in the appendix, though it certainly makes the algebra cleaner and more concise!).

In fact, you can show that for a quadratic, the learning rate which converges the quickest is exactly the same as the scaling factor for the Neumann series which converges the quickest.

That's so neat! :-)

Maybe an alpha that is smaller than the LR will be more stable? Or maybe a larger alpha will converge more quickly? It could be worth a search, and the learning rate is just a potentially reasonable starting point. I suspect we could do better by dynamically choosing a learning rate / alpha, but being able to select this in a smart way could be difficult.

Yes, I agree. I will probably try a constant alpha, searching towards the larger end, to see how many iterations I can save on the Neumann series without suffering too much error.

Thanks again for the quick response!

Best, Brian

zhichul avatar Jun 05 '23 04:06 zhichul

Hi Jon,

A follow-up question: I found there are two versions of Algorithm 3, one in your AISTATS paper and another in the arXiv version. I am wondering if the following is the correct one:

Algorithm 3 approxInverseHVP(v, f): Neumann approximation of the inverse-Hessian-vector product $v[\partial f/\partial w]^{-1}$
1: Initialize sum $p = v$
2: for $j = 1 \ldots i$ do
3: $v -= \alpha \cdot \mathrm{grad}(f, w, \mathrm{grad\_outputs} = v)$
4: $p += v$
5: return $p$

Thank you, Best, Shancong

Shancong-Mou avatar Sep 05 '23 21:09 Shancong-Mou

The AISTATS version is correct. We had a sign-reversal typo in the arXiv version. Sorry for any confusion!

lorraine2 avatar Sep 05 '23 22:09 lorraine2

Thanks, the AISTATS paper returns $\alpha p$, shall we just return $p$ instead?

Shancong-Mou avatar Sep 05 '23 22:09 Shancong-Mou

The final $\alpha$ (i.e., the inner learning rate) was added to the algorithm because the Neumann series satisfies $\sum_{j=0}^{\infty} (I - T)^j = T^{-1}$. If we use $T = \alpha H$, the series gives $(\alpha H)^{-1} = \frac{1}{\alpha} H^{-1}$, and the final $\alpha$ cancels this off to leave just $H^{-1}$.

It's important to include $\alpha$ inside the Neumann series; otherwise the iteration will likely not be contracting and will not converge.

But it is not important to apply this final $\alpha$ to the returned value, because it only scales the final hypergradient, which can be absorbed into whatever scalar you apply to the learning rate, or perhaps cancelled off by an adaptive scheme like Adam.
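
For reference, here is a minimal PyTorch-style sketch of the AISTATS version of Algorithm 3, written from the discussion above rather than copied from the repo, so treat the exact names and signatures as illustrative:

```python
import torch

def approx_inverse_hvp(v, f, w, alpha, i):
    """Neumann approximation of v [d f / d w]^{-1} (sketch of AISTATS Algorithm 3).

    v     : list of tensors with the same shapes as w
    f     : gradient of the training loss w.r.t. w, computed with create_graph=True
    w     : list of inner parameters
    alpha : scaling on the Hessian; the inner learning rate is the heuristic choice
    i     : number of Neumann iterations
    """
    p = [v_.clone() for v_ in v]
    v = [v_.clone() for v_ in v]
    for _ in range(i):
        # Hessian-vector product via a vector-Jacobian product on the gradient f
        hvp = torch.autograd.grad(f, w, grad_outputs=v, retain_graph=True)
        # alpha inside the loop keeps the series contracting (line 3 of the algorithm)
        v = [v_ - alpha * h_ for v_, h_ in zip(v, hvp)]
        p = [p_ + v_ for p_, v_ in zip(p, v)]
    # the final alpha turns (alpha H)^{-1} back into H^{-1}; dropping it just
    # rescales the hypergradient, as noted above
    return [alpha * p_ for p_ in p]
```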

lorraine2 avatar Sep 05 '23 23:09 lorraine2

Thank you for the detailed explanation!

Shancong-Mou avatar Sep 05 '23 23:09 Shancong-Mou