Jonathan Lorraine

7 comments from Jonathan Lorraine

The architecture parameters are updated by differentiating through the gradient of the elementary weight update; see Eq. 5 in the paper. _compute_unrolled_model helps implement this.
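
As a rough sketch of what differentiating through one weight update looks like (illustrative PyTorch only, not the repository's actual _compute_unrolled_model; the function and loss-closure names below are made up):

```python
import torch

def unrolled_arch_grad(w, arch_params, train_loss_fn, val_loss_fn, lr=0.01):
    # w: list of weight tensors, arch_params: list of architecture tensors,
    # both requiring grad; the (hypothetical) loss closures take (weights, arch_params).
    g_w = torch.autograd.grad(train_loss_fn(w, arch_params), w, create_graph=True)
    # One "virtual" SGD step on the weights (the unrolled model's weights).
    w_unrolled = [p - lr * g for p, g in zip(w, g_w)]
    # Validation loss at the unrolled weights, differentiated w.r.t. the
    # architecture parameters (the graph flows back through g_w).
    val_loss = val_loss_fn(w_unrolled, arch_params)
    return torch.autograd.grad(val_loss, arch_params)
```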

Hello, thanks for your interest in the work! Regarding your question: you ask about the motivation for using the learning rate as a scale in the Neumann series....

The AISTATS version is correct. We had a sign-reversal typo in the arXiv version. Sorry for any confusion about that!

The final alpha (i.e., the inner learning rate) was added to the algorithm because Neumann(I - T) = T^{-1}, so if we use T = alpha*H, then the Neumann series...
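
Spelled out, this is the standard Neumann-series identity, with H the training-loss Hessian:

```latex
% Neumann-series identity: \sum_{j \ge 0} (I - T)^j = T^{-1} when \|I - T\| < 1.
% Choosing T = \alpha H gives
\sum_{j=0}^{\infty} (I - \alpha H)^j \;=\; (\alpha H)^{-1} \;=\; \tfrac{1}{\alpha} H^{-1},
% so the truncated sum is multiplied by \alpha at the end to approximate H^{-1}.
```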

We estimate the terms over mini-batches of the same batch size we use for optimizing the training loss. Perhaps other mini-batch sizes (e.g., bigger) could be...
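
For concreteness, here is a minimal sketch (assumed PyTorch; all names are illustrative, not the paper's code) of how those Neumann-series terms might be estimated, with train_loss computed on a single mini-batch:

```python
import torch

def neumann_ihvp(v, train_loss, params, alpha=0.1, num_terms=20):
    # Approximate H^{-1} v via alpha * sum_{j=0..K} (I - alpha*H)^j v, where H is
    # the Hessian of `train_loss` (evaluated on one mini-batch) w.r.t. `params`.
    grads = torch.autograd.grad(train_loss, params, create_graph=True)
    p = [vi.clone() for vi in v]        # current term (I - alpha*H)^j v
    s = [vi.clone() for vi in v]        # running sum of the terms
    for _ in range(num_terms):
        hvp = torch.autograd.grad(grads, params, grad_outputs=p, retain_graph=True)
        p = [pi - alpha * hi for pi, hi in zip(p, hvp)]
        s = [si + pi for si, pi in zip(s, p)]
    return [alpha * si for si in s]     # the final alpha recovers H^{-1} v
```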

In the paper we were not explicit about when to use mini-batches with that notation. But in practice, every time we evaluate L_t and L_v for updates we use...

For gradient accumulation I meant: first evaluate the hypergradient (i.e., the hyperparameter update) using whatever batch size fits in memory (e.g., bs=1 for Lv and Lt). Then store this hypergradient...
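
A minimal sketch of that accumulation loop (illustrative only; hypergrad_fn and the other names are hypothetical, and a real setup would likely apply the averaged hypergradient through a hyperparameter optimizer):

```python
import torch

def accumulate_hypergradient(hyperparams, hypergrad_fn, batch_pairs, hyper_lr=1e-3):
    # Compute the hypergradient on each small (train, val) batch pair that fits
    # in memory, average the results, then apply a single hyperparameter update.
    # `hypergrad_fn` is a hypothetical callable returning dL_v/d(hyperparams)
    # for one pair of batches.
    accum = [torch.zeros_like(h) for h in hyperparams]
    n = 0
    for train_batch, val_batch in batch_pairs:
        hg = hypergrad_fn(train_batch, val_batch)
        accum = [a + g for a, g in zip(accum, hg)]
        n += 1
    with torch.no_grad():
        for h, a in zip(hyperparams, accum):
            h -= hyper_lr * a / max(n, 1)   # one step from the averaged hypergradient
```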