Jonathan Lorraine

7 comments from Jonathan Lorraine

The architecture parameters are updated by differentiating through the gradient of the elementary weight update; see Eq. 5 in the paper. _compute_unrolled_model helps implement this.
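
As a rough sketch of what differentiating through one weight update looks like (illustrative PyTorch only, not the repository's actual _compute_unrolled_model; the function and loss-closure names below are made up):

```python
import torch

def unrolled_arch_grad(w, arch_params, train_loss_fn, val_loss_fn, lr=0.01):
    # w: list of weight tensors, arch_params: list of architecture tensors,
    # both requiring grad; the (hypothetical) loss closures take (weights, arch_params).
    g_w = torch.autograd.grad(train_loss_fn(w, arch_params), w, create_graph=True)
    # One "virtual" SGD step on the weights (the unrolled model's weights).
    w_unrolled = [p - lr * g for p, g in zip(w, g_w)]
    # Validation loss at the unrolled weights, differentiated w.r.t. the
    # architecture parameters (the graph flows back through g_w).
    val_loss = val_loss_fn(w_unrolled, arch_params)
    return torch.autograd.grad(val_loss, arch_params)
```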

Hello, thanks for your interest in the work! Regarding your question: you ask about the motivation for using the learning rate as a scale in the Neumann series....

The AISTATS version is correct. We had a sign-reversal typo in the arXiv version. Sorry for any confusion about that!

The final alpha (i.e., the inner learning rate) was added to the algorithm because Neumann(I - T) = T^{-1}, so if we use T = alpha*H, then the Neumann series...
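
Spelled out, this is the standard Neumann-series identity, with H the training-loss Hessian:

```latex
% Neumann-series identity: \sum_{j \ge 0} (I - T)^j = T^{-1} when \|I - T\| < 1.
% Choosing T = \alpha H gives
\sum_{j=0}^{\infty} (I - \alpha H)^j \;=\; (\alpha H)^{-1} \;=\; \tfrac{1}{\alpha} H^{-1},
% so the truncated sum is multiplied by \alpha at the end to approximate H^{-1}.
```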

We estimate the terms over mini-batches of the same batch size we use for optimizing the training loss. Perhaps other mini-batch sizes (e.g., bigger) could be...
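
For concreteness, here is a minimal sketch (assumed PyTorch; all names are illustrative, not the paper's code) of how those Neumann-series terms might be estimated, with train_loss computed on a single mini-batch:

```python
import torch

def neumann_ihvp(v, train_loss, params, alpha=0.1, num_terms=20):
    # Approximate H^{-1} v via alpha * sum_{j=0..K} (I - alpha*H)^j v, where H is
    # the Hessian of `train_loss` (evaluated on one mini-batch) w.r.t. `params`.
    grads = torch.autograd.grad(train_loss, params, create_graph=True)
    p = [vi.clone() for vi in v]        # current term (I - alpha*H)^j v
    s = [vi.clone() for vi in v]        # running sum of the terms
    for _ in range(num_terms):
        hvp = torch.autograd.grad(grads, params, grad_outputs=p, retain_graph=True)
        p = [pi - alpha * hi for pi, hi in zip(p, hvp)]
        s = [si + pi for si, pi in zip(s, p)]
    return [alpha * si for si in s]     # the final alpha recovers H^{-1} v
```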

In the paper we were not explicit about when to use mini-batches with that notation. But in practice, every time we evaluate L_t and L_v for updates we use...

For gradient accumulation I meant: first evaluate the hypergradient (i.e., the hyperparameter update) using whatever batch size fits in memory (e.g., bs=1 for Lv and Lt). Then store this hypergradient...
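
A minimal sketch of that accumulation loop (illustrative only; hypergrad_fn and the other names are hypothetical, and a real setup would likely apply the averaged hypergradient through a hyperparameter optimizer):

```python
import torch

def accumulate_hypergradient(hyperparams, hypergrad_fn, batch_pairs, hyper_lr=1e-3):
    # Compute the hypergradient on each small (train, val) batch pair that fits
    # in memory, average the results, then apply a single hyperparameter update.
    # `hypergrad_fn` is a hypothetical callable returning dL_v/d(hyperparams)
    # for one pair of batches.
    accum = [torch.zeros_like(h) for h in hyperparams]
    n = 0
    for train_batch, val_batch in batch_pairs:
        hg = hypergrad_fn(train_batch, val_batch)
        accum = [a + g for a, g in zip(accum, hg)]
        n += 1
    with torch.no_grad():
        for h, a in zip(hyperparams, accum):
            h -= hyper_lr * a / max(n, 1)   # one step from the averaged hypergradient
```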