
Model Agnostic Meta Learning Algorithms

howardyclo opened this issue · 2 comments

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML)

  • Authors: Chelsea Finn, Pieter Abbeel, Sergey Levine
  • Organization: UC Berkeley & OpenAI
  • Conference: ICML 2017
  • Paper: https://arxiv.org/abs/1703.03400
  • Code: https://github.com/cbfinn/maml

On First-Order Meta-Learning Algorithms (Reptile)

  • Authors: Alex Nichol, Joshua Achiam, John Schulman
  • Organization: OpenAI
  • Paper: https://arxiv.org/abs/1803.02999

Probabilistic Model-Agnostic Meta-Learning

  • Authors: Chelsea Finn, Kelvin Xu, Sergey Levine
  • Organization: UC Berkeley & OpenAI
  • Conference: NeurIPS 2018
  • Paper: https://arxiv.org/abs/1806.02817

Bayesian Model-Agnostic Meta-Learning

  • Authors: Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, Sungjin Ahn
  • Organization: Element AI, MILA, SAP, Kakao Brain, CIFAR Senior Fellow, Rutgers University
  • Conference: NeurIPS 2018
  • Paper: https://arxiv.org/abs/1806.03836

Meta-Learning with Latent Embedding Optimization (LEO)

  • Authors: Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero & Raia Hadsell
  • Organization: DeepMind
  • Conference: ICLR 2019
  • Paper: https://arxiv.org/abs/1807.05960

How to Train Your MAML (MAML++)

  • Authors: Antreas Antoniou, Harrison Edwards, Amos Storkey
  • Organization: University of Edinburgh & OpenAI
  • Conference: ICLR 2019
  • Paper: https://arxiv.org/abs/1810.09502
  • Code: https://github.com/AntreasAntoniou/HowToTrainYourMAMLPytorch

howardyclo · May 11 '19 07:05

MAML

  • A model-agnostic meta-learning algorithm (it only assumes the model is trained with gradient descent) that aims to find an initialization θ from which the model can be fine-tuned quickly on new tasks.
  • Shows state-of-the-art results on few-shot image classification and regression, and fast fine-tuning for policy-gradient RL.

How does it work?

  • Sample a batch of tasks, each with a few training examples and a held-out "virtual" test set (the "virtual" test set is constructed from the task's own data).
  • For each task_{i}:
    • Compute the gradient of L(θ) on the training data and take an inner update θ -> θ'_{i}.
    • Compute L(θ'_{i}) on the virtual test data.
  • Compute the gradient of Σ_{i} L(θ'_{i}) w.r.t. θ (differentiating through the inner update) and update θ.
  • The loss can be cross-entropy for classification, MSE for regression, or negative expected reward for RL.
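The loop above can be sketched end-to-end on a toy problem. This is a minimal numpy sketch, not the authors' code: the task distribution (regress y = a·x for a random slope a), the scalar model, and the learning rates alpha/beta are all hypothetical choices made for illustration. Because the model is linear and the loss quadratic, the gradient through the inner update can be written in closed form, making the second-order term explicit.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.01, 0.005   # inner / outer learning rates (hypothetical values)
theta = 0.0                 # meta-initialization of the scalar model y = theta * x

def grad(theta, x, y):
    # dL/dtheta for L(theta) = mean((theta*x - y)^2)
    return 2.0 * np.mean(x * (theta * x - y))

for step in range(2000):
    meta_grad = 0.0
    for _ in range(4):                       # batch of sampled tasks
        a = rng.uniform(0.5, 1.5)            # task_i: regress y = a * x
        x_tr, x_te = rng.normal(size=10), rng.normal(size=10)
        y_tr, y_te = a * x_tr, a * x_te
        # inner update: theta -> theta'_i on the task's training data
        theta_i = theta - alpha * grad(theta, x_tr, y_tr)
        # outer gradient: d L_test(theta'_i) / d theta via the chain rule;
        # the (1 - alpha * d^2 L_train / dtheta^2) factor is the second-order term
        hess_tr = 2.0 * np.mean(x_tr ** 2)
        meta_grad += grad(theta_i, x_te, y_te) * (1.0 - alpha * hess_tr)
    theta -= beta * meta_grad / 4            # outer update on theta
```

After meta-training, theta sits near the center of the slope range, i.e. an initialization from which one inner step adapts well to any sampled task.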

FOMAML

  • They also tried ignoring the second-derivative terms: update θ using Σ_{i} ∇L(θ'_{i}), treating each adapted θ'_{i} as if it did not depend on θ.
  • This is denoted first-order MAML (FOMAML).
  • It performs comparably to second-order MAML while saving substantial computation.
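The two meta-gradients can be compared directly on the same toy setup (a scalar linear model on a hypothetical task y = a·x; the values of alpha, theta, and a below are illustrative, not from the paper). For quadratic losses the second-order correction is a simple scalar factor, so dropping it is easy to see:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, theta = 0.01, 0.3       # inner learning rate and current init (hypothetical)
a = 1.2                        # sampled task: regress y = a * x

def grad(theta, x, y):
    # dL/dtheta for L(theta) = mean((theta*x - y)^2)
    return 2.0 * np.mean(x * (theta * x - y))

x_tr, x_te = rng.normal(size=50), rng.normal(size=50)
y_tr, y_te = a * x_tr, a * x_te

theta_i = theta - alpha * grad(theta, x_tr, y_tr)   # inner update
g_outer = grad(theta_i, x_te, y_te)                 # test-loss gradient at theta'_i

# Full MAML keeps the second-order factor from differentiating through theta'_i;
# FOMAML simply drops it and uses the test gradient as-is.
maml_grad = g_outer * (1.0 - alpha * 2.0 * np.mean(x_tr ** 2))
fomaml_grad = g_outer
```

With a small inner learning rate the two gradients differ only by a factor of order alpha, which is why FOMAML stays competitive in practice.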

howardyclo · May 11 '19 09:05

Reptile

  • Introduces a new first-order meta-learning algorithm, Reptile, which repeatedly samples a task, trains on it for several steps, and moves the initialization toward the weights trained on that task.
  • Really worth reading, especially for its analysis relating SGD and MAML.

How does it work?

  • Let U^{k}_{T}(θ) denote the result of taking k gradient updates on the sampled task T. The Reptile update is θ ← θ + ϵ (U^{k}_{T}(θ) − θ), where ϵ is the outer step size.
  • We can also update in a batched version over n sampled tasks: θ ← θ + ϵ · (1/n) Σ_{i=1}^{n} (U^{k}_{T_i}(θ) − θ).
  • If we take only k = 1 update, this is essentially SGD on the expected loss (joint training).
  • If we take k > 1 updates, it is not: the expected update is no longer the gradient of the expected loss, and the algorithm converges to a different point that depends on second- and higher-order derivatives.
  • An experiment on 5-shot 5-way Omniglot compares different inner-loop gradient combinations (figure omitted here).
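The update above is short enough to sketch in full. This is a toy numpy illustration, not the authors' code: the task family (regress y = a·x for a random slope a), the scalar model, and the values of alpha, eps, and k are all hypothetical. Note there is no train/test split and no differentiation through the inner loop — just k SGD steps and an interpolation toward the adapted weights.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, eps, k = 0.02, 0.1, 5   # inner lr, outer step size, inner steps (hypothetical)
theta = 0.0                    # initialization of the scalar model y = theta * x

def grad(theta, x, y):
    # dL/dtheta for L(theta) = mean((theta*x - y)^2)
    return 2.0 * np.mean(x * (theta * x - y))

for step in range(1000):
    a = rng.uniform(0.5, 1.5)           # sample a task T: regress y = a * x
    x = rng.normal(size=10)
    y = a * x
    w = theta
    for _ in range(k):                   # U^k_T: k gradient steps on T's data
        w = w - alpha * grad(w, x, y)
    theta = theta + eps * (w - theta)    # move the init toward the adapted weights
```

With k = 1 this collapses to plain SGD on the mixed task distribution; with k > 1 the adapted weights w carry higher-order information, and theta converges to an initialization that adapts quickly across tasks.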

Why does it work?

  • Through Taylor expansion, the expected updates of both MAML and Reptile contain the same two leading-order terms:
    • First (AvgGrad): minimizes the expected loss (joint training across tasks).
    • Second (AvgGradInner): maximizes within-task generalization, i.e., the inner product between gradients of different minibatches from the same task. When gradients from different batches have a positive inner product, taking a gradient step on one batch improves performance on the other.
  • The paper gives the resulting Taylor expansions of the SGD, FOMAML, and MAML updates (for i in [1, k]); the algorithms differ only in the coefficients on these two terms (equation omitted here).
  • This explains why k = 2 in the experiment above is still insufficient: with small k, Reptile puts less weight on the inner-product term relative to the AvgGrad term.
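The k = 2 case makes the coefficients explicit. Reconstructed here from the Reptile paper's analysis (Section 5.1; not shown in the original note), with ḡ_i and H̄_i the gradient and Hessian of minibatch i evaluated at the initial point, AvgGrad = E[ḡ_i], and AvgGradInner = E[H̄_2 ḡ_1]:

```latex
% Leading-order expected gradients for k = 2 (Reptile paper, Sec. 5.1):
% AvgGrad drives joint training; AvgGradInner rewards within-task gradient alignment.
\begin{aligned}
\mathbb{E}[g_{\mathrm{MAML}}]    &= \mathrm{AvgGrad} - 2\alpha\,\mathrm{AvgGradInner} + O(\alpha^2) \\
\mathbb{E}[g_{\mathrm{FOMAML}}]  &= \mathrm{AvgGrad} - \alpha\,\mathrm{AvgGradInner} + O(\alpha^2) \\
\mathbb{E}[g_{\mathrm{Reptile}}] &= 2\,\mathrm{AvgGrad} - \alpha\,\mathrm{AvgGradInner} + O(\alpha^2)
\end{aligned}
```

The ratio of the AvgGradInner weight to the AvgGrad weight is 2α for MAML, α for FOMAML, and α/2 for Reptile, which is why Reptile with only k = 2 under-weights the within-task generalization term.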

howardyclo · May 11 '19 11:05