Model-Agnostic Meta-Learning Algorithms
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML)
- Authors: Chelsea Finn, Pieter Abbeel, Sergey Levine
- Organization: UC Berkeley & OpenAI
- Conference: ICML 2017
- Paper: https://arxiv.org/abs/1703.03400
- Code: https://github.com/cbfinn/maml
On First-Order Meta-Learning Algorithms (Reptile)
- Authors: Alex Nichol, Joshua Achiam, John Schulman
- Organization: OpenAI
- Paper: https://arxiv.org/abs/1803.02999
Probabilistic Model-Agnostic Meta-Learning
- Authors: Chelsea Finn, Kelvin Xu, Sergey Levine
- Organization: UC Berkeley & OpenAI
- Conference: NIPS 2018
- Paper: https://arxiv.org/abs/1806.02817
Bayesian Model-Agnostic Meta-Learning
- Authors: Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, Sungjin Ahn
- Organization: Element AI, MILA, SAP, Kakao Brain, CIFAR Senior Fellow, Rutgers University
- Conference: NIPS 2018
- Paper: https://arxiv.org/abs/1806.03836
Meta-Learning with Latent Embedding Optimization (LEO)
- Authors: Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, Raia Hadsell
- Organization: DeepMind
- Conference: ICLR 2019
- Paper: https://arxiv.org/abs/1807.05960
How to Train Your MAML (MAML++)
- Authors: Antreas Antoniou, Harrison Edwards, Amos Storkey
- Organization: University of Edinburgh & OpenAI
- Conference: ICLR 2019
- Paper: https://arxiv.org/abs/1810.09502
- Code: https://github.com/AntreasAntoniou/HowToTrainYourMAMLPytorch
MAML
- A model-agnostic meta-learning algorithm (the only assumption is that the model is trained with gradient descent) that aims to find an initialization from which the model can be fine-tuned quickly on new tasks.
- Shows state-of-the-art results on few-shot image classification and regression, and fast fine-tuning for policy-gradient RL.
How does it work?

- Sample a batch of tasks (each with a few training examples and "virtual" test examples; the virtual test set is held out from the task's training data).
- For each task_{i}:
  - Compute the gradient of L(θ) on the training data and take an inner update θ -> θ'_{i}.
  - Compute L(θ'_{i}) on the virtual test data.
- Compute the gradient of Σ_{i} L(θ'_{i}) w.r.t. θ (backpropagating through the inner updates) and update θ.
- The loss can be cross-entropy for classification, MSE for regression, or negative expected reward for RL.
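The inner/outer loop above can be sketched in the simplest possible setting. This is an illustrative toy, not the paper's implementation: a scalar linear model f(x) = θx with MSE loss, where the gradient and Hessian are available in closed form, so the second-order meta-gradient (backpropagating through the inner update) can be written out explicitly. The function name, task format, and step sizes are all made up.

```python
import numpy as np

def maml_step(theta, tasks, alpha=0.1, beta=0.1):
    """One MAML meta-update for a scalar linear model f(x) = theta * x with
    MSE loss. Each task is ((x_train, y_train), (x_test, y_test)); alpha is
    the inner learning rate, beta the outer (meta) learning rate."""
    meta_grad = 0.0
    for (x_tr, y_tr), (x_te, y_te) in tasks:
        g_tr = 2 * np.mean(x_tr * (theta * x_tr - y_tr))    # dL_train/dtheta
        h_tr = 2 * np.mean(x_tr ** 2)                       # d^2 L_train/dtheta^2
        theta_i = theta - alpha * g_tr                      # inner update theta -> theta'_i
        g_te = 2 * np.mean(x_te * (theta_i * x_te - y_te))  # dL_test/dtheta'_i
        meta_grad += g_te * (1 - alpha * h_tr)              # chain rule through the inner step
    return theta - beta * meta_grad / len(tasks)
```

The factor (1 - alpha * h_tr) is exactly the second-derivative term that the first-order variant below drops.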
FOMAML
- They also tried ignoring the second-derivative terms: each θ'_{i} is treated as if it did not depend on θ, so the meta-update uses Σ_{i} ∇L(θ'_{i}) evaluated at the adapted parameters, with no backpropagation through the inner updates.
- This is denoted first-order MAML (FOMAML).
- It performs comparably to second-order MAML while saving substantial computation.
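In the same toy setting as before (a hypothetical scalar linear model with MSE loss; all names and step sizes are assumptions), the first-order variant simply omits the inner-update Jacobian:

```python
import numpy as np

def fomaml_step(theta, tasks, alpha=0.1, beta=0.1):
    """First-order MAML meta-update for a scalar linear model f(x) = theta*x
    with MSE loss: theta'_i is treated as a constant w.r.t. theta, so the
    inner-update Jacobian (1 - alpha * H) is simply dropped."""
    meta_grad = 0.0
    for (x_tr, y_tr), (x_te, y_te) in tasks:
        g_tr = 2 * np.mean(x_tr * (theta * x_tr - y_tr))    # inner gradient at theta
        theta_i = theta - alpha * g_tr                      # adapted parameters
        g_te = 2 * np.mean(x_te * (theta_i * x_te - y_te))  # test gradient at theta'_i
        meta_grad += g_te                                   # no second-derivative term
    return theta - beta * meta_grad / len(tasks)
```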
Reptile
- Introduces Reptile, a first-order meta-learning algorithm that repeatedly samples a task, trains on it for several SGD steps, and moves the initialization towards the weights trained on that task.
- Really worth reading, especially its analysis relating SGD and MAML!
How does it work?

- U^{k}_{T}(θ) denotes taking k SGD steps (with inner learning rate α) on the sampled task T, yielding adapted weights φ = U^{k}_{T}(θ); the outer update moves the initialization toward them: θ <- θ + ϵ(φ - θ), where ϵ is the outer step size.
- We can update in a batch version (n = number of tasks per batch): θ <- θ + ϵ · (1/n) Σ_{i=1}^{n} (φ_{i} - θ).
- If we take only k = 1 update, the Reptile update is θ <- θ - ϵα∇L_{T}(θ), which is just SGD on the expected loss over tasks (joint training).
- If we take k > 1 updates, it is not: the expected update then involves second- and higher-order derivatives of the loss, and Reptile converges to a qualitatively different solution.
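The batched update above can be sketched on a made-up scalar regression task (an illustrative toy, not the paper's code; the model f(x) = θx, the task format, and all step sizes are assumptions). Note that no train/test split is needed:

```python
import numpy as np

def reptile_step(theta, tasks, k=5, alpha=0.01, eps=0.1):
    """One batched Reptile meta-update for a scalar linear model f(x) = theta*x
    with MSE loss. Each task is a plain (x, y) pair; alpha is the inner SGD
    learning rate, eps the outer step size."""
    deltas = []
    for x, y in tasks:
        phi = theta
        for _ in range(k):                              # phi = U^k_T(theta)
            phi -= alpha * 2 * np.mean(x * (phi * x - y))
        deltas.append(phi - theta)                      # direction toward adapted weights
    return theta + eps * np.mean(deltas)                # move initialization toward them
```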
- An experiment on 5-shot 5-way Omniglot compares different combinations of the inner-loop gradients used as the outer update (figure from the paper omitted).
Why does it work?
- A Taylor-expansion analysis shows that the expected MAML and Reptile updates contain the same two leading-order terms:
- First: minimizing the expected loss (joint training on the different tasks).
- Second: maximizing within-task generalization, i.e., maximizing the inner product between gradients computed on different minibatches of the same task. If the gradients from two minibatches have a positive inner product, then taking a gradient step on one minibatch improves performance on the other.
- The leading-order expected meta-gradients for k = 2 inner steps (AvgGrad = gradient of the expected loss; AvgGradInner = gradient of the expected inner product between the two minibatch gradients, up to a constant factor):
  - E[g_MAML] = AvgGrad - 2α·AvgGradInner + O(α²)
  - E[g_FOMAML] = AvgGrad - α·AvgGradInner + O(α²)
  - E[g_Reptile] = 2·AvgGrad - α·AvgGradInner + O(α²)
- For general k, the weight on AvgGradInner relative to AvgGrad grows with the number of inner steps.
- This explains why k = 2 in the above experiment is still insufficient: with so few inner steps, the update puts less weight on the AvgGradInner (within-task generalization) term relative to the AvgGrad (joint-training) term.
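These coefficients can be sanity-checked numerically for k = 2 with quadratic minibatch losses, for which third derivatives vanish: the Reptile and FOMAML expansions then hold exactly, while full MAML leaves only an O(α²) residual. (Per sample, the MAML expansion uses both H̄₂ḡ₁ and H̄₁ḡ₂; these have the same expectation, AvgGradInner, giving the 2α coefficient.) All constants below are arbitrary illustrative values, not from the paper.

```python
import numpy as np

def expansion_check(phi0=0.3, a1=1.5, b1=0.0, a2=0.7, b2=1.0, alpha=0.01):
    """Compare exact two-step meta-gradients against their leading-order
    expansions for quadratic minibatch losses L_i(phi) = a_i/2*(phi - b_i)^2
    (gradient gbar_i = a_i*(phi - b_i), Hessian a_i)."""
    g1 = a1 * (phi0 - b1)                    # minibatch-1 gradient at phi0
    phi1 = phi0 - alpha * g1                 # one inner SGD step
    g2 = a2 * (phi1 - b2)                    # minibatch-2 gradient at phi1
    g_reptile = g1 + g2                      # Reptile direction: (phi0 - phi2)/alpha
    g_fomaml = g2                            # first-order MAML
    g_maml = (1 - alpha * a1) * g2           # full MAML: chain rule through phi1
    # Leading-order ingredients, all evaluated at phi0:
    gbar1, gbar2 = a1 * (phi0 - b1), a2 * (phi0 - b2)
    inner = a2 * gbar1                       # H2*gbar1 (one sample of AvgGradInner)
    inner_sym = a2 * gbar1 + a1 * gbar2      # symmetrized term appearing in MAML
    return {
        "reptile_err": g_reptile - (gbar1 + gbar2 - alpha * inner),
        "fomaml_err": g_fomaml - (gbar2 - alpha * inner),
        "maml_err": g_maml - (gbar2 - alpha * inner_sym),  # O(alpha^2) residual
        "alpha": alpha,
    }
```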