
Model Agnostic Meta Learning Algorithms

howardyclo opened this issue · 2 comments

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML)

  • Authors: Chelsea Finn, Pieter Abbeel, Sergey Levine
  • Organization: UC Berkeley & OpenAI
  • Conference: ICML 2017
  • Paper: https://arxiv.org/abs/1703.03400
  • Code: https://github.com/cbfinn/maml

On First-Order Meta-Learning Algorithms (Reptile)

  • Authors: Alex Nichol, Joshua Achiam, John Schulman
  • Organization: OpenAI
  • Paper: https://arxiv.org/abs/1803.02999

Probabilistic Model-Agnostic Meta-Learning

  • Authors: Chelsea Finn, Kelvin Xu, Sergey Levine
  • Organization: UC Berkeley & OpenAI
  • Conference: NeurIPS 2018
  • Paper: https://arxiv.org/abs/1806.02817

Bayesian Model-Agnostic Meta-Learning

  • Authors: Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, Sungjin Ahn
  • Organization: Element AI, MILA, SAP, Kakao Brain, CIFAR Senior Fellow, Rutgers University
  • Conference: NeurIPS 2018
  • Paper: https://arxiv.org/abs/1806.03836

Meta-Learning with Latent Embedding Optimization (LEO)

  • Authors: Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero & Raia Hadsell
  • Organization: DeepMind
  • Conference: ICLR 2019
  • Paper: https://arxiv.org/abs/1807.05960

How to Train Your MAML (MAML++)

  • Authors: Antreas Antoniou, Harrison Edwards, Amos Storkey
  • Organization: University of Edinburgh & OpenAI
  • Conference: ICLR 2019
  • Paper: https://arxiv.org/abs/1810.09502
  • Code: https://github.com/AntreasAntoniou/HowToTrainYourMAMLPytorch

howardyclo · May 11 '19 07:05

MAML

  • A model-agnostic meta-learning algorithm (it only assumes the model is trained with gradient descent) that aims to find an initialization θ from which the model can be fine-tuned quickly on new tasks.
  • Shows state-of-the-art results on few-shot image classification and regression, and fast fine-tuning for policy-gradient RL.

How does it work?

  • Sample a batch of tasks, each with a few training examples and a held-out "virtual" test set (the "virtual" test set is constructed from the task's own data).
  • For each task_{i}:
    • Compute the gradient of L(θ) on the training data and take an inner update θ -> θ'_{i}.
    • Compute L(θ'_{i}) on the virtual test data.
  • Compute the gradient of Σ_{i} L(θ'_{i}) w.r.t. θ (differentiating through the inner update) and update θ.
  • The loss can be cross-entropy for classification, MSE for regression, or negative expected reward for RL.
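The loop above can be sketched end-to-end on a toy problem. This is a minimal numpy sketch, not the authors' code: the task distribution (regress y = a·x for a random slope a), the scalar model, and the learning rates alpha/beta are all hypothetical choices made for illustration. Because the model is linear and the loss quadratic, the gradient through the inner update can be written in closed form, making the second-order term explicit.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.01, 0.005   # inner / outer learning rates (hypothetical values)
theta = 0.0                 # meta-initialization of the scalar model y = theta * x

def grad(theta, x, y):
    # dL/dtheta for L(theta) = mean((theta*x - y)^2)
    return 2.0 * np.mean(x * (theta * x - y))

for step in range(2000):
    meta_grad = 0.0
    for _ in range(4):                       # batch of sampled tasks
        a = rng.uniform(0.5, 1.5)            # task_i: regress y = a * x
        x_tr, x_te = rng.normal(size=10), rng.normal(size=10)
        y_tr, y_te = a * x_tr, a * x_te
        # inner update: theta -> theta'_i on the task's training data
        theta_i = theta - alpha * grad(theta, x_tr, y_tr)
        # outer gradient: d L_test(theta'_i) / d theta via the chain rule;
        # the (1 - alpha * d^2 L_train / dtheta^2) factor is the second-order term
        hess_tr = 2.0 * np.mean(x_tr ** 2)
        meta_grad += grad(theta_i, x_te, y_te) * (1.0 - alpha * hess_tr)
    theta -= beta * meta_grad / 4            # outer update on theta
```

After meta-training, theta sits near the center of the slope range, i.e. an initialization from which one inner step adapts well to any sampled task.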

FOMAML

  • They also tried ignoring the second-derivative terms: update θ using Σ_{i} ∇L(θ'_{i}), treating each adapted θ'_{i} as if it did not depend on θ.
  • This is denoted first-order MAML (FOMAML).
  • It performs comparably to second-order MAML while saving substantial computation.
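The two meta-gradients can be compared directly on the same toy setup (a scalar linear model on a hypothetical task y = a·x; the values of alpha, theta, and a below are illustrative, not from the paper). For quadratic losses the second-order correction is a simple scalar factor, so dropping it is easy to see:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, theta = 0.01, 0.3       # inner learning rate and current init (hypothetical)
a = 1.2                        # sampled task: regress y = a * x

def grad(theta, x, y):
    # dL/dtheta for L(theta) = mean((theta*x - y)^2)
    return 2.0 * np.mean(x * (theta * x - y))

x_tr, x_te = rng.normal(size=50), rng.normal(size=50)
y_tr, y_te = a * x_tr, a * x_te

theta_i = theta - alpha * grad(theta, x_tr, y_tr)   # inner update
g_outer = grad(theta_i, x_te, y_te)                 # test-loss gradient at theta'_i

# Full MAML keeps the second-order factor from differentiating through theta'_i;
# FOMAML simply drops it and uses the test gradient as-is.
maml_grad = g_outer * (1.0 - alpha * 2.0 * np.mean(x_tr ** 2))
fomaml_grad = g_outer
```

With a small inner learning rate the two gradients differ only by a factor of order alpha, which is why FOMAML stays competitive in practice.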

howardyclo · May 11 '19 09:05

Reptile

  • Introduces a new first-order meta-learning algorithm, Reptile, which repeatedly samples a task, trains on it for several steps, and moves the initialization toward the weights trained on that task.
  • Really worth reading, especially for its analysis relating SGD and MAML.

How does it work?

  • Let U^{k}_{T}(θ) denote the result of taking k gradient updates on the sampled task T. The Reptile update is θ ← θ + ϵ (U^{k}_{T}(θ) − θ), where ϵ is the outer step size.
  • We can also update in a batched version over n sampled tasks: θ ← θ + ϵ · (1/n) Σ_{i=1}^{n} (U^{k}_{T_i}(θ) − θ).
  • If we take only k = 1 update, this is essentially SGD on the expected loss (joint training).
  • If we take k > 1 updates, it is not: the expected update is no longer the gradient of the expected loss, and the algorithm converges to a different point that depends on second- and higher-order derivatives.
  • An experiment on 5-shot 5-way Omniglot compares different inner-loop gradient combinations (figure omitted here).
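The update above is short enough to sketch in full. This is a toy numpy illustration, not the authors' code: the task family (regress y = a·x for a random slope a), the scalar model, and the values of alpha, eps, and k are all hypothetical. Note there is no train/test split and no differentiation through the inner loop — just k SGD steps and an interpolation toward the adapted weights.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, eps, k = 0.02, 0.1, 5   # inner lr, outer step size, inner steps (hypothetical)
theta = 0.0                    # initialization of the scalar model y = theta * x

def grad(theta, x, y):
    # dL/dtheta for L(theta) = mean((theta*x - y)^2)
    return 2.0 * np.mean(x * (theta * x - y))

for step in range(1000):
    a = rng.uniform(0.5, 1.5)           # sample a task T: regress y = a * x
    x = rng.normal(size=10)
    y = a * x
    w = theta
    for _ in range(k):                   # U^k_T: k gradient steps on T's data
        w = w - alpha * grad(w, x, y)
    theta = theta + eps * (w - theta)    # move the init toward the adapted weights
```

With k = 1 this collapses to plain SGD on the mixed task distribution; with k > 1 the adapted weights w carry higher-order information, and theta converges to an initialization that adapts quickly across tasks.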

Why does it work?

  • Through Taylor expansion, the expected updates of both MAML and Reptile contain the same two leading-order terms:
    • First (AvgGrad): minimizes the expected loss (joint training across tasks).
    • Second (AvgGradInner): maximizes within-task generalization, i.e., the inner product between gradients of different minibatches from the same task. When gradients from different batches have a positive inner product, taking a gradient step on one batch improves performance on the other.
  • The paper gives the resulting Taylor expansions of the SGD, FOMAML, and MAML updates (for i in [1, k]); the algorithms differ only in the coefficients on these two terms (equation omitted here).
  • This explains why k = 2 in the experiment above is still insufficient: with small k, Reptile puts less weight on the inner-product term relative to the AvgGrad term.
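The k = 2 case makes the coefficients explicit. Reconstructed here from the Reptile paper's analysis (Section 5.1; not shown in the original note), with ḡ_i and H̄_i the gradient and Hessian of minibatch i evaluated at the initial point, AvgGrad = E[ḡ_i], and AvgGradInner = E[H̄_2 ḡ_1]:

```latex
% Leading-order expected gradients for k = 2 (Reptile paper, Sec. 5.1):
% AvgGrad drives joint training; AvgGradInner rewards within-task gradient alignment.
\begin{aligned}
\mathbb{E}[g_{\mathrm{MAML}}]    &= \mathrm{AvgGrad} - 2\alpha\,\mathrm{AvgGradInner} + O(\alpha^2) \\
\mathbb{E}[g_{\mathrm{FOMAML}}]  &= \mathrm{AvgGrad} - \alpha\,\mathrm{AvgGradInner} + O(\alpha^2) \\
\mathbb{E}[g_{\mathrm{Reptile}}] &= 2\,\mathrm{AvgGrad} - \alpha\,\mathrm{AvgGradInner} + O(\alpha^2)
\end{aligned}
```

The ratio of the AvgGradInner weight to the AvgGrad weight is 2α for MAML, α for FOMAML, and α/2 for Reptile, which is why Reptile with only k = 2 under-weights the within-task generalization term.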

howardyclo · May 11 '19 11:05