
A Recipe for Training Neural Networks


Metadata

  • Author: Andrej Karpathy
  • Blog post: http://karpathy.github.io/2019/04/25/recipe/
  • TL;DR: Shorter notes (the parts useful to me) on Karpathy's blog post ;-)


Spend Time to Understand Data

  1. Understand the distribution and patterns.
  2. Look for data imbalances and biases.
  3. Examples:
    • Are very local features enough or do we need global context?
    • How much variation is there and what form does it take?
    • What variation is spurious and could be preprocessed out?
    • Does spatial position matter or do we want to average pool it out?
    • How much does detail matter and how far could we afford to downsample the images?
    • How noisy are the labels?
  4. Visualize the statistics and the outliers along any axis (see the sketch after this list).
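
A minimal sketch of that last point, assuming a hypothetical NumPy feature matrix `X` and integer label array `y` (the `.npy` filenames are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: X is (n_samples, n_features), y holds integer class labels.
X = np.load("features.npy")
y = np.load("labels.npy")

# Label distribution: spot class imbalance at a glance.
classes, counts = np.unique(y, return_counts=True)
plt.bar(classes, counts)
plt.title("Label distribution")
plt.show()

# Per-feature z-scores: flag samples that sit far from the bulk along any axis.
z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
outlier_rows = np.where(np.abs(z).max(axis=1) > 5)[0]
print(f"{len(outlier_rows)} samples have a feature more than 5 std devs from the mean")
```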

Setup Training/Evaluation and Start from Simple Model

  • Establish baselines and visualize the train/eval metrics.
  • Fix the random seed and run the code twice to confirm you get identical results (see the sketch after this list).
  • Disable any unnecessary fanciness, e.g., data augmentation.
  • Plot the test loss over the entire test set, not just over individual batches.
  • Verify the loss starts at the right value, e.g., -log(1/n_classes) for a softmax classifier.
  • Visualize a fixed test batch during training to see the "dynamics" of how the model learns (watch out for very low or very high learning rates).
  • Be careful with view vs. transpose/permute (they are not interchangeable).
  • Write simple code that works first; refactor into a more general version later.
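
A minimal PyTorch sketch of three of these checks: seeding, the expected initial loss, and the view/permute gotcha. The tensors below are toy placeholders, not a real model.

```python
import random
import numpy as np
import torch

# 1. Fix every random seed so two runs of the same code give identical results.
def set_seed(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(0)

# 2. Sanity-check the initial loss: with a softmax over n_classes and an
#    uninformative model, cross-entropy should start near -log(1/n_classes).
n_classes = 10
logits = torch.zeros(8, n_classes)            # stand-in for an untrained model's output
targets = torch.randint(0, n_classes, (8,))
loss = torch.nn.functional.cross_entropy(logits, targets)
print(loss.item(), "vs expected", np.log(n_classes))

# 3. view vs. permute: permute actually reorders axes, view only reinterprets
#    the same memory order, so the results differ.
x = torch.arange(6).reshape(2, 3)
print(x.permute(1, 0))   # a real transpose: rows become columns
print(x.view(3, 2))      # same memory order, different shape -- NOT a transpose
```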

Overfit

  • Follow the most related paper and try its simplest architecture that achieves good performance. Do not start customizing things at an early stage.
  • Adam with a learning rate of 3e-4 is a safe default (see the sketch after this list).
  • If you have multiple input signals, plug them into the model one at a time to confirm each gives the performance boost you expect.
  • Be careful with learning rate decay (different dataset sizes/problems require different decay schedules). Disable learning rate decay at first and tune it later.
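
A minimal sketch of this starting point, with a toy model and batch standing in for the real ones: Adam at 3e-4, no learning-rate decay yet, and driving the loss on a single small batch toward zero as an overfitting sanity check.

```python
import torch

# Placeholder model and a single small batch (shapes are illustrative only).
model = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
)
xb = torch.randn(16, 32)
yb = torch.randint(0, 10, (16,))

# Adam at 3e-4, no scheduler -- add and tune learning rate decay only later.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# Sanity check: the model should be able to drive the loss on one tiny batch
# close to zero; if it cannot, something upstream is broken.
for step in range(500):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(xb), yb)
    loss.backward()
    optimizer.step()
print("final single-batch loss:", loss.item())
```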

Regularize

  • Don't spend a lot of engineering costs to squeeze juice out of a small dataset when you could instead be collecting more data.
  • Data augmentation (see the sketch after this list).
  • Creative augmentation: domain randomization, use of simulation, clever hybrids such as inserting (potentially simulated) data into scenes, or even GANs.
  • Pretraining.
  • Stick with supervised learning.
  • Smaller input dimensionality (e.g., try smaller input images).
  • Decrease the batch size (small batch = stronger regularization).
  • Add dropout (dropout2d for CNNs).
  • Weight decay.
  • Try a larger model (its early-stopped performance may beat a smaller model's).
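
A minimal PyTorch sketch of a few of these knobs: standard data augmentation via torchvision transforms, Dropout2d in a small CNN, and weight decay on the optimizer (the architecture and values are illustrative, not tuned):

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Data augmentation: random crops and flips are a cheap, broadly useful start.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Small CNN with Dropout2d (channel-wise dropout, the variant suited to conv features).
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Dropout2d(p=0.25),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 10),
)

# Weight decay applied through the optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4)
```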

Tuning Hyperparameters

Squeeze Out the Juice
