A Recipe for Training Neural Networks
Metadata
- Author: Andrej Karpathy
- Blog post: http://karpathy.github.io/2019/04/25/recipe/
- TL;DR: a shorter set of notes (the parts useful to me) from Karpathy's blog post ;-)
Spend Time to Understand Data
- Understand the distribution and patterns.
- Look for data imbalances and biases.
- Examples:
- Are very local features enough or do we need global context?
- How much variation is there and what form does it take?
- What variation is spurious and could be preprocessed out?
- Does spatial position matter or do we want to average pool it out?
- How much does detail matter and how far could we afford to downsample the images?
- How noisy are the labels?
- Visualize the statistics and the outliers along any axis.
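A first pass over the data can be sketched as below. This is a hypothetical helper (the names `class_balance` and `flag_outliers` are mine, not from the post): one function to surface class imbalance, one to flag outliers along some per-example statistic via z-score.

```python
from collections import Counter
from statistics import mean, stdev

def class_balance(labels):
    """Return per-class fractions so imbalance is visible at a glance."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}

def flag_outliers(values, z_thresh=2.0):
    """Return indices of values whose z-score exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values)
            if sigma > 0 and abs(v - mu) / sigma > z_thresh]

# Toy examples: label counts for a 4-example dataset, and a per-example
# statistic (e.g., object count per image) with one obvious outlier.
balance = class_balance(["cat", "cat", "cat", "dog"])   # {'cat': 0.75, 'dog': 0.25}
outliers = flag_outliers([5, 6, 5, 7, 6, 5, 100], z_thresh=2.0)
```

In practice you would do the same thing per axis (image size, label frequency, annotation quality) and actually look at the flagged examples, not just count them.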
Setup Training/Evaluation and Start from Simple Model
- Establish baselines and visualize the train/eval metrics.
- Fix the random seed and run the code twice to ensure you get identical results.
- Disable any unnecessary fanciness, e.g., data augmentation.
- Plot the test loss over the entire test set instead of only over individual batches.
- Ensure the loss starts at the right value, e.g., -log(1/n_classes) for a softmax classifier.
- Visualize a fixed test batch to see the "dynamics" of how the model learns (be aware of very low or very high learning rates).
- Be aware of view vs. transpose/permute.
- Write simple code that works first and refactor it into a more generalizable version later.
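The initial-loss sanity check above can be verified numerically: with near-zero initial logits a softmax classifier assigns every class probability 1/n_classes, so cross-entropy should start at -log(1/n_classes). A minimal pure-Python sketch (the helper name is mine):

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross-entropy loss for one example given raw logits (log-sum-exp trick)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

n_classes = 10
# Near-zero initial logits -> uniform predicted distribution.
uniform_logits = [0.0] * n_classes
loss = softmax_cross_entropy(uniform_logits, target=3)
expected = -math.log(1.0 / n_classes)  # = log(n_classes) ≈ 2.3026 for 10 classes
```

If the first measured training loss is far from this value, the init (or the loss wiring) is likely wrong.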
Overfit
- Follow the most related paper and start from their simplest architecture that achieves good performance. Do not customize things at an early stage.
- The Adam optimizer with a learning rate of 3e-4 is a safe default.
- If you have multiple signals, plug them into the model one by one to ensure you get the performance boost you'd expect.
- Be careful with learning rate decay (different dataset size/problem requires different learning rate decay schedule). Disable learning rate decay first and tune this later.
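To make the "Adam at 3e-4, no decay" default concrete, here is a single-parameter Adam update written out in pure Python; this is an illustrative sketch, in real code you would just use your framework's Adam (e.g., `torch.optim.Adam(model.parameters(), lr=3e-4)`) with the scheduler disabled.

```python
import math

def adam_minimize(grad_fn, x0, lr=3e-4, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=20000):
    """Minimize a scalar function via Adam, starting from x0."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # first-moment (momentum) estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias corrections
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Minimize f(x) = x^2 (gradient 2x) from x = 1.0; converges close to 0.
x_min = adam_minimize(lambda x: 2 * x, 1.0)
```

Note that the effective step size is roughly `lr` regardless of gradient scale, which is part of why a single default like 3e-4 travels well across problems.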
Regularize
- Don't spend a lot of engineering effort squeezing juice out of a small dataset when you could instead be collecting more data.
- Data augmentation.
- Creative augmentation: domain randomization, use of simulation, clever hybrids such as inserting (potentially simulated) data into scenes, or even GANs.
- Pretraining.
- Stick with supervised learning.
- Use smaller input dimensionality (e.g., try smaller input images).
- Decrease the batch size (small batch = stronger regularization).
- Add dropout (dropout2d for CNNs).
- Weight decay.
- Try a larger model (its early-stopped performance is often better than that of smaller models).
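The dropout item above can be sketched in a few lines. This is inverted dropout written in pure Python for illustration (in a real CNN you'd use your framework's `Dropout`/`Dropout2d` layers): units are zeroed with probability p at train time and survivors are scaled by 1/(1-p), so the expected activation is unchanged and eval time is a no-op.

```python
import random

def dropout(activations, p=0.5, train=True, rng=random):
    """Inverted dropout: zero units with prob p, scale survivors by 1/(1-p)."""
    if not train or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0
            for a in activations]

train_out = dropout([1.0, 1.0, 1.0, 1.0], p=0.5)  # each unit is 0.0 or 2.0
eval_out = dropout([1.0, 2.0], train=False)        # identity at eval time
```

The 2D ("spatial") variant mentioned above drops whole feature maps rather than individual units, which matters for convolutional activations where neighboring values are strongly correlated.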
Tuning Hyperparameters
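The original post recommends random search over grid search here, since neural nets are often far more sensitive to some hyperparameters than others. A minimal sketch (the helper name and parameter ranges are illustrative assumptions, not from the post), sampling scale-type parameters log-uniformly:

```python
import random

def sample_config(rng):
    """Sample one hyperparameter configuration for random search;
    log-uniform for scale-type parameters like the learning rate."""
    return {
        "lr": 10 ** rng.uniform(-5, -2),             # log-uniform in ~[1e-5, 1e-2]
        "weight_decay": 10 ** rng.uniform(-6, -3),
        "dropout": rng.uniform(0.0, 0.5),
    }

rng = random.Random(0)
trials = [sample_config(rng) for _ in range(30)]
# Train/evaluate each config and keep the best; unlike a grid, every
# trial probes a fresh value of each parameter.
```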
Squeeze Out the Juice
- Ensembles or "Distilling the Knowledge in a Neural Network".
- Leave it training even when the validation loss seems to be leveling off.
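The simplest form of the ensemble idea above is averaging per-class probabilities across models; distillation then trains one small model on those averaged "soft targets". A minimal sketch (function name is mine):

```python
def ensemble_predict(prob_lists):
    """Average per-class probabilities from several models' predictions
    for one example (a simple ensemble)."""
    n = len(prob_lists)
    return [sum(ps) / n for ps in zip(*prob_lists)]

# Two hypothetical models' class probabilities for the same example:
avg = ensemble_predict([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])  # ≈ [0.6, 0.25, 0.15]
```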