CVPR 2020 Tutorial Talk: Automated Hyperparameter and Architecture Tuning
Metadata
- Speaker: Cedric Archambeau
- Video: https://www.youtube.com/watch?v=pB1LmZWK_N8
- Website (more about HPO and NAS): https://hangzhang.org/CVPR2020/
- See more on my NAS survey: #55
Hyper-parameter Tuning as Global Optimization Problem
- Objective: define a metric to optimize, e.g., validation error.
- We usually do random search or grid search.
- They are easy to parallelize but computationally expensive, since the search space grows exponentially with the number of hyper-parameters (see the sketch after this list).
- Can we do better than random search and grid search for searching hyper-parameters?
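A tiny Python sketch of the budget difference (the grid values here are made up for illustration):

```python
import itertools
import random

# Hypothetical hyper-parameter grid, just for illustration.
grid = {
    "lr": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "weight_decay": [0.0, 1e-4, 1e-2],
}

# Grid search: 3^3 = 27 runs here, and the count grows exponentially
# with the number of hyper-parameters (3^6 = 729 for six of them).
grid_points = list(itertools.product(*grid.values()))
print(len(grid_points))  # 27

# Random search: a fixed budget of runs, independent of dimensionality.
random.seed(0)
random_points = [{k: random.choice(v) for k, v in grid.items()} for _ in range(10)]
print(len(random_points))  # 10
```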
Bayesian Optimization
- A probabilistic surrogate model of the objective.
- This surrogate model captures the uncertainty or confidence of the estimates.
- Performs an efficient search of the space by balancing exploration and exploitation. (Sounds like RL!)
- More illustration: assume we have a fixed initial budget that we randomly allocate across 4 slot machines. The returns of these 4 machines are illustrated in the figure above. If we only look at the mean return, we might exploit the 3rd machine. However, if we also look at the uncertainty σ, we might want to explore the 4th machine.
- Can be formulated in different ways, e.g., with Gaussian Processes as the surrogate model.
- See more on A Tutorial on Bayesian Optimization.
The idea of a surrogate model is also used in Efficient Progressive Neural Architecture Search.
(Canonical) Algorithm of Bayesian Optimization
- Function f is the objective (e.g., a validation metric).
- x is the hyper-parameter configuration we are optimizing over.
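A minimal sketch of this loop in plain Python (the toy objective `f` and the `suggest_next` placeholder are my own stand-ins, not from the talk; a real implementation plugs in the surrogate and acquisition function described below):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Stand-in objective, e.g. a validation error we want to minimize.
    return (x - 0.3) ** 2 + 0.1 * np.sin(20 * x)

def suggest_next(X, y):
    # Placeholder: real BO fits a probabilistic surrogate to (X, y) and
    # maximizes an acquisition function a(x|D) here (see the GP and EI
    # ingredients below). This random stand-in reduces the loop to
    # random search.
    return rng.uniform(0, 1)

# Start from a few random evaluations, then iterate: suggest, evaluate, update.
X = list(rng.uniform(0, 1, size=3))
y = [f(x) for x in X]
for _ in range(10):
    x_new = suggest_next(X, y)
    X.append(x_new)
    y.append(f(x_new))
print("best observed:", min(y))
```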
Bayesian Optimization with Gaussian Processes (GP) Surrogate Model
Ingredient 1: GP
- Primer: a Multivariate Gaussian(mean_vector m, covariance_matrix K) can be used for sampling finitely many random variables. A GP can be thought of as the generalization of the Multivariate Gaussian to infinitely many variables.
- The (prior) GP(mean_function m(x), covariance_function k(x, .)) can be used for sampling random functions (or infinite random variables).
- The posterior GP(μ(x), σ(x)^2) can be computed in closed form after receiving new observations (x, y).
- Once the posterior GP is computed, we can draw functions from it.
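A minimal numpy sketch of the closed-form posterior, assuming a zero prior mean and an RBF kernel (the kernel and its length-scale are my choices for illustration):

```python
import numpy as np

def rbf(a, b, length_scale=0.2):
    # k(x, x') = exp(-(x - x')^2 / (2 * l^2)) for 1-D inputs.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(X_train, y_train, X_test, noise=1e-6):
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf(X_train, X_test)
    K_ss = rbf(X_test, X_test)
    K_inv = np.linalg.inv(K)
    mu = K_s.T @ K_inv @ y_train        # posterior mean mu(x)
    cov = K_ss - K_s.T @ K_inv @ K_s    # posterior covariance
    return mu, cov

X_train = np.array([0.1, 0.4, 0.9])
y_train = np.sin(2 * np.pi * X_train)
X_test = np.linspace(0, 1, 50)
mu, cov = gp_posterior(X_train, y_train, X_test)
sigma = np.sqrt(np.clip(np.diag(cov), 0, None))  # sigma(x)

# "Drawing functions" from the posterior GP = sampling from N(mu, cov).
samples = np.random.default_rng(0).multivariate_normal(mu, cov, size=3)
```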
Ingredient 2: Acquisition Function
- The acquisition function selects the next hyper-parameter candidate x_new for us: it scores the value (like a value function in RL) of a given candidate.
- The value of x, a(x|D) (where D is the set of observations), is the expectation of a utility function of evaluations, a(x|D) = E_f[A(f, x)] (recall that f is sampled from the posterior GP, so it is stochastic).
- The expectation has a closed form for common utilities, and we can use off-the-shelf solvers to maximize the resulting acquisition function.
- The acquisition function makes the exploration and exploitation trade-off.
- Two families of acquisition functions: improvement-based (e.g., PI, EI, UCB) and entropy-based (e.g., ES, PES, MES).
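As a concrete instance, a minimal sketch of Expected Improvement (EI) for minimization, assuming the Gaussian posterior μ(x), σ(x) from the GP sketch above (the exploration margin xi is a common but optional extra, not from the talk):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best, xi=0.01):
    # EI(x) = E[max(y_best - f(x) - xi, 0)] with f(x) ~ N(mu, sigma^2);
    # the closed form below is the standard result for this utility.
    sigma = np.maximum(sigma, 1e-12)  # avoid division by zero
    z = (y_best - mu - xi) / sigma
    return (y_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Pick the candidate with the highest EI, e.g. over the GP grid above:
# x_new = X_test[np.argmax(expected_improvement(mu, sigma, y_train.min()))]
```

Maximizing EI over the candidates then yields x_new for the next round of the BO loop.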
Other Surrogate Models instead of GP?
- SMAC: Random Forests.
- TPE: Tree-structured Parzen Estimator.
Multi-fidelity Optimization
- Bayesian optimization is (surrogate) model-based and sequential, whereas multi-fidelity optimization is more randomness-based and parallel.
- Can we use cheap approximations of the objective f(x)?
- Successive halving: a fixed early-stopping schedule (at each step, stop the bottom-50% performing models and let the top-50% performing models continue).
- Hyperband (HB): A variant of successive halving where we try different early-stopping schedules (each group shares the same computational resource level, or fidelity, r).
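A minimal sketch of successive halving (`train_and_eval` is a hypothetical stand-in; Hyperband would rerun this loop with several different initial budgets and candidate counts):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_and_eval(config, budget):
    # Stand-in for training with `config` for `budget` units of compute
    # and returning a validation error (noisier at low fidelity).
    return (config - 0.3) ** 2 + rng.normal(scale=0.5 / budget)

configs = list(rng.uniform(0, 1, size=16))  # 16 random candidates
budget = 1
while len(configs) > 1:
    scores = [train_and_eval(c, budget) for c in configs]
    keep = np.argsort(scores)[: len(configs) // 2]  # top-50% (lowest error)
    configs = [configs[i] for i in keep]
    budget *= 2                                     # survivors get more budget
print("winner:", configs[0])
```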
Model-based Multi-fidelity Optimization
- Hyperband still samples the hyper-parameter candidates uniformly; can we combine it with model-based (Bayesian) optimization? Yes.
- BO-HB: TPE-based Hyperband for each fidelity r.
Asynchronous Extensions
- The above methods have their asynchronous extensions, e.g., A-BO, A-RS, A-BOHB, A-HB,....
NAS
- NAS-Bench-201 (ICLR'20 Spotlight): contains 15,625 architectures evaluated on 3 different image classification datasets.