
Add more optimisers

Open MichaelClerx opened this issue 6 years ago • 9 comments

Creating a big ticket using the list from/for the paper, closing smaller issues.

Ultimately, this should be closed in favour of much more specific tickets ("implement method x")

Optimisers

Semi-organised list of optimisation methods.

Italicised bits are my attempt at a one-line summary

References should be given where possible.

Things of interest:

  • Constrained or unconstrained (or (un)constrained with caveats)
    • Sometimes either one is required
  • Derivatives used:
    • Direct search: Sample f(x), but don't try to approximate or use a gradient
  • Smoothness requirements
    • Should f be differentiable? Twice-differentiable?
  • Heuristic vs comes-with-proof-of-convergence
    • Note though: Convergence doesn't mean the answer is the global best
    • So maybe just ignore this?

Derivative-free Direct methods (aka Search methods)

Completely ignore the idea of f having a derivative

Brute-force exploration

Look at the landscape and pick the lowest point

  • Constrained
  • Uniform sampling
  • Latin hypercube sampling
  • Should we generalise this as sampling-based optimisation?
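
A minimal sketch of the brute-force sampling idea above, using Latin hypercube sampling from SciPy's `scipy.stats.qmc` module; the objective `f` and the bounds are placeholders, not anything from Pints:

```python
# Brute-force/Latin-hypercube sketch: sample the bounded space, evaluate f
# everywhere, and keep the lowest point found.
import numpy as np
from scipy.stats import qmc

def brute_force_minimise(f, lower, upper, n_samples=1000, seed=0):
    sampler = qmc.LatinHypercube(d=len(lower), seed=seed)
    unit = sampler.random(n_samples)          # samples in [0, 1)^d
    points = qmc.scale(unit, lower, upper)    # rescale to the bounds
    scores = np.array([f(x) for x in points])
    best = np.argmin(scores)
    return points[best], scores[best]

# Example: a 2-d quadratic over [-5, 5]^2
f = lambda x: np.sum((x - 1.0) ** 2)
x_best, f_best = brute_force_minimise(f, [-5, -5], [5, 5])
```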

Random search

  • Random search (RS)

    • Sample points on a hypersphere around current, jump if better
    • Fixed step size random search (FSSRS)
    • Optimum step size random search (OSSRS) Schumer and Steiglitz
    • Adaptive step size random search (ASSRS) Schumer and Steiglitz
    • Optimized relative step size random search (ORSSRS) Schrack and Choit
  • Random optimization

    • Sample points from a distribution, jump if better
    • Usually a normal distribution
    • Luus-Jaakola: uniform distribution
  • Simulated annealing

    • First paper: https://doi.org/10.1126/science.220.4598.671
    • Lots of versions. One is: *jump to nearby points with a probability dependent on f; jump almost always at the start (high temperature), but with a likelihood that decreases every iteration (cooling)* (a minimal sketch follows this list)
    • Global method
  • Stochastic tunneling (STUN)

    • Search using a Metropolis criterion for random jumps, and a transformation of f intended to let it 'tunnel' between minima
    • Global method
  • Parallel tempering

    • Like simulated annealing but randomly hop between temperatures
    • Global method
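
A minimal simulated-annealing sketch of the version summarised above; all names and parameters are illustrative, not an existing Pints API:

```python
# Simulated annealing sketch: propose a nearby point, always accept
# improvements, accept worse points with a probability that shrinks as
# the temperature is cooled.
import numpy as np

def simulated_annealing(f, x0, step=0.5, t0=1.0, cooling=0.99, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    x, fx, t = np.asarray(x0, dtype=float), f(x0), t0
    best_x, best_f = x.copy(), fx
    for _ in range(n_iter):
        y = x + step * rng.standard_normal(x.shape)    # nearby proposal
        fy = f(y)
        if fy < fx or rng.random() < np.exp(-(fy - fx) / t):
            x, fx = y, fy                              # accept the jump
            if fx < best_f:
                best_x, best_f = x.copy(), fx
        t *= cooling                                   # cool every iteration
    return best_x, best_f
```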

Coordinate search (aka Coordinate descent, aka Compass search)

Evaluate f at nearby points at distance d in each direction; if a better point is found, go there and end the iteration; if no better point is found, decrease d (a minimal sketch follows the list below).

  • See book: "Introduction to derivative-free optimization"
  • Type of "hill climbing" algorithm
  • Adaptive coordinate descent
    • Perform CS, but adapt (transform) the coordinate vectors to obtain maximum decorrelation along coordinates
  • Stochastic coordinate descent (aka stochastic hill climbing)
    • Sometimes go the wrong way
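
A minimal compass/coordinate-search sketch of the description above; the function and parameter names are illustrative only:

```python
# Compass search sketch: probe +/- d along each coordinate, move to the
# first improvement found, and shrink d when no probe improves on the
# current point.
import numpy as np

def compass_search(f, x0, d=1.0, shrink=0.5, d_min=1e-8):
    x, fx = np.asarray(x0, dtype=float), f(x0)
    while d > d_min:
        improved = False
        for i in range(len(x)):
            for sign in (+1.0, -1.0):
                y = x.copy()
                y[i] += sign * d
                fy = f(y)
                if fy < fx:
                    x, fx, improved = y, fy, True
                    break
            if improved:
                break
        if not improved:
            d *= shrink    # no better neighbour: decrease the step size
    return x, fx
```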

Search & poll / Pattern search

  • Search: Evaluate f at a finite number of points, jump to lower if found
  • Poll: If no lower f found, perform local search, with some iteratively decreasing distance parameter
  • Fancy versions can use e.g. surrogate models
  • Should converge to a mode, provided f is smooth
  • Rosenbrock's method (1960) https://doi.org/10.1093/comjnl/3.3.175
  • Audet & Dennis: Pattern search filter methods (2004)
    • Generalized pattern search (GPS)
    • Generating set search (GSS)
    • Mesh adaptive direct search (MADS)

Simplex methods

  • Simplex = extension of line segment / triangle / tetrahedron to any dimension
  • ~Nelder-Mead (aka downhill simplex)~
    • Maintain a simplex; at each iteration replace its worst point with a new point determined by the positions of the other points
    • Many extensions to avoid getting stuck
    • Unconstrained
  • Subplex: Nelder-Mead on "a sequence of subspaces"
    • https://dl.acm.org/citation.cfm?id=100816

Tabu search

Perform a local search (e.g. random or coordinate search), but allow moves to a worse f when stuck at an optimum; remember previously visited points and disallow revisiting them (mark them tabu).
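
A minimal tabu-search sketch of this idea on a fixed grid; all names are illustrative, and the tabu list here simply never expires:

```python
# Tabu search sketch: always move to the best non-tabu neighbour (even if
# it is worse), remember visited points, and track the best point seen.
import numpy as np

def tabu_search(f, x0, step=0.5, n_iter=500):
    x = np.asarray(x0, dtype=float)
    tabu = {tuple(np.round(x / step).astype(int))}
    best_x, best_f = x.copy(), f(x)
    for _ in range(n_iter):
        # Neighbours: +/- step along each coordinate, excluding tabu points
        candidates = []
        for i in range(len(x)):
            for sign in (+1.0, -1.0):
                y = x.copy()
                y[i] += sign * step
                if tuple(np.round(y / step).astype(int)) not in tabu:
                    candidates.append(y)
        if not candidates:
            break
        x = min(candidates, key=f)          # best neighbour, even if worse
        tabu.add(tuple(np.round(x / step).astype(int)))
        if f(x) < best_f:
            best_x, best_f = x.copy(), f(x)
    return best_x, best_f
```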

Dividing rectangles algorithm (DIRECT)

Partitions the (bounded) search space into rectangles, and works out where the optimum could be for any value of the Lipschitz constant (which says how fast a function varies with its input), then subdivides all rectangles where it could possibly be (a usage sketch follows the list below)

  • 1993
  • Global method
  • requires boundaries
  • https://doi.org/10.1007/BF00941892
  • Locally biased DIRECT (DIRECT-L)
    • https://doi.org/10.1023/A:1017930332101
    • 2001
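
A usage sketch, assuming a SciPy version recent enough to ship scipy.optimize.direct; the objective and bounds are placeholders:

```python
# DIRECT via SciPy (locally_biased=True would give DIRECT-L instead).
from scipy.optimize import direct, Bounds

def rosenbrock(x):
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

result = direct(rosenbrock, Bounds([-5.0, -5.0], [5.0, 5.0]), locally_biased=False)
print(result.x, result.fun)
```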

Powell's conjugate direction method

  • Brent's principal-axis method (PRAXIS)
    • https://people.sc.fsu.edu/~jburkardt/py_src/praxis/praxis.html
    • Book: Richard Brent, Algorithms for minimization without derivatives, 1972
  • Golden-section line search
    • Define a bracket around the optimum and then repeatedly shrink it using golden-ratio cuts (a minimal sketch follows this list)
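
A minimal golden-section line-search sketch; the names are illustrative only:

```python
# Golden-section search: keep shrinking a bracket [a, b] that contains the
# minimum, cutting it at golden-ratio points.
import math

def golden_section(f, a, b, tol=1e-8):
    invphi = (math.sqrt(5.0) - 1.0) / 2.0    # 1/phi ~= 0.618
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return 0.5 * (a + b)

# Example: the minimum of (x - 2)^2 on [0, 5]
x_min = golden_section(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
```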

Multi-start methods

Start from several points to avoid local optima

Multi-level single-linkage (MLSL)

Multi-start from points placed uniformly and densely in the search space should find all optima, but is expensive; so use clustering methods to try to identify clusters of starting points that share the same attractor (a plain multi-start sketch follows the list below)

  • https://doi.org/10.1007/BF02592070
  • 1987
  • Requires boundaries
  • Global method
  • Hybrid method: Can use any local optimiser inside!
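
A plain multi-start sketch (the scaffold only, not the MLSL clustering step), using SciPy's Nelder-Mead as the local optimiser; names are illustrative:

```python
# Multi-start: run a local optimiser from several uniform random starting
# points within the bounds and keep the best result.
import numpy as np
from scipy.optimize import minimize

def multi_start(f, lower, upper, n_starts=20, seed=0):
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    best = None
    for _ in range(n_starts):
        x0 = rng.uniform(lower, upper)
        res = minimize(f, x0, method='Nelder-Mead')
        if best is None or res.fun < best.fun:
            best = res
    return best.x, best.fun
```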

Basin-hopping

Repeatedly perform a local search, then perturb the result and search again

  • https://doi.org/10.1155/2012/674832
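
SciPy ships an implementation as scipy.optimize.basinhopping; a usage sketch with a placeholder objective:

```python
# Basin-hopping via SciPy: local minimisation plus random perturbations.
import numpy as np
from scipy.optimize import basinhopping

def objective(x):
    return np.cos(14.5 * x[0] - 0.3) + (x[1] + 0.2) * x[1] + (x[0] + 0.2) * x[0]

result = basinhopping(objective, x0=[1.0, 1.0], niter=200,
                      minimizer_kwargs={'method': 'L-BFGS-B'})
print(result.x, result.fun)
```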

Derivative-free evolutionary methods and metaheuristics

Maintain some population of points that gets updated at each iteration

Genetic algorithms (GA)

Generate new points using crossover and mutation, estimate the fitness of all points in the population, remove the points with the worst scores

  • Note this depends on a ranking of fitness, so it may be possible to avoid evaluating f(x) if something that gives the same rankings is available

Differential evolution

Loop over each point in the population, generate a new proposal for each point (as in GA), and replace the point with that proposal if its f is lower (a minimal sketch follows this list)

  • Differential evolution
  • Self-adaptive DE (jDE, iDE, and pDE)
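
A minimal DE/rand/1/bin sketch of the loop described above; names and parameter values are illustrative only:

```python
# Differential evolution: for each point, build a mutant from three other
# points, crossover, and replace the point if the trial has a lower f.
import numpy as np

def differential_evolution(f, lower, upper, pop_size=30, F=0.8, CR=0.9,
                           n_gen=200, seed=0):
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    dim = len(lower)
    pop = rng.uniform(lower, upper, size=(pop_size, dim))
    scores = np.array([f(x) for x in pop])
    for _ in range(n_gen):
        for i in range(pop_size):
            a, b, c = pop[rng.choice([j for j in range(pop_size) if j != i],
                                     size=3, replace=False)]
            mutant = np.clip(a + F * (b - c), lower, upper)
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True           # keep at least one mutant coordinate
            trial = np.where(cross, mutant, pop[i])
            f_trial = f(trial)
            if f_trial < scores[i]:                   # replace if better
                pop[i], scores[i] = trial, f_trial
    best = np.argmin(scores)
    return pop[best], scores[best]
```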

Evolution strategies

Like GA, but mutate by adding a normally distributed random value to each point, with no crossover (a minimal sketch follows this list)

  • Can use ranking instead of objective function (see GA)

  • Stochastic ranking for constrained evolutionary optimisation

    • https://doi.org/10.1109/4235.873238
    • 2000
    • Global method
  • Improved stochastic ranking evolution strategy (ISRES)

    • https://doi.org/10.1109/TSMCC.2004.841906
    • 2005
    • Global method
  • (N+1)-ES Simple Evolutionary Algorithm
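
A minimal (1+1)-ES sketch of the mutation idea above, with the classic 1/5th success rule for step-size adaptation; names are illustrative only:

```python
# (1+1)-ES: mutate the single parent with Gaussian noise, keep the child if
# it is better, and adapt the mutation strength via the 1/5th success rule.
import numpy as np

def one_plus_one_es(f, x0, sigma=1.0, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    x, fx = np.asarray(x0, dtype=float), f(x0)
    for _ in range(n_iter):
        child = x + sigma * rng.standard_normal(x.shape)
        fc = f(child)
        if fc < fx:
            x, fx = child, fc
            sigma *= 1.5                 # success: expand the mutation strength
        else:
            sigma *= 1.5 ** (-0.25)      # failure: shrink (1/5th success rule)
    return x, fx
```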

Controlled random search (CRS)

Evolve a random population using a heuristic that's a bit like a randomised Nelder-Mead iteration

  • Requires constraints
  • https://doi.org/10.1093/comjnl/20.4.367
  • 1977
  • CRS2 with local mutation
    • https://doi.org/10.1007/s10957-006-9101-0
    • 2006

Swarm algorithms

  • ~Particle-swarm optimisation~

    • Several particles buzz around the search space randomly, but with a bias towards the previous personal and global best positions (a minimal sketch follows this list)
    • Unconstrained / Constrained-with-caveats
    • Particle-based method
    • Evolutionary / biologically inspired method
    • No proof of convergence
    • Global method
  • Ant colony optimization (see wikipedia)

  • Artificial bee colony algorithm (see wikipedia)

  • Artificial swarm intelligence

  • Bees algorithm (see wikipedia)

  • Cuckoo search (see wikipedia)

  • Glowworm swarm optimization (see wikipedia)

  • Particle swarm optimization generational (GPSO)
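
A minimal particle-swarm sketch of the description above (Pints already has its own PSO; this is just an illustration, with all names assumed):

```python
# PSO: each particle keeps a velocity pulled towards its own best position
# and the swarm's global best, with random weights on each pull.
import numpy as np

def pso(f, lower, upper, n_particles=30, n_iter=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    dim = len(lower)
    x = rng.uniform(lower, upper, size=(n_particles, dim))
    v = np.zeros_like(x)
    p_best, p_best_f = x.copy(), np.array([f(xi) for xi in x])
    g = np.argmin(p_best_f)
    g_best, g_best_f = p_best[g].copy(), p_best_f[g]
    for _ in range(n_iter):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
        x = np.clip(x + v, lower, upper)
        scores = np.array([f(xi) for xi in x])
        improved = scores < p_best_f
        p_best[improved], p_best_f[improved] = x[improved], scores[improved]
        g = np.argmin(p_best_f)
        if p_best_f[g] < g_best_f:
            g_best, g_best_f = p_best[g].copy(), p_best_f[g]
    return g_best, g_best_f
```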

Metaheuristics

  • Adaptive dimensional search (see wikipedia)
  • Bat algorithm (see wikipedia)
  • Colliding bodies optimisation (see wikipedia)
  • Cuttlefish optimization (see wikipedia)
  • Duelist algorithm (see wikipedia)
  • Flower pollination algorithm (see wikipedia)
  • Firefly algorithm (see wikipedia)
  • Gaussian adaptation (see wikipedia)
    • Apparently "based on information theory"
  • Gravitational search algorithm (see wikipedia)
  • Harmony search (see wikipedia)
    • Improved Harmony Search
  • Hunting search (see wikipedia)
  • Hydrological cycle algorithm (see wikipedia)
  • Imperialist competitive algorithm (see wikipedia)
  • Intelligent water drops algorithm (see wikipedia)
  • Killer whale algorithm (see wikipedia)
  • Mass and energy balances algorithm (see wikipedia)
  • Memetic algorithm (see wikipedia)
  • Rain water algorithm (see wikipedia)
  • River formation dynamics (see wikipedia)
  • Spiral optimization (see wikipedia)

Bayesian optimisation

Treat f as an unknown function, place a prior over it, evaluate f at some points, combine these with the prior to form a posterior, and use the posterior to decide where to evaluate next; repeat.
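
A minimal sketch of this loop, assuming scikit-learn's GaussianProcessRegressor as the surrogate and using expected improvement over a random candidate set; all names are illustrative, not a Pints API:

```python
# Bayesian optimisation sketch: fit a GP to the evaluated points, then pick
# the next point by maximising expected improvement over random candidates.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_opt(f, bounds, n_init=5, n_iter=25, n_cand=500, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    X = rng.uniform(lo, hi, size=(n_init, len(lo)))
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        cand = rng.uniform(lo, hi, size=(n_cand, len(lo)))
        mu, sd = gp.predict(cand, return_std=True)
        best = y.min()
        z = (best - mu) / np.maximum(sd, 1e-12)
        ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)  # expected improvement
        x_next = cand[np.argmax(ei)]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    i = np.argmin(y)
    return X[i], y[i]
```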

Gradient-estimating methods

Assume f' exists, and then approximate it

Finite difference methods

Approximate f' with finite differences, and find its root (a sketch follows this list)

  • Quasi-Newton methods
    • "Iteratively guess the root is where the tangent hits the x-axis"
    • Unconstrained
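
A minimal central-difference gradient helper, fed into a quasi-Newton call (SciPy's BFGS); the objective is a placeholder:

```python
# Finite differences: approximate each partial derivative of f, then hand
# the approximate gradient to a quasi-Newton optimiser.
import numpy as np
from scipy.optimize import minimize

def central_difference_gradient(f, x, h=1e-6):
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

f = lambda x: (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2
result = minimize(f, x0=[0.0, 0.0], method='BFGS',
                  jac=lambda x: central_difference_gradient(f, x))
```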

Simplex gradient methods / Implicit filtering

"Line-search using the simplex gradient"

Natural evolution strategies (NES)

Use a search population to estimate the "natural gradient", a gradient which takes different scaling of the coordinates into account

  • ~Covariance matrix adaptation evolution strategy (CMA-ES)~
  • ~Exponential natural evolution strategy (xNES)~
  • ~Separable natural evolution strategy (SNES)~

Surrogate-model methods

Replace f by some function g for which g' is easy to find

Trust-region methods

Select a region of the search space, approximate f within it (typically with a quadratic function), and move to the minimum of the approximation

  • Powell's Constrained Optimisation BY Linear Approximations (COBYLA) (constrained), linear function (a usage sketch follows this list)
  • Powell's UOBYQA (unconstrained), quadratic function
  • Powell's NEWUOA (unconstrained), quadratic function
    • Succeeded by BOBYQA
  • Powell's BOBYQA (constrained), quadratic function
  • Powell's LINCOA (constrained), quadratic function
  • Trust region reflective method (TRR)
  • Dog-leg trust-region algorithm
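
For reference, SciPy exposes COBYLA through scipy.optimize.minimize; a usage sketch with a placeholder objective and one inequality constraint:

```python
# COBYLA via SciPy: inequality constraints are functions that must stay
# non-negative at the solution.
from scipy.optimize import minimize

objective = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.5) ** 2
constraints = [{'type': 'ineq', 'fun': lambda x: x[0] + x[1] - 1.0}]  # x0 + x1 >= 1

result = minimize(objective, x0=[2.0, 0.0], method='COBYLA', constraints=constraints)
print(result.x)
```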

Data-based Online Nonlinear Extremum-seeker (DONE)

Use a Fourier-based surrogate model, then optimise on that with L-BFGS

Methods requiring the 1st-order gradient

Root finding methods

Find a point where f'(x) = 0 (and hope there's only one)

  • Newton's method
    • *Iteratively guess that the root is where the tangent hits the x-axis* (a minimal sketch follows this list)
    • Unconstrained
    • Truncated Newton methods
      • https://doi.org/10.1007/BF02592055
  • Levenberg-Marquardt algorithm
  • Broyden-Fletcher-Goldfarb-Shanno (BFGS)
    • Attempts to approximate the Hessian; works best if the function is twice-differentiable
    • Unconstrained
    • The bound-constrained variant is L-BFGS-B
    • Limited-memory BFGS (L-BFGS or LM-BFGS)
      • Like BFGS but with linear memory requirement, so good for high numbers of variables
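
A minimal sketch of Newton's method applied to root-finding on f'; derivatives are supplied by the caller and all names are illustrative:

```python
# Newton's method for f'(x) = 0: iterate x <- x - f'(x)/f''(x), i.e. jump to
# where the tangent of f' crosses the x-axis.
def newton_minimise(fprime, fsecond, x0, tol=1e-10, max_iter=100):
    x = float(x0)
    for _ in range(max_iter):
        step = fprime(x) / fsecond(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: minimise f(x) = (x - 3)^2, so f'(x) = 2(x - 3) and f''(x) = 2
x_min = newton_minimise(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=0.0)
```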

Gradient descent (aka Steepest descent)

  • Unconstrained / Constrained-with-projection
  • ~Basic gradient descent~
    • Walk down the steepest direction
    • Step size is fraction of gradient: sensitive parameter!
  • Conjugate gradient descent
  • Stochastic gradient descent (SGD)
    • Approximate or partially evaluate the gradient (e.g. on a mini-batch of the data; a minimal sketch follows this list)
    • For example, calculate the gradient in each direction separately, and perform a separate step in each direction
    • Implicit updates (ISGD)
      • Use gradient information after the step
    • ~Adaptive subgradient methods (AdaGrad)~ #1468
    • ~RMSProp or AdaDelta (very similar)~ #1470
    • ~Adaptive Moment Estimation (Adam)~
    • Natural gradient stochastic descent
      • Kalman-based stochastic gradient descent https://arxiv.org/abs/1512.01139
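
A minimal mini-batch SGD sketch on a simple least-squares problem; names and data are placeholders:

```python
# SGD: at each step, use the gradient of the objective on a random
# mini-batch of the data rather than the full gradient.
import numpy as np

def sgd_least_squares(A, b, lr=1e-3, epochs=50, batch=32, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(A.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(b))
        for start in range(0, len(b), batch):
            i = order[start:start + batch]
            grad = A[i].T @ (A[i] @ x - b[i])   # gradient of 0.5*||A_i x - b_i||^2
            x -= lr * grad
    return x

# Example data
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 3))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(500)
x_fit = sgd_least_squares(A, b)
```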

Continuation

Define a class of functions F(x, k), such that f'(x) = F(x, 1) and a solution to F(x*, 0) = 0 is known, then 'continue' k from 0 to 1 to find the x where f'(x) = 0.
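
A minimal 1-d sketch using one possible homotopy (not from the source): F(x, k) = k·f'(x) + (1 − k)·(x − x0), whose solution at k = 0 is x0; derivatives are supplied by the caller:

```python
# Continuation sketch: walk k from 0 towards 1, correcting with a few
# Newton steps on F(x, k) at each value of k, starting each step from the
# previous solution.
import numpy as np

def continuation_minimise(fprime, fsecond, x0, n_steps=20, newton_iters=20, tol=1e-10):
    x = float(x0)
    for k in np.linspace(0.0, 1.0, n_steps + 1)[1:]:
        for _ in range(newton_iters):
            F = k * fprime(x) + (1.0 - k) * (x - x0)
            dF = k * fsecond(x) + (1.0 - k)
            step = F / dF
            x -= step
            if abs(step) < tol:
                break
    return x

# Example: minimise f(x) = (x - 3)^2, starting the homotopy at x0 = 0
x_min = continuation_minimise(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=0.0)
```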

Others

  • Conservative convex separable approximation (CCSA)

    • https://doi.org/10.1137/S1052623499362822
    • 2006
    • Must have inequality constraints
    • Can deal with very high-dimensional problems
    • Some relation to Sequential Quadratic Programming (SQP)
    • Based on Method of Moving Asymptotes (MMA)
  • Shifted limited-memory variable metric methods

    • https://doi.org/10.1016/j.cam.2005.02.010
    • 2006

Methods requiring the 2nd-order gradient

?

Good resources for optimisers:

MichaelClerx avatar Feb 12 '19 07:02 MichaelClerx

This one sounds a bit different: https://link.springer.com/article/10.1007%2Fs10957-013-0354-0 But even the authors themselves didn't implement it...

MichaelClerx avatar Sep 07 '19 10:09 MichaelClerx

Here's a book by those authors. Seems mostly classical again, although chapter 2 is on GAs which is encouraging https://www.amazon.com/Derivative-Free-Optimization-Operations-Financial-Engineering/dp/3319689126/ref=sr_1_1?keywords=derivative-free+and+black+box+optimization&qid=1567851546&s=gateway&sr=8-1

MichaelClerx avatar Sep 07 '19 10:09 MichaelClerx

@martinjrobins says one of the global optimisers in scipy is quite good:

"Simplicial homology global optimization", or SHGO

MichaelClerx avatar Nov 12 '19 12:11 MichaelClerx

SHGO explanation and pseudocode here:

https://stefan-endres.github.io/shgo/files/shgo_defense.pdf

Says convergence to the global minimum is guaranteed for Lipschitz smooth functions; looks very mathsy!

martinjrobins avatar Nov 13 '19 17:11 martinjrobins

Looks interesting though!

MichaelClerx avatar Nov 13 '19 17:11 MichaelClerx

DIRECT sounds worth a look: https://link.springer.com/article/10.1007/s10898-020-00952-6

MichaelClerx avatar May 25 '21 14:05 MichaelClerx

Some papers about global ones: https://link.springer.com/search?query=&search-within=Journal&facet-journal-id=10898

Not many entries in 2021 on our type of problem

MichaelClerx avatar May 25 '21 15:05 MichaelClerx

Good place to look: https://numbbo.github.io/data-archive/

In particular,

  • https://numbbo.github.io/data-archive/bbob-noisy/
  • https://numbbo.github.io/ppdata-archive/

MichaelClerx avatar Jun 08 '22 10:06 MichaelClerx

PSO with proofs: https://arxiv.org/abs/2205.04880

https://github.com/akashspace/Consensus-based-opmization

MichaelClerx avatar Jun 14 '22 09:06 MichaelClerx