Optimization-and-Search
Implementation and visualization (some demos) of search and optimization algorithms.
Please add the MathJax Plugin for GitHub to your browser so the formulas below render correctly.
Table of Contents
- Optimization-and-Search
- Table of Contents
- 1. Numerical Optimization
- Gradient Descent
- Conjugate Descent
- Newton Method
- 2. Stochastic Search
- Simulated Annealing
- Cross-Entropy Methods
- Search Gradient
- 3. Classical Search
- A* search algorithm (A star search algorithm)
- minimax search
- 4. Dynamic Programming
- Value iteration
- Policy iteration
- 5.
- Monte Carlo Policy Evaluation
- Temporal Difference Policy Evaluation
- Tabular Q learning
- 6.
- Deep Q learning
- 7.
- Monte Carlo Tree Search
- 8.
- DPLL
1. Numerical Optimization
Here we try to solve a simple quadratic problem:
$$\arg\min_x \frac{1}{2}x^TQx + b^Tx$$
$$Q \succeq 0, \quad x \in \mathbb{R}^n$$
The left animation is tested with $n=10$, the right with $n=2$.
Gradient Descent
using the first-order gradient and a learning rate
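As a concrete sketch (the function name and the random test instance below are illustrative, not taken from the repo), gradient descent on this quadratic looks like:

```python
import numpy as np

def gradient_descent(Q, b, x0, lr, steps=200):
    """Minimize 0.5 * x^T Q x + b^T x with a fixed learning rate."""
    x = x0.copy()
    for _ in range(steps):
        grad = Q @ x + b        # first-order gradient of the quadratic
        x = x - lr * grad       # step against the gradient
    return x

# toy instance: random positive-definite Q, learning rate set from the largest eigenvalue
n = 10
A = np.random.randn(n, n)
Q = A @ A.T + n * np.eye(n)
b = np.random.randn(n)
x = gradient_descent(Q, b, np.zeros(n), lr=1.0 / np.linalg.eigvalsh(Q).max())
print(np.linalg.norm(Q @ x + b))   # gradient norm should be close to zero
```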
Conjugate Descent
$x^{k+1} := x^k + \alpha_k d^k$
using exact line search to compute the step size $\alpha_k$
$$\alpha_k = -\frac{\nabla f(x^k)^Td^k}{(d^k)^TQd^k}$$
then find the next conjugate direction
$$d^{k+1} = -\nabla f(x^{k+1}) + \beta_kd^k$$
$$ \beta_k = \frac{\nabla f(x^{k+1})^T\nabla f(x^{k+1})}{\nabla f(x^k)^T\nabla f(x^k)} \text{ (FR)}$$
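A minimal sketch of the same iteration in code, assuming the quadratic setup above (`conjugate_descent` is an illustrative name, not necessarily the repo's):

```python
import numpy as np

def conjugate_descent(Q, b, x0, tol=1e-10):
    """Minimize 0.5 * x^T Q x + b^T x with conjugate directions (Fletcher-Reeves)."""
    x = x0.copy()
    g = Q @ x + b                              # gradient at x^0
    d = -g                                     # first direction: steepest descent
    for _ in range(len(b)):                    # at most n steps for positive-definite Q
        if np.linalg.norm(g) < tol:
            break
        alpha = -(g @ d) / (d @ Q @ d)         # exact line search step size
        x = x + alpha * d
        g_new = Q @ x + b
        beta = (g_new @ g_new) / (g @ g)       # Fletcher-Reeves coefficient
        d = -g_new + beta * d                  # next conjugate direction
        g = g_new
    return x
```

For a positive-definite $Q$ this terminates in at most $n$ iterations (up to rounding error), which is what makes it attractive compared with plain gradient descent.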
Newton Method
using the second-order gradient (the Hessian)
$x^{k+1} := x^k + d^k$
$d^k = -[\nabla^2 f(x^k)]^{-1}\nabla f(x^k)$
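A sketch of the Newton step, solving a linear system instead of forming the inverse explicitly (names are illustrative):

```python
import numpy as np

def newton_method(Q, b, x0, steps=1):
    """Newton step d = -[H(x)]^{-1} grad f(x); for the quadratic, H = Q everywhere."""
    x = x0.copy()
    for _ in range(steps):
        grad = Q @ x + b
        d = -np.linalg.solve(Q, grad)   # solve Q d = -grad rather than inverting Q
        x = x + d
    return x
```

On a quadratic, a single Newton step already reaches the minimizer.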
2. Stochastic Search
Here we use stochastic search to solve the traveling salesman problem (TSP).
Simulated Annealing
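A minimal sketch of simulated annealing on TSP, using a 2-opt (segment reversal) neighbor and a geometric cooling schedule; the city coordinates, temperature settings and function names are illustrative, not the repo's exact setup:

```python
import numpy as np

def tour_length(tour, dist):
    return sum(dist[tour[i], tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def simulated_annealing(dist, T0=10.0, cooling=0.999, steps=20000, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(dist)
    tour = list(rng.permutation(n))
    cur_len = tour_length(tour, dist)
    best, best_len = tour[:], cur_len
    T = T0
    for _ in range(steps):
        i, j = sorted(rng.choice(n, size=2, replace=False))
        cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]   # 2-opt: reverse a segment
        cand_len = tour_length(cand, dist)
        # always accept better tours, accept worse tours with Boltzmann probability
        if cand_len < cur_len or rng.random() < np.exp((cur_len - cand_len) / T):
            tour, cur_len = cand, cand_len
            if cur_len < best_len:
                best, best_len = tour[:], cur_len
        T *= cooling   # geometric cooling schedule
    return best, best_len

# toy instance: random cities in the unit square
cities = np.random.default_rng(1).random((30, 2))
dist = np.linalg.norm(cities[:, None] - cities[None, :], axis=-1)
print(simulated_annealing(dist)[1])
```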
Cross-Entropy Methods
The cross-entropy method needs fewer steps to converge.
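A sketch of the cross-entropy idea applied to TSP: keep a matrix of city-to-city transition probabilities, sample tours from it, and refit the matrix to the elite (shortest) tours. The population size, elite fraction and smoothing factor below are illustrative guesses, not the repo's settings:

```python
import numpy as np

def cross_entropy_tsp(dist, pop=200, elite_frac=0.2, iters=100, smooth=0.7, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(dist)
    P = np.full((n, n), 1.0 / (n - 1))       # P[i, j]: prob. of visiting j right after i
    np.fill_diagonal(P, 0.0)
    n_elite = int(pop * elite_frac)
    best, best_len = None, np.inf
    for _ in range(iters):
        tours, lengths = [], []
        for _ in range(pop):
            tour, unvisited = [0], set(range(1, n))
            while unvisited:                  # sample the next city from the current row of P
                cand = np.array(sorted(unvisited))
                p = P[tour[-1], cand]
                nxt = int(rng.choice(cand, p=p / p.sum()))
                tour.append(nxt)
                unvisited.remove(nxt)
            tours.append(tour)
            lengths.append(sum(dist[tour[i], tour[(i + 1) % n]] for i in range(n)))
        order = np.argsort(lengths)
        if lengths[order[0]] < best_len:
            best, best_len = tours[order[0]], lengths[order[0]]
        counts = np.full((n, n), 1e-6)        # small floor keeps every edge possible
        for k in order[:n_elite]:             # refit transition counts to the elite tours
            t = tours[k]
            for i in range(n):
                counts[t[i], t[(i + 1) % n]] += 1
        np.fill_diagonal(counts, 0.0)
        P = smooth * counts / counts.sum(axis=1, keepdims=True) + (1 - smooth) * P
    return best, best_len
```

It can be run on the same `dist` matrix as the simulated-annealing sketch above.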
Search Gradient
Here the goal is to find the lowest point of a continuous function. For TSP I could not derive $\nabla_\theta \log (p_\theta (z_i))$; if you have any ideas, please feel free to comment. Thanks!
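For the continuous case, here is a sketch of the search gradient on a toy "find the lowest point" problem, using the log-likelihood trick with a Gaussian search distribution; the fitness shaping and the natural-gradient scaling are standard stabilizers added for the sketch, not necessarily what the repo does:

```python
import numpy as np

def search_gradient(f, mu0, sigma0=1.0, pop=50, lr=0.1, iters=200, rng=None):
    """Minimize f by descending the gradient of E_{z ~ N(mu, sigma^2 I)}[f(z)]
    with respect to (mu, log sigma), estimated via f(z) * grad log p(z)."""
    rng = rng or np.random.default_rng(0)
    mu = np.array(mu0, dtype=float)
    log_sigma = float(np.log(sigma0))
    for _ in range(iters):
        sigma = np.exp(log_sigma)
        eps = rng.standard_normal((pop, mu.size))
        fz = np.array([f(mu + sigma * e) for e in eps])
        fz = (fz - fz.mean()) / (fz.std() + 1e-8)        # fitness shaping (baseline)
        # grad log p(z): d/d mu = eps / sigma, d/d log sigma = ||eps||^2 - dim;
        # dividing by the Fisher information gives the (more stable) natural gradient
        mu -= lr * sigma * (fz[:, None] * eps).mean(axis=0)
        log_sigma -= lr * (fz * ((eps ** 2).sum(axis=1) - mu.size)).mean() / (2 * mu.size)
    return mu

# toy problem: the lowest point of a shifted bowl is at (3, 3)
print(search_gradient(lambda x: float(np.sum((x - 3.0) ** 2)), mu0=np.zeros(2)))
```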
3. Classical Search
A* search algorithm (A star search algorithm)
The code is written following the slides of CSE 257 at UCSD.
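A compact sketch of A* on a 4-connected grid with a Manhattan-distance heuristic (the grid encoding below is illustrative, not necessarily the repo's):

```python
import heapq

def a_star(grid, start, goal):
    """A* on a 4-connected grid; 0 = free cell, 1 = wall."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])   # admissible heuristic
    rows, cols = len(grid), len(grid[0])
    open_heap = [(h(start), 0, start)]          # entries are (f = g + h, g, node)
    g_score = {start: 0}
    parent = {}
    while open_heap:
        f, g, cur = heapq.heappop(open_heap)
        if cur == goal:                         # reconstruct the path back to start
            path = [cur]
            while cur in parent:
                cur = parent[cur]
                path.append(cur)
            return path[::-1]
        if g > g_score.get(cur, float("inf")):  # skip stale heap entries
            continue
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                if g + 1 < g_score.get(nxt, float("inf")):
                    g_score[nxt] = g + 1
                    parent[nxt] = cur
                    heapq.heappush(open_heap, (g + 1 + h(nxt), g + 1, nxt))
    return None

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(a_star(grid, (0, 0), (2, 0)))
```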
minimax search
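A generic minimax sketch; the game is abstracted behind `get_moves`, `apply_move` and `evaluate` callbacks (these names are illustrative, not necessarily the repo's):

```python
def minimax(state, depth, maximizing, get_moves, apply_move, evaluate):
    """Plain minimax: maximizer and minimizer alternate; evaluate at the leaves."""
    moves = get_moves(state)
    if depth == 0 or not moves:
        return evaluate(state), None
    best_move = None
    if maximizing:
        best = float("-inf")
        for m in moves:
            value, _ = minimax(apply_move(state, m), depth - 1, False,
                               get_moves, apply_move, evaluate)
            if value > best:
                best, best_move = value, m
    else:
        best = float("inf")
        for m in moves:
            value, _ = minimax(apply_move(state, m), depth - 1, True,
                               get_moves, apply_move, evaluate)
            if value < best:
                best, best_move = value, m
    return best, best_move

# toy game tree: internal nodes are lists, leaves are payoffs for the maximizer
tree = [[3, 5], [2, 9], [0, 7]]
value, move = minimax(tree, depth=2, maximizing=True,
                      get_moves=lambda s: list(range(len(s))) if isinstance(s, list) else [],
                      apply_move=lambda s, i: s[i],
                      evaluate=lambda s: s)
print(value, move)   # maximizer picks the branch whose minimum is largest -> 3, branch 0
```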
4. Dynamic Programming
Value iteration
Note that we obtain the optimal policy before the utility (value) function fully converges.
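A sketch of value iteration on a generic finite MDP with state rewards $R(s)$; the toy chain MDP at the bottom is illustrative:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[a][s, s'] = transition probability, R[s] = reward; returns V and greedy policy."""
    n_actions, n_states = len(P), len(R)
    V = np.zeros(n_states)
    while True:
        Q = np.array([R + gamma * P[a] @ V for a in range(n_actions)])  # shape (A, S)
        V_new = Q.max(axis=0)                                           # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

# toy chain MDP: 3 states, action 0 = stay, action 1 = move right; reward only in state 2
P = np.array([np.eye(3),
              [[0, 1, 0], [0, 0, 1], [0, 0, 1]]], dtype=float)
R = np.array([0.0, 0.0, 1.0])
V, policy = value_iteration(P, R)
print(V, policy)   # the greedy policy moves right in states 0 and 1
```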
Policy iteration
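A matching sketch: policy iteration alternates exact policy evaluation (a linear solve) with greedy improvement, and can be run on the same toy `P`, `R` as the value-iteration sketch above:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate exact policy evaluation and greedy policy improvement."""
    n_actions, n_states = len(P), len(R)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # policy evaluation: solve (I - gamma * P_pi) V = R exactly
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # policy improvement: act greedily with respect to V
        Q = np.array([R + gamma * P[a] @ V for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):   # stable policy -> optimal
            return V, policy
        policy = new_policy
```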
5.
Monte Carlo Policy Evaluation
- Policy: $\pi (s_i) = a_i$
- Trajectories: $s_0, s_1, s_2, \dots, s_T$
- Rewards: $R(s_0) + \gamma R(s_1) + \dots + \gamma^TR(s_T)$
Follow the policy to collect many sample trajectories, then estimate the expectation $$V^{\pi}(s_0) = \mathbb{E}_{\pi} \left[\sum_{i=0}^T\gamma^iR(s_i)\right]$$
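A sketch of this estimator; `run_episode` is a hypothetical helper that rolls out the policy from $s_0$ and returns the reward sequence:

```python
import numpy as np

def mc_policy_evaluation(run_episode, gamma=0.99, n_episodes=1000):
    """Estimate V_pi(s_0) by averaging discounted returns over sampled trajectories.
    run_episode() follows the policy from s_0 and returns [R(s_0), ..., R(s_T)]."""
    returns = []
    for _ in range(n_episodes):
        rewards = run_episode()
        G = sum(gamma ** i * r for i, r in enumerate(rewards))   # discounted return
        returns.append(G)
    return np.mean(returns)

# toy check: constant reward 1 for 10 steps -> estimate is sum of gamma^i for i < 10
print(mc_policy_evaluation(lambda: [1.0] * 10))
```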
Temporal Difference Policy Evaluation
The difference between TD and the MC method is that TD updates at every step and does not need to wait for a whole trajectory to be generated.
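A TD(0) sketch that updates after every transition; `env_step` is a hypothetical one-step simulator of the environment under the fixed policy:

```python
import numpy as np

def td0_policy_evaluation(env_step, n_states, s0, gamma=0.99, alpha=0.1, episodes=500):
    """TD(0): move V(s) toward r + gamma * V(s') after every single step.
    env_step(s) follows the policy for one step and returns (reward, next_state, done)."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        s, done = s0, False
        while not done:
            r, s_next, done = env_step(s)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # TD update toward the bootstrapped target
            s = s_next
    return V

# toy chain: always move right; reward 1 on entering the terminal state 3
def step(s):
    r = 1.0 if s == 2 else 0.0
    return r, s + 1, s + 1 == 3

print(td0_policy_evaluation(step, n_states=4, s0=0))
```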
Tabular Q learning
Q-learning is an off-policy method: $$V(s) = \max_a \mathbb{E}\left[R(s)+\gamma V(s') \mid s,a\right]$$
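A tabular Q-learning sketch (epsilon-greedy behavior policy, greedy bootstrap target); `env_step` is again a hypothetical one-step simulator, and the hyperparameters are illustrative:

```python
import numpy as np

def q_learning(env_step, n_states, n_actions, s0, gamma=0.99,
               alpha=0.1, eps=0.1, episodes=2000, rng=None):
    """Tabular Q-learning: behave eps-greedily, but bootstrap with max_a' Q(s', a')
    regardless of the action actually taken next (hence off-policy).
    env_step(s, a) returns (reward, next_state, done)."""
    rng = rng or np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = s0, False
        while not done:
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            r, s_next, done = env_step(s, a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Once trained, the greedy policy is simply `Q.argmax(axis=1)`.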
6.
Deep Q learning
In this case I used a deep Q-network (DQN) to train a self-driving car. The environment was created by NeuralNine (GitHub), though he used the NEAT package to train the car.
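The core DQN pieces look roughly like the PyTorch sketch below; the network sizes, the state encoding (e.g. the car's sensor readings) and everything around the update (replay buffer, epsilon-greedy exploration, target-network syncing) are assumptions for illustration, not the demo's exact code:

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small MLP mapping a state vector (e.g. the car's sensor readings) to Q-values."""
    def __init__(self, n_obs, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN step: regress Q(s, a) toward r + gamma * max_a' Q_target(s', a')."""
    s, a, r, s2, done = (torch.as_tensor(x, dtype=torch.float32) for x in batch)
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                       # target network is held fixed
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Wrapping this update in a loop that stores transitions in a replay buffer, samples minibatches, and periodically copies `q_net` into `target_net` gives the full agent.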
The demo is below: you can modify the network and train your own car!
Random case: