Ray icon indicating copy to clipboard operation
Ray copied to clipboard

Question about Progressive Widening

Open CGLemon opened this issue 3 years ago • 1 comments

Yuki Kobayashi:

According to your paper, I try to add the Progressive Widening in the MCTS. But I find the PV string is too long. Here is it. It is just only around 5000 playouts. I think it is weird result (The most visits child is PV node). The PV string length of Ray in the same condition is 2~4 (use lz-analyze with Sabaki).

Screenshot from 2022-10-23 11-11-03

Here is my simple implementation.

Progressive Widening

The recursive formula is

$t_{n+1} = t_n + 40 \times 1.4 n,\ t_0 = 0$

Assume $k$ is opened children number of current node. The $v$ is visits number of current node. It must be satisfied

$t_{k-1} < v < t_{k}$

UCT Selection

The UCT formula is borrowed from Ray.

$UCT_i = Q_i + C_{param} \sqrt{ \frac{ \Sigma n_a }{ n_i + 1 } } + 0.35 \times rate \times \sqrt{ \frac{1000}{1000 + \Sigma n_a} }$

  1. $\Sigma n_a$ means the visits of current node.
  2. $rate$ means the MM priority (already be normalized).

Search ordering

I order the child by MM priority. The best priority will be opened first.

I have no ideal what's wrong. What should I doubt first? Thanks!

CGLemon avatar Oct 23 '22 03:10 CGLemon

Length of PV depends on the number of expanded nodes. If your program expands a node when a leaf node visit is only one time, length of PV will be long. Ray expands a node every 40 visits on 19x19 (visit count depends on EXPAND_THRESHOLD parameter). When you use value network, EXPAND_THRESHOLD must be 1. On the other hand, when you use only Monte-Carlo simulation, EXPAND_THRESHOLD must be adjusted.

kobanium avatar Oct 24 '22 17:10 kobanium

Thank! The PV string looks better now.

CGLemon avatar Oct 28 '22 17:10 CGLemon

Do you have any ideal why we apply the EXPAND_THRESHOLD? Why is this trick work? I want to write the theory in my log. Thanks!

CGLemon avatar Oct 30 '22 16:10 CGLemon

When we used only Monte-Carlo simulation for evaluating positions, evaluations are unstable. Each simulation has a mix of wins and loses. This is because Monte-Carlo simulation uses random number for match simulations. EXPAND_THRESHOLD is used for waiting for evaluations converge. The number of EXPAND_THRESHOLD depends on style of Monte-Carlo simulation. Pachi and Fuego have much smaller EXPAND_THRESHOLD. Maybe the use of RAVE is one of factors.

kobanium avatar Nov 09 '22 17:11 kobanium