UCT score for child selection

Open morinim opened this issue 7 years ago • 0 comments

Not sure if the following is a real issue.

Given that:

- each node has an associated score (value);
- the score is from the POV of the player who is active in that node (e.g. see update(), where value += rewards[agent_id];)

is the get_best_uct_child function correct?

Specifically the line:

float uct_exploitation = (float)child->get_value() / (child->get_num_visits() + FLT_EPSILON);

encourages the selection of nodes with a greater value from the POV of the non-active player, i.e. the player who is active in the child rather than the one choosing at the parent. Shouldn't it be something like:

float uct_exploitation = (float)child->get_value(active_id) / (child->get_num_visits() + FLT_EPSILON);

i.e. get_value should return a value from the POV of a specific player.
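
To make the proposal concrete, here is a minimal sketch of what I have in mind. It is not the library's code: the Node struct, its per-player values vector, the best_uct_child_index helper and the exploration constant c are all hypothetical, and the exploration term is the usual UCB1 form, which may differ from what the library actually computes. The only point is that the exploitation term reads the child's value from the POV of the player who is active in the parent.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal node (not the library's type): it keeps one
// accumulated reward per player instead of a single value.
struct Node {
    int active_id;               // player to move in this node
    int num_visits;              // number of times the node was visited
    std::vector<float> values;   // values[p] = total reward seen by player p
    std::vector<Node> children;

    // Value from the POV of a specific player.
    float get_value(int player_id) const { return values[player_id]; }
};

// Returns the index of the child with the highest UCT score, scored from
// the POV of the player who chooses in `parent`. `c` is an assumed
// exploration constant.
inline std::size_t best_uct_child_index(const Node& parent, float c) {
    assert(!parent.children.empty());
    std::size_t best = 0;
    float best_score = -FLT_MAX;
    for (std::size_t i = 0; i < parent.children.size(); ++i) {
        const Node& child = parent.children[i];
        // Exploitation: average reward for the parent's active player.
        float exploitation =
            child.get_value(parent.active_id) / (child.num_visits + FLT_EPSILON);
        // Exploration: standard UCB1 term (an assumption, see above).
        float exploration = std::sqrt(
            std::log(parent.num_visits + 1.0f) / (child.num_visits + FLT_EPSILON));
        float score = exploitation + c * exploration;
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}
```

With this shape, a parent whose active_id is 0 ranks its children by their player-0 averages, however good they look to player 1.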

A basic example:

ROOT (active_id == 0)
   +---CHILD1  (active_id == 1, value = 100, visits = 100)
   +---CHILD2  (active_id == 1, value = 100, visits = 100)
   +---CHILD3  (active_id == 1, value = 3, visits = 100)

get_best_uct_child favours CHILD1 and CHILD2 (greater get_value() / get_num_visits() ratio), even though those values are from player 1's POV and ROOT's active player is player 0. Is that right?
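
The arithmetic for this example can be checked with a small standalone sketch. ChildStats, exploitation, argmax_exploitation and example_children are hypothetical names, not library code. The flip option rests on a strong assumption that may not hold for the library: a two-player zero-sum game with per-visit rewards in [0, 1], so that the parent's average is 1 minus the child's average.

```cpp
#include <cassert>
#include <cfloat>
#include <cstddef>
#include <vector>

// Hypothetical container for the two numbers used by the exploitation term.
struct ChildStats {
    float value;   // accumulated reward from the POV of the child's active player
    int visits;
};

// Same ratio as the line quoted in the issue.
inline float exploitation(const ChildStats& s) {
    return s.value / (s.visits + FLT_EPSILON);
}

// Index of the child with the highest exploitation score (first wins on ties).
// If `flip` is true, the ratio is converted to the parent's POV as 1 - ratio,
// which is only valid under the zero-sum, rewards-in-[0, 1] assumption.
inline std::size_t argmax_exploitation(const std::vector<ChildStats>& children,
                                       bool flip) {
    assert(!children.empty());
    std::size_t best = 0;
    float best_score = -FLT_MAX;
    for (std::size_t i = 0; i < children.size(); ++i) {
        float e = exploitation(children[i]);
        float score = flip ? 1.0f - e : e;
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}

// The three children from the example: CHILD1, CHILD2, CHILD3.
inline std::vector<ChildStats> example_children() {
    return {{100.0f, 100}, {100.0f, 100}, {3.0f, 100}};
}
```

Without the flip, the ratios are 1.0, 1.0 and 0.03, so index 0 (CHILD1) is chosen. With the flip, index 2 (CHILD3) is chosen, which is what I would expect player 0 to prefer if the stored values really are from player 1's POV.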

Jan 24 '18 13:01 morinim