ofxMSAmcts
UCT score for child selection
Not sure if the following is a real issue.
Given that:
- each node has an associated score (value);
- the score is from the POV of the active player in the specific node (e.g. see update(), where value += rewards[agent_id];)
is the get_best_uct_child function correct?
Specifically the line:
float uct_exploitation = (float)child->get_value() / (child->get_num_visits() + FLT_EPSILON);
encourages the selection of nodes with a greater value from the POV of the non-active player, i.e. the player who moves at the child rather than the one choosing at the parent. Shouldn't it be something like:
float uct_exploitation = (float)child->get_value(active_id) / (child->get_num_visits() + FLT_EPSILON);
i.e. get_value should return a value from the POV of a specific player.
A basic example:
ROOT (active_id == 0)
+---CHILD1 (active_id == 1, value = 100, visits = 100)
+---CHILD2 (active_id == 1, value = 100, visits = 100)
+---CHILD3 (active_id == 1, value = 3, visits = 100)
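get_best_uct_child favours CHILD1 and CHILD2 (greater get_value() / get_num_visits() ratio: 1.0 versus 0.03 for CHILD3), even though those values are from player 1's POV and the choice is being made by player 0. Is that right?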
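
To make the example concrete, here is a minimal sketch of the two ways of computing the exploitation term. It does not use the ofxMSAmcts classes: Node, exploitation, best_child and example_children are hypothetical names, and storing one accumulated value per agent is an assumption about how get_value(active_id) could work. Player 0's totals are also assumed (rewards in [0, 1] that sum to 1 per visit), since the tree above only gives player 1's values. The exploration term is left out because all three children have the same visit count.

```cpp
#include <cassert>
#include <cfloat>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tree node; not the ofxMSAmcts node class.
struct Node {
    int active_id;              // agent to move at this node
    std::vector<float> values;  // accumulated reward per agent (assumed layout)
    int num_visits;
};

// Average reward of `child` as seen by agent `pov_id`.
float exploitation(const Node& child, int pov_id) {
    return child.values[pov_id] / (child.num_visits + FLT_EPSILON);
}

// Index of the child with the highest exploitation term for `pov_id`.
// Ties keep the earliest child.
std::size_t best_child(const std::vector<Node>& children, int pov_id) {
    assert(!children.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < children.size(); ++i) {
        if (exploitation(children[i], pov_id) > exploitation(children[best], pov_id)) {
            best = i;
        }
    }
    return best;
}

// CHILD1..CHILD3 from the example. Player 1's totals come from the tree above;
// player 0's totals are assumed to be visits minus player 1's total.
std::vector<Node> example_children() {
    return {
        {1, {0.0f, 100.0f}, 100},  // CHILD1
        {1, {0.0f, 100.0f}, 100},  // CHILD2
        {1, {97.0f, 3.0f}, 100},   // CHILD3
    };
}
```

Under these assumptions, best_child(example_children(), 1) returns index 0 (CHILD1), which matches what the current code favours, while best_child(example_children(), 0) returns index 2 (CHILD3), which is the choice that is actually good for the player moving at ROOT.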