muzero-general Question about the perspective transformation of two players when calculating Q?

Open puyuan1996 opened this issue 1 year ago • 0 comments

Thank you very much for open-sourcing this code.

I am confused by the two-player branch (len(self.config.players) == 2) of the backpropagate method in self_play.py:

Line 423: min_max_stats.update(node.reward + self.config.discount * -node.value()). Why is -node.value() used here rather than node.value()? In my understanding, node.value() is calculated from the perspective of the player corresponding to the node.
Line 425: value = (-node.reward if node.to_play == to_play else node.reward) + self.config.discount * value. When node.to_play == to_play is True, why is -node.reward + self.config.discount * value used rather than node.reward + self.config.discount * value?
Is it because node.reward is obtained from the perspective of the parent node of the current node?
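To make the question concrete, here is a minimal sketch that traces the signs on a two-node path. It only mirrors the two quoted lines and is not the repository's actual code: the Node class and the function name backpropagate_two_player are hypothetical, the value_sum update line is not quoted above and is my assumption about how values are accumulated, and discount is fixed to 1.0 for readability. It assumes node.reward belongs to the player who moved into the node (the parent's player), while node.value() belongs to node.to_play.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for an MCTS node; not the repository's class."""
    to_play: int
    reward: float = 0.0      # assumed: earned by the player who moved INTO this node
    value_sum: float = 0.0   # accumulated from the perspective of `to_play`
    visit_count: int = 0

    def value(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def backpropagate_two_player(search_path, value, to_play, discount=1.0):
    """Sketch of the two-player branch; `value` is from the view of `to_play`."""
    q_values = []
    for node in reversed(search_path):
        # Assumption (not quoted in the issue): store the value from the
        # perspective of the player to move at this node.
        node.value_sum += value if node.to_play == to_play else -value
        node.visit_count += 1
        # Quoted line 423: Q as seen by the parent, who received the reward
        # and is the opponent of node.to_play, hence the minus sign.
        q_values.append(node.reward + discount * -node.value())
        # Quoted line 425: if node.to_play == to_play, the reward went to the
        # opponent of `to_play`, so it is negated before being passed up.
        value = (-node.reward if node.to_play == to_play else node.reward) + discount * value
    return q_values


# Player 0 moves from root to child and receives reward 1.0; player 1 is to
# move at the child, which is the leaf and has value 0.0 for player 1.
root = Node(to_play=0)
child = Node(to_play=1, reward=1.0)
q_values = backpropagate_two_player([root, child], value=0.0, to_play=1)
```

Under these assumptions the child's Q seen from the root is 1.0 and root.value() is 1.0, so the move looks good for player 0, which is consistent with node.reward being expressed from the parent's perspective.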

Looking forward to your reply!

Oct 25 '22 14:10 puyuan1996