muzero-general

Question about the perspective switch between the two players when calculating Q

Open puyuan1996 opened this issue 1 year ago • 0 comments

Thank you very much for open-sourcing this code.

I am confused by the following code in the backpropagate method of self_play.py, in the branch where len(self.config.players) is 2:

  • In line 423, min_max_stats.update(node.reward + self.config.discount * -node.value()), why is -node.value() used rather than node.value()? In my understanding, node.value() is calculated from the perspective of the player corresponding to that node.

  • In line 425, value = (-node.reward if node.to_play == to_play else node.reward) + self.config.discount * value: when node.to_play == to_play is True, why is -node.reward + self.config.discount * value used rather than node.reward + self.config.discount * value?

  • Is it because node.reward is obtained from the perspective of the parent node of the current node?
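To make the question concrete, here is a minimal, self-contained sketch built around the two quoted lines. It is not the repository's code: the `Node` class, the `backpropagate_two_player` function name, the `seen` list (standing in for `min_max_stats.update`), and the `value_sum` update line are assumptions added for illustration. Only the expressions marked "quoted" come from the lines cited above.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a search-tree node."""
    to_play: int            # player to move at this node
    reward: float = 0.0     # reward attached to the transition into this node
    value_sum: float = 0.0
    visit_count: int = 0

    def value(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def backpropagate_two_player(search_path, value, to_play, discount, seen):
    """Walk the path from leaf to root; `value` starts from `to_play`'s perspective."""
    for node in reversed(search_path):
        # Assumption (not quoted in the issue): each node stores its value
        # from the perspective of its own player, node.to_play.
        node.value_sum += value if node.to_play == to_play else -value
        node.visit_count += 1
        # Quoted expression from line 423 (recorded in a list instead of min_max_stats).
        seen.append(node.reward + discount * -node.value())
        # Quoted expression from line 425.
        value = (-node.reward if node.to_play == to_play else node.reward) + discount * value


# Two-node path: player 0 moves at the root, player 1 is to move at the child.
root = Node(to_play=0)
child = Node(to_play=1, reward=1.0)
seen = []
# Leaf value 0.5 is given from player 1's perspective (to_play=1).
backpropagate_two_player([root, child], 0.5, to_play=1, discount=1.0, seen=seen)

print(child.value(), root.value(), seen)
```

In this toy run the root ends up with value 0.5, which equals child.reward - child.value(), the same quantity line 423 passes to min_max_stats for the child. That is consistent with reading node.reward as belonging to the player who moved into the node, but the sketch only shows the arithmetic and does not confirm the author's intent.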

Looking forward to your reply!

puyuan1996 • Oct 25 '22 14:10