muzero-general
Question about the perspective transformation of two players when calculating Q?
Thank you very much for open-sourcing this code.
I am confused about the two-player branch of the `backpropagate` method in `self_play.py` (the case where `len(self.config.players)` is 2):

- In line 423, `min_max_stats.update(node.reward + self.config.discount * -node.value())`: why do we use `-node.value()` rather than `node.value()` here? In my understanding, `node.value()` is calculated from the perspective of the player corresponding to the `node`.
- In line 425, `value = ( -node.reward if node.to_play == to_play else node.reward ) + self.config.discount * value`: when `node.to_play == to_play` is `True`, why do we use `-node.reward + self.config.discount * value` rather than `node.reward + self.config.discount * value`?
- Is it because `node.reward` is obtained from the perspective of the parent node of the current `node`?
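
To make the question concrete, here is a minimal sketch of how I read the two-player branch. The `Node` class, the `value_sum` update, and the example numbers are my own assumptions for illustration, not the repository's actual code; only the two lines marked as quoted come from `self_play.py`, and I replaced `min_max_stats.update(...)` with a plain list so the sketch runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for the search-tree node (assumed fields)."""
    to_play: int
    reward: float = 0.0
    value_sum: float = 0.0
    visit_count: int = 0

    def value(self) -> float:
        # Mean backed-up value; 0 for an unvisited node.
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def backpropagate(search_path, value, to_play, discount=1.0):
    """Two-player backup. `value` is given from the perspective of `to_play`."""
    q_values = []
    for node in reversed(search_path):
        # Assumption: each node stores its value from its own player's view.
        node.value_sum += value if node.to_play == to_play else -value
        node.visit_count += 1
        # Quoted line 423, with min_max_stats.update(...) replaced by a list.
        q_values.append(node.reward + discount * -node.value())
        # Quoted line 425.
        value = (-node.reward if node.to_play == to_play else node.reward) + discount * value
    return q_values


# Player 0 moves from `root` into `child` and (by assumption) earns reward 1.0;
# the leaf evaluation 0.5 is from the view of player 1, who moves at `child`.
root = Node(to_play=0)
child = Node(to_play=1, reward=1.0)
q_values = backpropagate([root, child], value=0.5, to_play=1)
```

If my guess is right that `node.reward` is the reward earned by the parent's player for the move into `node`, then `node.reward + discount * -node.value()` would be the Q-value as seen by the parent, which would explain both signs.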
Looking forward to your reply!