rezunli96
I encountered the same problem. Have you solved it?
I also added an example using IS-MCTS as the BR in the latest PR. It is implemented similarly to PSRO-MCTS.
Is it because there is a jitted function that calls BestResponsePolicy? If so, I think it is normally not solvable, because the execution flow of a jitted JAX...
Hi, pathfinding seems to be a good one. You can actually specify `kExampleMultiAgentGrid` to customize the grid you want. This game is general-sum, e.g., see [this paper](https://www.jmlr.org/papers/volume4/hu03a/hu03a.pdf)
It just occurred to me that the sampled trajectory is an unbiased estimator of the MF-Value? It works for REINFORCE-like AC. But I am still confused about how to calculate it for off-policy RL...
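To make the "unbiased estimator" point concrete, here is a minimal sketch on a toy problem. It is not a mean-field game: the one-state chain, and the names `sample_return`, `exact_value`, and `monte_carlo_value`, are all made up for illustration. It only shows that averaging on-policy sampled returns approaches the exact value, and it says nothing about the off-policy case.

```python
import random


def sample_return(p_continue, reward, gamma, rng):
    """Sample one discounted return from a toy one-state chain.

    Every step pays `reward`; after each step the episode continues
    with probability `p_continue`, otherwise it terminates.
    """
    g, discount = 0.0, 1.0
    while True:
        g += discount * reward
        if rng.random() >= p_continue:
            return g
        discount *= gamma


def exact_value(p_continue, reward, gamma):
    """Closed form of V = reward + gamma * p_continue * V."""
    return reward / (1.0 - gamma * p_continue)


def monte_carlo_value(n, p_continue, reward, gamma, seed=0):
    """Average of n sampled returns, i.e. the on-policy Monte Carlo estimate."""
    rng = random.Random(seed)
    total = sum(sample_return(p_continue, reward, gamma, rng) for _ in range(n))
    return total / n


if __name__ == "__main__":
    print("exact:", exact_value(0.9, 1.0, 0.95))
    print("monte carlo:", monte_carlo_value(20000, 0.9, 1.0, 0.95))
```

Each sampled return has the exact value as its expectation, so the average gets closer as the number of samples grows. Off-policy data would need some correction (for example importance weights), which this sketch does not attempt.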
Hi @thorsten-j, do you know which version of Mahjong Gravon has? The one I am currently looking into is Riichi Mahjong.
@thorsten-j Sounds great! Then I think we do not really have a conflict here and we can work on our own versions of Mahjong separately. Maybe there is a unified...
Hi, sorry for the late reply. I just noticed that the external version still seems to have this issue.