on neural replicator dynamics
Hi guys, I am trying to implement the neural replicator dynamics (NeuRD) algorithm from the paper "Neural Replicator Dynamics: Multiagent Learning via Hedging Policy Gradients" (https://arxiv.org/pdf/1906.00190.pdf).
The pseudocode is given in the paper.
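To check my understanding, here is a rough tabular sketch of the update as I read it (just my own paraphrase in Python, not the exact pseudocode from the paper; the learning rate is arbitrary):

```python
import numpy as np

def softmax(y):
    z = y - y.max()
    e = np.exp(z)
    return e / e.sum()

def neurd_tabular_step(logits, q_values, lr=0.1):
    """One tabular NeuRD-style step for a single state.

    logits:   [num_actions] un-normalized policy parameters y.
    q_values: [num_actions] action values q(s, .) for this state.
    """
    policy = softmax(logits)
    baseline = policy @ q_values          # v(s) = sum_a pi(a|s) q(s, a)
    advantages = q_values - baseline      # A(s, a)
    # Hedge-style update: the logits move directly along the advantages,
    # without the pi(a) factor that the usual softmax policy gradient has.
    return logits + lr * advantages
```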
I have a few questions, and any suggestions are appreciated.
(1) The critic network outputs q-values, not v-values. Is that correct? Are there any benefits to using q-values? The critic network in A2C outputs v-values, so I would have expected a v-value critic here.
(2) The algorithm first samples some trajectories and then does policy evaluation, which looks like Monte Carlo sampling. Is that right? A2C uses the TD error for its updates. Would it be a good idea to use the TD error for the policy evaluation here as well?
(3) A2C and neural RD look very similar: both have a critic network and a policy network. The only difference I see is that the final softmax layer is cut off in neural RD. If that is right, can I just cut off the final softmax layer in A2C and train the rest of the network?
Hi @linshuxi,
You may be interested in some recent PRs: we got a sample-based NeuRD contribution and a PyTorch full-width contribution; check out #892 and #891.
(1) Yes, a q-based critic has lower variance. Look up "all-actions policy gradient"; there are some references in the RPG paper: https://arxiv.org/abs/1810.09026 (Section 3, page 4). The v-function is basically the sample-corrected version you use when you don't have a q-based critic (see also the policy gradient chapter in Sutton & Barto).
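Roughly, the contrast between the two estimators looks something like this (a sketch only; the function names and shapes are illustrative, not the exact losses in the OpenSpiel code):

```python
import torch
import torch.nn.functional as F

def all_actions_pg_loss(logits, q_values):
    """All-actions (q-critic) actor loss for one state.

    logits:   [num_actions] policy network output.
    q_values: [num_actions] critic estimates q(s, a) for every action.
    Every action's q-value enters the gradient, so no single sampled
    action has to stand in for the whole expectation -> lower variance.
    """
    policy = F.softmax(logits, dim=-1)
    return -(policy * q_values.detach()).sum()

def sampled_pg_loss(logits, action, sampled_return, v_baseline):
    """REINFORCE-style actor loss: one sampled action and return (noisier)."""
    log_pi = F.log_softmax(logits, dim=-1)[action]
    return -log_pi * (sampled_return - v_baseline)
```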
(2) Yes, correct (and we also have "full-width" versions in OpenSpiel which compute these values exactly). The q-function can be trained using bootstrapping, yes, much like in A2C; the only difference is that the networks are q-networks (state-action networks).
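For example, a one-step bootstrapped target for a q-critic could look something like the following (an expected-SARSA-style sketch, not the exact target used in the OpenSpiel implementations); the critic loss is then just the squared error between q(s, a) and this target:

```python
import torch

def q_bootstrap_target(reward, discount, next_q_values, next_policy):
    """One-step bootstrapped target for q(s, a).

    reward:        scalar reward r_t.
    discount:      gamma, set to 0 at terminal states.
    next_q_values: [num_actions] critic output q(s', .).
    next_policy:   [num_actions] current policy pi(. | s').
    """
    next_v = (next_policy * next_q_values).sum()   # v(s') under the current policy
    return reward + discount * next_v.detach()     # stop gradients through the target
```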
(3) Yes, NeuRD shares its lineage with policy gradient, hence the similarities. Correct, that is the main difference (but it is also a crucial difference, for all the reasons mentioned in the paper). I guess you're suggesting replacing q_t with the sampled return, just like in REINFORCE? I suppose you can do that, sure! It will just be noisier.
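To make the "cut off the softmax" point concrete, the actor losses could be sketched like this (illustrative only, not the NeuRD code in the repo):

```python
import torch
import torch.nn.functional as F

def neurd_actor_loss(logits, q_values):
    """NeuRD-style loss: the gradient flows straight into the logits."""
    policy = F.softmax(logits, dim=-1)
    advantages = (q_values - (policy * q_values).sum()).detach()
    # Only the raw logits carry gradients here, i.e. the softmax Jacobian
    # (the pi(a) factor) is "cut off" relative to the loss below.
    return -(logits * advantages).sum()

def softmax_pg_loss(logits, q_values):
    """Standard all-actions softmax policy gradient (what A2C approximates
    with samples), for comparison."""
    policy = F.softmax(logits, dim=-1)
    advantages = (q_values - (policy * q_values).sum()).detach()
    return -(policy * advantages).sum()

def neurd_sampled_return_loss(logits, action, sampled_return, v_baseline):
    """The REINFORCE-like variant discussed above: replace q_t with a
    single sampled return (works, but noisier)."""
    return -logits[action] * (sampled_return - v_baseline)
```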
If you run experiments with your v-function version of NeuRD, I'd be curious to see how it compares to the graphs in https://github.com/deepmind/open_spiel/issues/891, and maybe you could consider contributing it? I'd recommend taking a look at that thread and #892 first.
Hi @lanctot, thanks a lot for the response, it is very useful. I will share the results when I finish the experiments.