PyTorch-Counterfactual-Multi-Agent-Policy-Gradients-COMA
Leverage the use of recurrent modules
Hi, I am wondering whether the oscillation during training comes from the fact that the actor nets include only feed-forward (down-sampling) layers. In partially observable domains, an agent's information state should be its whole action-observation history rather than the current one-step observation, so adding a recurrent module such as an LSTM or GRU might help (see the sketch below).
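
For illustration, here is a minimal PyTorch sketch of what a GRU-based actor could look like. The class name `RecurrentActor` and the dimensions are placeholders of my own, not taken from this repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentActor(nn.Module):
    """Actor that summarizes the observation history in a GRU hidden state."""

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def init_hidden(self, batch_size: int = 1) -> torch.Tensor:
        # Reset the hidden state to zeros at the start of each episode.
        return torch.zeros(batch_size, self.gru.hidden_size)

    def forward(self, obs: torch.Tensor, hidden: torch.Tensor):
        x = F.relu(self.fc1(obs))
        # The GRU hidden state carries information across timesteps,
        # acting as a summary of the agent's history.
        h = self.gru(x, hidden)
        probs = F.softmax(self.fc2(h), dim=-1)
        return probs, h
```

The key difference from a purely feed-forward actor is that the caller keeps one hidden state per agent, passes it back in at every timestep, and resets it via `init_hidden` when a new episode begins, so the policy is conditioned on the history instead of the single current observation.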