Multi-Agent-Transformer
Question about the monotonic improvement guarantee of MAT.
Great work!
I am very interested in why MAT can maintain the monotonic improvement guarantee while avoiding sequential updates.
To guarantee monotonic improvement, HAPPO updates each policy one by one during training, leveraging the previous agents' update results. That means that to update ${\pi}^2_{old}$, we have to wait for ${\pi}^1_{new}$.
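To make the sequential dependence concrete, here is a toy Python sketch of this kind of one-by-one update scheme (`ToyPolicy`, `step`, and the scalar "policy" are hypothetical illustrations, not taken from the HAPPO or MAT codebases):

```python
class ToyPolicy:
    """Toy one-parameter policy: the probability of the 'good' action."""
    def __init__(self, p):
        self.p = p

    def prob(self):
        return self.p

    def step(self, weighted_advantage, lr=0.1):
        # Move the probability in proportion to the (ratio-weighted)
        # advantage; clip so it stays a valid probability.
        self.p = min(0.99, max(0.01, self.p + lr * weighted_advantage))


def sequential_update(policies, advantage):
    """HAPPO-style sequential scheme (illustrative): agent m's update is
    weighted by the product of ratios pi_new/pi_old of agents 1..m-1 that
    were ALREADY updated -- so pi^2 cannot be updated until pi^1_new exists.
    """
    ratio_product = 1.0
    for pi in policies:            # update agents one by one, in order
        old_p = pi.prob()
        pi.step(ratio_product * advantage)
        ratio_product *= pi.prob() / old_p
    return ratio_product


agents = [ToyPolicy(0.5), ToyPolicy(0.5)]
sequential_update(agents, advantage=1.0)
```

Note how agent 2's effective advantage (`ratio_product * advantage`) already contains agent 1's post-update policy, which is exactly the dependence that forces the updates to be sequential.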
The paper discusses this issue only briefly:
After carefully checking the HAPPO paper, I found that MAT's Eq. 5 is not the same as Eq. 11 in the HAPPO paper. Specifically, MAT's Eq. 5 ignores the first term of $M^{i_{1:m}}$, which depends on the previous agents' update results, e.g., ${\pi}^1_{new}$.
Can you explain why Eq. 5 can still guarantee monotonic improvement?
This question has been bothering me for a long time, and I look forward to your reply.