Chen Fang
Since you want to use MARLlib's algorithms, I guess you may need to override the abstract class `MultiAgentEnv` provided by Ray, or write a wrapper for the algorithm...
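Something like this, as a rough sketch of the wrapper idea built directly on Ray's `MultiAgentEnv` interface (the underlying `MyRawEnv` and the agent ids are placeholders, and MARLlib may expect extra environment metadata on top of this):

```python
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class MyRawEnvWrapper(MultiAgentEnv):
    """Wraps a hypothetical environment into RLlib's multi-agent dict API."""

    def __init__(self, env_config):
        self.env = MyRawEnv(**env_config)      # placeholder for your own env
        self.agents = ["agent_0", "agent_1"]   # agent ids RLlib will key on

    def reset(self):
        obs = self.env.reset()
        # RLlib expects {agent_id: obs} dicts
        return {aid: obs[i] for i, aid in enumerate(self.agents)}

    def step(self, action_dict):
        obs, rewards, done, info = self.env.step(
            [action_dict[aid] for aid in self.agents]
        )
        return (
            {aid: obs[i] for i, aid in enumerate(self.agents)},
            {aid: rewards[i] for i, aid in enumerate(self.agents)},
            {"__all__": done},
            {aid: info for aid in self.agents},
        )
```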
You can try overriding the `postprocess_fn` in `PPOTorchPolicy`. In MARLlib, one such example is: https://github.com/Replicable-MARL/MARLlib/blob/368c6173577d0f9c0ad70fb5b4b6afa12c864c15/marllib/marl/algos/core/CC/coma.py#L116-L125. The signature of `postprocess_fn` is fixed:
```
postprocess_fn(policy: Policy, sample_batch: SampleBatch,
               other_agent_batches=None, episode=None) -> SampleBatch
```
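To make that concrete, a minimal sketch (not MARLlib's actual code) of swapping in a custom `postprocess_fn` via `with_updates`, in the spirit of the linked COMA example; it assumes the Ray 1.x policy-template API that MARLlib builds on, and the reward tweak is just a placeholder:

```python
from ray.rllib.agents.ppo.ppo_torch_policy import PPOTorchPolicy
from ray.rllib.evaluation.postprocessing import compute_gae_for_sample_batch
from ray.rllib.policy.sample_batch import SampleBatch


def my_postprocess(policy, sample_batch, other_agent_batches=None, episode=None):
    # Placeholder edit of the collected trajectory (e.g. reward shaping).
    sample_batch[SampleBatch.REWARDS] = sample_batch[SampleBatch.REWARDS] * 1.0
    # Keep PPO's usual GAE postprocessing so the loss still finds advantages.
    return compute_gae_for_sample_batch(
        policy, sample_batch, other_agent_batches, episode
    )


CustomPPOTorchPolicy = PPOTorchPolicy.with_updates(
    name="CustomPPOTorchPolicy",
    postprocess_fn=my_postprocess,
)
```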
@nikhil-pitta it is called before both the policy gradient and value function gradient. The pipeline is basically: extra_action_out_fn → postprocess_fn → loss_fn → compute_gradients → apply_gradients.
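As an illustration of that ordering: whatever `extra_action_out_fn` returns is stored as extra columns in the rollout batch, which `postprocess_fn` can then read before `loss_fn` ever sees the batch. The names below are illustrative, not from MARLlib:

```python
def add_value_column(policy, input_dict, state_batches, model, action_dist):
    # Runs at action-computation time; the returned dict becomes extra
    # columns in the collected sample batch.
    return {"my_value_estimate": model.value_function()}


def use_value_column(policy, sample_batch, other_agent_batches=None, episode=None):
    # Runs after the rollout is collected and before loss_fn, so the column
    # written above is already available here.
    value_estimates = sample_batch["my_value_estimate"]
    # ... use value_estimates to rewrite rewards, advantages, etc.
    return sample_batch
```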
@nikhil-pitta You mentioned "augment our current step/collected experiences and add to the replay buffer", and that sounds exactly like what `postprocess_fn` does, as in our earlier discussion. This extra function applies to...
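A rough sketch of that idea, assuming the standard `SampleBatch` API (the noise-based augmentation is only a placeholder; whether such an augmentation is valid for your algorithm is up to you):

```python
import numpy as np

from ray.rllib.policy.sample_batch import SampleBatch


def augmenting_postprocess(policy, sample_batch, other_agent_batches=None, episode=None):
    # Duplicate the collected transitions with perturbed observations, so the
    # batch handed to the replay buffer / trainer contains the extra experiences.
    augmented = sample_batch.copy()
    augmented[SampleBatch.OBS] = augmented[SampleBatch.OBS] + np.random.normal(
        scale=0.01, size=augmented[SampleBatch.OBS].shape
    )
    return sample_batch.concat(augmented)
```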
@nikhil-pitta Note that `JointQPolicy` inherits from the `Policy` class, which has a `postprocess_trajectory` method: https://github.com/ray-project/ray/blob/55fc0710d8472a9abaf244ed6567eb3b13136531/rllib/policy/policy.py#L361-L366. Directly overriding this function may help.
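A minimal sketch of that (the import path and the augmentation body are my guesses, not MARLlib's actual code; adjust the import to wherever MARLlib defines `JointQPolicy`):

```python
# Adjust this import to the module where MARLlib defines JointQPolicy.
from marllib.marl.algos.core.VD.iql_vdn_qmix import JointQPolicy


class AugmentedJointQPolicy(JointQPolicy):
    def postprocess_trajectory(self, sample_batch, other_agent_batches=None, episode=None):
        # Let the original postprocessing run first, then edit the batch
        # before it is added to the replay buffer.
        sample_batch = super().postprocess_trajectory(
            sample_batch, other_agent_batches, episode
        )
        # ... augment sample_batch here
        return sample_batch
```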
I remember joint Q learning supports `share_policy=all`; you can see the related logic here: https://github.com/Replicable-MARL/MARLlib/blob/368c6173577d0f9c0ad70fb5b4b6afa12c864c15/marllib/marl/algos/run_vd.py#L105-L118. Try to adapt the code under this setting.
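For example, something along these lines with MARLlib's high-level API (environment, algorithm, and hyperparameters are placeholders for your own setup):

```python
from marllib import marl

# Placeholder environment / algorithm choices; swap in your own.
env = marl.make_env(environment_name="mpe", map_name="simple_spread", force_coop=True)
vdn = marl.algos.vdn(hyperparam_source="mpe")
model = marl.build_model(env, vdn, {"core_arch": "mlp", "encode_layer": "128-128"})

# share_policy="all" makes every agent share one set of parameters.
vdn.fit(env, model, share_policy="all", stop={"timesteps_total": 1_000_000})
```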