Add Model-Based Meta-Policy-Optimization (MBMPO)

Open juhannc opened this issue 3 years ago • 4 comments

Introduction and description

Coming soon

Improvements in this PR

Coming soon

Proof of Work

Coming soon

Cheers,

Johann

juhannc avatar Oct 05 '22 08:10 juhannc

Hi @Toni-SM, for my work I have to implement MBMPO. But I wanted your opinion on something.

First, some words about MBMPO in case you are not familiar with it. As a model-based algorithm, MBMPO uses a learned model to train the policy and, as far as I understand, uses TRPO to train the (meta-)policy. However, the idea could probably be generalized to other policy-optimization algorithms. So I could either hard-code TRPO into the agent or pass another agent to MBMPO.

The latter would increase flexibility but would also require a somewhat different __init__ function, something like:

class MBMPO(Agent):
    def __init__(self,
        models: Dict[str, Model],
+       agent: Agent,
        memory: Optional[Union[Memory, Tuple[Memory]]] = None,
        observation_space: Optional[Union[int, Tuple[int], gym.Space]] = None,
        action_space: Optional[Union[int, Tuple[int], gym.Space]] = None,
        device: Union[str, torch.device] = "cuda:0",
        cfg: Optional[dict] = None) -> None:

What's your take on that?
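
To make the second option concrete, here is a minimal sketch of the delegation pattern, assuming a simplified stand-in `Agent` base class rather than skrl's actual one. `DummyPolicyAgent`, the `act`/`update` method names, and the `"dynamics"` model key are hypothetical and only illustrate how an inner agent could be injected.

```python
from typing import Any, Dict, List, Optional


class Agent:
    """Simplified stand-in for an agent base class (NOT skrl's actual Agent)."""

    def __init__(self, models: Dict[str, Any], cfg: Optional[dict] = None) -> None:
        self.models = models
        self.cfg = cfg if cfg is not None else {}

    def act(self, states: List[Any]) -> List[Any]:
        raise NotImplementedError

    def update(self, batch: Any) -> None:
        raise NotImplementedError


class DummyPolicyAgent(Agent):
    """Hypothetical inner agent; a real setup might pass a TRPO agent here."""

    def __init__(self) -> None:
        super().__init__(models={})
        self.updates = 0

    def act(self, states: List[Any]) -> List[Any]:
        # Return a constant action per state, just to have something to call.
        return [0 for _ in states]

    def update(self, batch: Any) -> None:
        # Count updates instead of doing real policy optimization.
        self.updates += 1


class MBMPO(Agent):
    """Wrapper that owns the learned model(s) and delegates policy work."""

    def __init__(self, models: Dict[str, Any], agent: Agent,
                 cfg: Optional[dict] = None) -> None:
        super().__init__(models, cfg)
        self.agent = agent  # injected policy-optimization agent

    def act(self, states: List[Any]) -> List[Any]:
        # Acting is delegated to the injected agent.
        return self.agent.act(states)

    def update(self, batch: Any) -> None:
        # Training of the learned dynamics model would go here (omitted).
        # The policy update itself is delegated to the injected agent.
        self.agent.update(batch)


inner = DummyPolicyAgent()
mbmpo = MBMPO(models={"dynamics": None}, agent=inner)
actions = mbmpo.act([1, 2, 3])
mbmpo.update(batch=None)
```

With this shape, swapping the policy algorithm only changes what is passed as `agent`; the cost is the extra constructor argument discussed in this thread.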

juhannc avatar Oct 05 '22 08:10 juhannc

The idea of generalizing to other agents looks good... It is best to avoid modifying the arguments of the agent constructors whenever possible. But there are cases where it is necessary, as in AMP or in the solution you propose for this agent.

Toni-SM avatar Oct 05 '22 08:10 Toni-SM

Thanks for the quick feedback, I will go that route then.

juhannc avatar Oct 05 '22 08:10 juhannc

By the way, the current TRPO implementation iterates through the learning epochs for both the policy and the value, when it should do so only for the value (the optimization of the policy should be excluded from this loop).
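
The loop structure described here can be sketched as follows. This is an illustration of the control flow only, not skrl's TRPO code; `trpo_update`, `policy_step`, and `value_step` are hypothetical names, and the two callbacks stand in for the real policy and value optimization steps.

```python
from typing import Callable


def trpo_update(policy_step: Callable[[], None],
                value_step: Callable[[], None],
                learning_epochs: int) -> None:
    """Run one policy step, then several value-function epochs."""
    # The policy is optimized once per update, outside the epoch loop.
    policy_step()

    # Only the value function is iterated over the learning epochs.
    for _ in range(learning_epochs):
        value_step()


# Usage: count how often each step runs.
calls = {"policy": 0, "value": 0}


def count_policy() -> None:
    calls["policy"] += 1


def count_value() -> None:
    calls["value"] += 1


trpo_update(count_policy, count_value, learning_epochs=5)
```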

Toni-SM avatar Oct 05 '22 08:10 Toni-SM