
Spec for metaworld environments

Open AiRuiChen opened this issue 3 years ago • 3 comments

Currently, metaworld environments inherit from gym.Env but don't have the spec attribute populated. This creates inconsistent behavior between metaworld envs and gym envs. It would be nice to populate this attribute and put max_episode_length inside it.
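For illustration, here is a sketch of the inconsistency (the Metaworld construction follows the README; the task name and the max_path_length attribute are assumptions and may differ by version):

```python
import random

import gym
import metaworld

# A registered gym environment carries a populated spec with the episode limit.
gym_env = gym.make("CartPole-v1")
print(gym_env.spec.max_episode_steps)  # 500

# A Metaworld environment (constructed per the README; task name assumed)
# also inherits from gym.Env, but its spec is left unset.
ml1 = metaworld.ML1("pick-place-v1")
mw_env = ml1.train_classes["pick-place-v1"]()
mw_env.set_task(random.choice(ml1.train_tasks))

print(getattr(mw_env, "spec", None))             # None
print(getattr(mw_env, "max_path_length", None))  # internal horizon, if present
```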

AiRuiChen avatar Aug 07 '20 22:08 AiRuiChen

Maybe building on this remark: it seems that all environments have a step() function that unconditionally returns None as the third value, where the Gym environment definition expects a "done" indicator. This produces inconvenient behavior when using RL frameworks like stable-baselines, which assume that behavior of the step function. Would be more than happy to help if necessary.

perezjln avatar Dec 27 '20 15:12 perezjln

@perezjln metaworld environments never produce a termination signal (i.e. done is always False). The RL algorithm library needs to implement its own max_episode_length for sampling, and take care not to step past the maximum allowed episode length for each environment (or else metaworld will raise a RuntimeError).
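In other words, the sampler owns the horizon. A minimal sketch of what that looks like, assuming a generic policy callable and a hypothetical max_path_length value (not Metaworld's own sampling code):

```python
def rollout(env, policy, max_path_length=150):
    """Collect one fixed-horizon episode without relying on the done flag."""
    obs = env.reset()
    path = []
    for _ in range(max_path_length):
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        # done is always False for Metaworld envs; the loop bound, not the
        # environment, decides when the episode ends.
        path.append((obs, action, reward, next_obs, done))
        obs = next_obs
    return path
```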

If your RL algorithm library is assuming that all environments are finite-horizon OR it is assuming that infinite-horizon environments will corrupt the termination (done) signal in order to implement fixed-horizon sampling, I suggest you file an issue with the library authors instead :). An MDP which never produces a termination signal is a perfectly-valid MDP, and there are many non-degenerate non-terminating MDPs.

Overloading the termination signal to enforce a maximum episode length (as is done in gym) corrupts the definition of an infinite-horizon MDP, and can cause off-policy algorithms to fail on these otherwise-solvable infinite-horizon problems.

How? Q-value estimation for a terminal step (done == True) usually needs to be computed differently than for a non-terminal step (i.e. off-policy algorithms usually use bootstrapping to compute the target Q-value of a non-terminal step). Off-policy algorithms frequently have to reach into private fields of gym environments to fix this bug, and we at metaworld would prefer to simply get it right the first time instead.
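Concretely, a DQN-style sketch of the one-step target (illustrative only, with assumed tensor shapes; not code from Metaworld or any particular library):

```python
import torch

def td_target(reward, next_obs, done, q_target_net, gamma=0.99):
    """One-step TD target; reward and done are tensors of shape (batch,).

    done must mean "the MDP actually terminated", not "the sampler hit its
    time limit". If a time-limit truncation is reported as done=True, the
    bootstrap term below is wrongly zeroed out, and Q-values for
    late-episode states get systematically underestimated.
    """
    with torch.no_grad():
        next_q = q_target_net(next_obs).max(dim=-1).values  # max over actions
    # Bootstrap only through transitions that did not truly terminate.
    return reward + gamma * (1.0 - done.float()) * next_q
```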

Are you sure that the third return value is None rather than False? The done signal should be a boolean. If you are encountering None, that sounds like a bug; a minimal reproduction would help us track it down.

ryanjulian avatar Dec 30 '20 00:12 ryanjulian

Dear @ryanjulian, thanks very much for your detailed answer.

Maybe as a middle ground, the decision to treat an environment as finite-horizon or infinite-horizon could be left to the discretion of the library's end user. What do you think?
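For example, something like gym's TimeLimit wrapper already lets the end user impose an episodic horizon on their side (a sketch; the task name and horizon value are assumptions):

```python
import random

import metaworld
from gym.wrappers import TimeLimit

# Construct a Metaworld env as in the README (task name is an assumption).
ml1 = metaworld.ML1("pick-place-v1")
mw_env = ml1.train_classes["pick-place-v1"]()
mw_env.set_task(random.choice(ml1.train_tasks))

# Let the user choose the horizon: the wrapper reports done=True after
# max_episode_steps steps, without Metaworld itself overloading the flag.
episodic_env = TimeLimit(mw_env, max_episode_steps=150)

obs = episodic_env.reset()
done = False
while not done:
    obs, reward, done, info = episodic_env.step(episodic_env.action_space.sample())
```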

Maybe we will just fork and propose an update in that sense.

Regarding your point on Q-learning, which I fully share, perhaps the answer is that this is more a problem with current off-policy learning algorithms than anything else :-) Furthermore, in practice, papers mainly seem to deal with finite-horizon tasks anyway, though I fully agree that infinite-horizon MDPs do exist and are worth considering.

Finally, "we at meta world would prefer to simply get it right the first time instead", once again assuming a contact-rich manipulation task can not be episodic seems more like a choice than the unilateral right thing to do, but I'm maybe wrong :-)

Once again, thanks for all your hard work with this project!

perezjln avatar Dec 30 '20 16:12 perezjln

As Meta-World ownership has been transferred from the RLWorkGroup to Farama Foundation, I am closing this issue. If there are any questions or requests for features, please join our Discord.

reginald-mclean avatar Feb 01 '23 15:02 reginald-mclean