Metaworld
Metaworld copied to clipboard
Why it does not success even when the score is high?
I tested some environment with different task settings. For example on reach-v2, I think it is strange that it does not success even when the score is 4700+. I tried to add an extra score when it success, but it didn't work. Is there any tricks to slove this problem?
Hmm reach is a pretty easy task. Are you computing your success metric correctly. There is a guide on how to do so in the paper.
Hmm reach is a pretty easy task. Are you computing your success metric correctly. There is a guide on how to do so in the paper.
Yes, reach-v2 is very easy to optimize. The score rises up fast under evolutionary algorithm. But it seems that the high score does not necessarily lead to final success.
I have similar issue. I only tested on the MT10 tasks and train a policy on each of the tasks. I run evaluation every 160k steps in the environment where each evaluation I run 50 episodes and compute the success percentage from those 50 episodes. Higher reward does not always positively correlate with higher success rate.
reach-v2
window-open-v2
(red) and window-close-v2
(green)
drawer-open-v2
(orange) and drawer-close-v2
(blue)
I trained using stable_baselines3.SAC
.
To reproduce, here is the code
from typing import Any, Dict, List, Tuple
import gym
import metaworld
import numpy as np
import os.path
import random
from gym.wrappers import TimeLimit
from metaworld.envs.mujoco.sawyer_xyz.sawyer_xyz_env import SawyerXYZEnv
from seals.util import AbsorbAfterDoneWrapper
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.utils import get_latest_run_id
class ChangingTaskAndDoneTerminalWrapper(gym.Wrapper):
"""
Option to change task every time the environment is reset
and `done` returns `True` when `info["success"]=True`.
`done` signal is needed to reset the environment in the sb3.
Sb3 can handle `done` signal due to terminal states or time limit.
"""
def __init__(self,
env: SawyerXYZEnv,
tasks: List[metaworld.Task],
change_task: bool = False) -> None:
"""
Constructor.
:param env: Environment.
:param tasks: List of metaworld task objects.
:param change_task: Whether to change task every reset.
"""
super().__init__(env)
self.tasks = tasks
self.change_task = change_task
self.env.set_task(self.tasks[0])
def reset(self, **kwargs) -> Any:
"""
Reset the environment.
"""
if self.change_task:
task = random.choice(self.tasks)
self.env.set_task(task)
return super().reset(**kwargs)
def step(self, action: np.ndarray) -> Tuple[np.ndarray, float, bool, Dict[str, Any]]:
"""
Step in the environment.
"""
obs, rew, _, info = super().step(action)
## `sb3.EvalCallback` look at "is_success" to log success rate
info["is_success"] = info["success"]
return obs, rew, info["success"], info
def create_single_env(
env_name: str,
changing_inner_task: bool,
handle_absorb_state: bool = False) -> gym.Env:
"""
Create single gym environment.
Note that we use `seals.util.AbsorbAfterDoneWrapper` and it should be before `gym.wrappers.TimeLimit`
:param env_name: Name of the environment. Only `metaworld.MT10` tasks!
:param changing_inner_task: Whether to change inner task every reset.
:param handle_absorb_state: Whether to use `seals.util.AbsorbAfterDoneWrapper` or not.
Return:
Single gym environment.
"""
mt1 = metaworld.MT1(env_name)
env = mt1.train_classes[env_name]()
env = ChangingTaskAndDoneTerminalWrapper(env, mt1.train_tasks, changing_inner_task)
if handle_absorb_state: env = AbsorbAfterDoneWrapper(env)
env = TimeLimit(env, env.max_path_length)
return env
"""
Main function.
"""
env_name = "reach-v2"
changing_inner_task = True
seed = 0
device = "cuda:0"
## set log directory
tb_logdir = "logs/%s" % env_name
tb_logname = "sac"
if changing_inner_task: tb_logdir = tb_logdir + "/changing_goal"
else: tb_logdir = tb_logdir + "/static_goal"
latest_run_id = get_latest_run_id(tb_logdir, tb_logname)
## create env
vecenv = make_vec_env(
env_id = lambda: create_single_env(env_name, changing_inner_task),
n_envs = 8,
seed = seed
)
eval_vecenv = make_vec_env(
env_id = lambda: create_single_env(env_name, changing_inner_task),
n_envs = 10,
seed = seed + 1
)
## create algo
algo = SAC(
policy = "MlpPolicy",
env = vecenv,
learning_rate = 0.0003,
buffer_size = int(1e6),
learning_starts = int(1e3),
batch_size = 512,
tau = 0.005,
gamma = 0.99,
train_freq = 512, ## train every `train_freq * vecenv.num_envs`
gradient_steps = 128,
policy_kwargs = {"net_arch": {"pi": [256, 256], "qf": [256, 256]}},
tensorboard_log = tb_logdir,
device = device,
seed = seed,
verbose = 1
)
## create callback
eval_callback = EvalCallback(
eval_env = eval_vecenv,
n_eval_episodes = 50,
eval_freq = 20000, ## eval every `eval_freq * vecenv.num_envs`
best_model_save_path = os.path.join(tb_logdir, "%s_%d" % (tb_logname, latest_run_id + 1)),
log_path = os.path.join(tb_logdir, "%s_%d" % (tb_logname, latest_run_id + 1)),
)
algo.learn(
total_timesteps = int(1e8),
callback = eval_callback,
tb_log_name = tb_logname
)
Same issue. even in fixed goal environment.
I found the culprit.
- The
metaworld.Benchmark
object should only be created once. Recreating the benchmark object will sample a new tasks (ie. start/object/goal positions). In my code, I call thecreate_single_env
multiple times hence themetaworld.MT1
is created multiple times. It will mess up your evaluation. - Try to not stop the algorithm even though it successfully reach the goal ie. dont stop when
info["success"] == True
. Somehow it destabilizes training.Orange: I let the environment runs until 500 timesteps even though it successfully reaches the goal in the middle of the episode. Blue: I stop it when
info["success"] == True
.
Yeah, every time you create a benchmark object it samples 50 new goal locations. That's intended behavior, so I'm going to close this. I have added another bug for the button-press environments, which do genuinely have this problem.
Also worth mentioning is that the training instability when success is achieved is know, which is one reason why success states are not terminal states. Adding a wrapper to make them terminal states is not recommended.
Maybe this could be an additional insight, I guess.
I just trained single task with termination when it successes(info["success"]==True) using SAC, I found the same situation. By rendering, it shows that the agent stops right before goal and just remaining still.
This may be because it is obviously the best way to get maximum 'return', gaining more rewards(between 0 and 10) until max_path_length rather than just terminating with getting single reward 10.
@reginald-mclean is there any point to modifying the reward function such that this behaviour is not optimal
Yeah there could be some value there. I would just be concerned about what that reward is to avoid this behaviour, tuning that reward function would mean giving a very large value to the agent when success=True and that large value could destabilize value function training.
One thing to note is that Meta-World intentionally does not terminate the episode on success, to avoid this problem. You should always use length 500 episodes, without any terminal states. Formally, every meta-world task is an infinite horizon MDP, from which a finite 500 state sequence is sampled for each episode.
To reliably avoid this problem, the reward on termination would need to be two orders of magnitude larger than at any other timestep (technically, 5000 vs 10), which would destabilize most value function training.