Why does it not succeed even when the score is high?


I tested some environments with different task settings. For example, on reach-v2, I think it is strange that it does not succeed even when the score is 4700+. I tried adding an extra score when it succeeds, but it didn't work. Are there any tricks to solve this problem?

alexxchen (Aug 14 '22 06:08)

Hmm, reach is a pretty easy task. Are you computing your success metric correctly? There is a guide on how to do so in the paper.

avnishn (Aug 24 '22 18:08)
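
For context, here is a minimal sketch of how the success metric is typically computed from the per-step info dict. This is an illustration rather than the exact protocol from the paper, and evaluate_success_rate and the random policy are hypothetical names: an episode counts as a success if info["success"] is ever reported during its 500 steps, regardless of how much reward was accumulated.

import random

import metaworld


def evaluate_success_rate(env, policy, n_episodes=50):
    """Fraction of episodes in which info["success"] was ever reported."""
    successes = 0
    for _ in range(n_episodes):
        obs = env.reset()
        solved = False
        for _ in range(env.max_path_length):              ## 500 steps per episode
            obs, rew, done, info = env.step(policy(obs))  ## old gym 4-tuple API, as in the code below
            solved = solved or bool(info["success"])
            if done:
                break
        successes += int(solved)
    return successes / n_episodes


## Usage sketch: evaluate a random policy on reach-v2.
mt1 = metaworld.MT1("reach-v2")
env = mt1.train_classes["reach-v2"]()
env.set_task(random.choice(mt1.train_tasks))
print(evaluate_success_rate(env, lambda obs: env.action_space.sample()))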

> Hmm, reach is a pretty easy task. Are you computing your success metric correctly? There is a guide on how to do so in the paper.

Yes, reach-v2 is very easy to optimize. The score rises quickly under an evolutionary algorithm, but it seems that a high score does not necessarily lead to final success.

alexxchen (Aug 30 '22 06:08)

I have a similar issue. I only tested on the MT10 tasks and trained a separate policy on each task. I run evaluation every 160k environment steps; each evaluation runs 50 episodes and computes the success percentage over those 50 episodes. Higher reward does not always correlate positively with a higher success rate.

reach-v2 [image]

window-open-v2 (red) and window-close-v2 (green) [image]

drawer-open-v2 (orange) and drawer-close-v2 (blue) [image]

I trained using stable_baselines3.SAC.

gunnxx (Nov 09 '22 11:11)

To reproduce, here is the code:

from typing import Any, Dict, List, Tuple

import gym
import metaworld
import numpy as np
import os.path
import random

from gym.wrappers import TimeLimit
from metaworld.envs.mujoco.sawyer_xyz.sawyer_xyz_env import SawyerXYZEnv
from seals.util import AbsorbAfterDoneWrapper
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.utils import get_latest_run_id


class ChangingTaskAndDoneTerminalWrapper(gym.Wrapper):
    """
    Optionally changes the task every time the environment is reset,
    and returns `done = True` when `info["success"] == True`.

    The `done` signal is needed to reset the environment in sb3.
    Sb3 can handle `done` signals caused by terminal states or by the time limit.
    """

    def __init__(self,
        env: SawyerXYZEnv,
        tasks: List[metaworld.Task],
        change_task: bool = False) -> None:
        """
        Constructor.

        :param env: Environment.
        :param tasks: List of metaworld task objects.
        :param change_task: Whether to change task every reset.
        """
        super().__init__(env)
        self.tasks = tasks
        self.change_task = change_task
        self.env.set_task(self.tasks[0])
    

    def reset(self, **kwargs) -> Any:
        """
        Reset the environment.
        """
        if self.change_task:
            task = random.choice(self.tasks)
            self.env.set_task(task)

        return super().reset(**kwargs)
    

    def step(self, action: np.ndarray) -> Tuple[np.ndarray, float, bool, Dict[str, Any]]:
        """
        Step in the environment.
        """
        obs, rew, _, info = super().step(action)
        ## sb3's `EvalCallback` looks at info["is_success"] to log the success rate
        info["is_success"] = info["success"]
        return obs, rew, info["success"], info


def create_single_env(
    env_name: str,
    changing_inner_task: bool,
    handle_absorb_state: bool = False) -> gym.Env:
    """
    Create single gym environment.
    Note that we use `seals.util.AbsorbAfterDoneWrapper`, and it should be applied before `gym.wrappers.TimeLimit`.

    :param env_name: Name of the environment. Only `metaworld.MT10` tasks!
    :param changing_inner_task: Whether to change inner task every reset.
    :param handle_absorb_state: Whether to use `seals.util.AbsorbAfterDoneWrapper` or not.

    Return:
        Single gym environment.
    """
    mt1 = metaworld.MT1(env_name)
    env = mt1.train_classes[env_name]()
    env = ChangingTaskAndDoneTerminalWrapper(env, mt1.train_tasks, changing_inner_task)
    if handle_absorb_state: env = AbsorbAfterDoneWrapper(env)
    env = TimeLimit(env, env.max_path_length)
    return env


"""
Main script.
"""

env_name = "reach-v2"
changing_inner_task = True
seed = 0
device = "cuda:0"

## set log directory
tb_logdir = "logs/%s" % env_name
tb_logname = "sac"

if changing_inner_task: tb_logdir = tb_logdir + "/changing_goal"
else: tb_logdir = tb_logdir + "/static_goal"

latest_run_id = get_latest_run_id(tb_logdir, tb_logname)

## create env
vecenv = make_vec_env(
    env_id  = lambda: create_single_env(env_name, changing_inner_task),
    n_envs  = 8,
    seed    = seed
)

eval_vecenv = make_vec_env(
    env_id  = lambda: create_single_env(env_name, changing_inner_task),
    n_envs  = 10,
    seed    = seed + 1
)

## create algo
algo = SAC(
    policy          = "MlpPolicy",
    env             = vecenv,
    learning_rate   = 0.0003,
    buffer_size     = int(1e6),
    learning_starts = int(1e3),
    batch_size      = 512,
    tau             = 0.005,
    gamma           = 0.99,
    train_freq      = 512,              ## train every `train_freq * vecenv.num_envs`
    gradient_steps  = 128,
    policy_kwargs   = {"net_arch": {"pi": [256, 256], "qf": [256, 256]}},
    tensorboard_log = tb_logdir,
    device          = device,
    seed            = seed,
    verbose         = 1
)

## create callback
eval_callback = EvalCallback(
    eval_env                = eval_vecenv,
    n_eval_episodes         = 50,
    eval_freq               = 20000,    ## eval every `eval_freq * vecenv.num_envs`
    best_model_save_path    = os.path.join(tb_logdir, "%s_%d" % (tb_logname, latest_run_id + 1)),
    log_path                = os.path.join(tb_logdir, "%s_%d" % (tb_logname, latest_run_id + 1)),
)

algo.learn(
    total_timesteps = int(1e8),
    callback        = eval_callback,
    tb_log_name     = tb_logname
)

gunnxx (Nov 10 '22 05:11)

Same issue, even in a fixed-goal environment.

seolhokim (Nov 11 '22 07:11)

I found the culprit.

  1. The metaworld.Benchmark object should only be created once. Recreating the benchmark object samples new tasks (i.e. new start/object/goal positions). In my code, I call create_single_env multiple times, so metaworld.MT1 is created multiple times, and that messes up the evaluation (a concrete sketch follows this comment).
  2. Try not to end the episode even when the agent successfully reaches the goal, i.e. don't return done=True when info["success"] == True. Somehow terminating on success destabilizes training. [image] Orange: I let the environment run for the full 500 timesteps even when it reaches the goal in the middle of the episode. Blue: I stop it when info["success"] == True.

gunnxx (Nov 28 '22 13:11)
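
To make fixes (1) and (2) concrete, here is a rough sketch under the same setup as the script above. The names MT1_BENCH, SuccessInfoWrapper and build_env are illustrative, not from metaworld or sb3: the metaworld.MT1 object is created once at module level and reused by every environment the factory builds, and success is only logged for EvalCallback instead of being turned into a terminal done.

import random

import gym
import metaworld
from gym.wrappers import TimeLimit

MT1_BENCH = metaworld.MT1("reach-v2")   ## create the benchmark object ONCE and reuse it everywhere


class SuccessInfoWrapper(gym.Wrapper):
    """Expose info["is_success"] for sb3's EvalCallback without terminating on success."""

    def step(self, action):
        obs, rew, done, info = super().step(action)
        info["is_success"] = info["success"]   ## logged only; `done` is left untouched
        return obs, rew, done, info


def build_env() -> gym.Env:
    env = MT1_BENCH.train_classes["reach-v2"]()
    env.set_task(random.choice(MT1_BENCH.train_tasks))   ## tasks come from the single shared benchmark
    env = SuccessInfoWrapper(env)
    return TimeLimit(env, env.max_path_length)           ## episodes end only at the 500-step time limit

If the task should still change on every reset, that would go in a reset() override of the wrapper that calls self.env.set_task(random.choice(MT1_BENCH.train_tasks)) before super().reset(), still drawing tasks from the single shared benchmark object.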

Yeah, every time you create a benchmark object it samples 50 new goal locations. That's intended behavior, so I'm going to close this. I have filed a separate bug for the button-press environments, which genuinely do have this problem.

krzentner (Jan 26 '23 19:01)

Also worth mentioning is that the training instability when success is achieved is known, which is one reason why success states are not terminal states. Adding a wrapper to make them terminal states is not recommended.

krzentner (Jan 26 '23 20:01)

Maybe this could be an additional insight, I guess.

I just trained a single task using SAC with termination on success (info["success"] == True), and I found the same situation. Rendering shows that the agent stops right before the goal and just remains still.

This may be because it is obviously the best way to maximize the return: keep collecting rewards (between 0 and 10) until max_path_length rather than terminating and receiving a single reward of 10.

goldbird5 (Jul 27 '23 02:07)

@reginald-mclean is there any point to modifying the reward function such that this behaviour is not optimal?

pseudo-rnd-thoughts (Jul 29 '23 20:07)

Yeah, there could be some value there. I would just be concerned about what that reward would need to be to avoid this behaviour: tuning the reward function would mean giving a very large value to the agent when success=True, and that large value could destabilize value function training.

reginald-mclean (Jul 29 '23 21:07)

One thing to note is that Meta-World intentionally does not terminate the episode on success, precisely to avoid this problem. You should always use length-500 episodes, without any terminal states. Formally, every Meta-World task is an infinite-horizon MDP, from which a finite 500-step sequence is sampled for each episode.

To reliably avoid this problem, the reward on termination would need to be two orders of magnitude larger than the reward at any other timestep (technically about 5000 vs 10, since the agent can otherwise keep collecting up to 10 reward per step for the full 500 steps), which would destabilize most value function training.

krzentner (Jul 31 '23 23:07)
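
As a rough back-of-the-envelope illustration of that 5000-vs-10 comparison (the numbers below follow the bounds mentioned above and are illustrative only):

## Why a one-time terminal bonus would have to be enormous to make stopping at success optimal.
max_step_reward = 10      ## Meta-World per-step rewards are roughly bounded by 10
horizon = 500             ## max_path_length

hover_return = max_step_reward * horizon   ## keep collecting reward while parked next to the goal
print(hover_return)                        ## 5000

## A bonus paid once on termination would need to exceed roughly this 5000, not just the
## per-step 10, before ending the episode at success becomes the return-maximizing choice.
print(hover_return > 10)                   ## True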