imitation
Infinite-Horizon Environments not Supported
Bug description
There are two common ways to deal with variable-horizon environments: 1) turn them into fixed-length environments, or 2) turn them into infinite-horizon environments.
In the original DRLHP paper, they turn Atari environments into infinite-horizon environments by removing all episode boundaries. I think that in many cases this makes more sense, especially when there is long-term strategic behavior. For example, as the agent becomes more skilled at an Atari game, the true episode length may grow quite long, and drawing an arbitrary episode boundary in the middle of play will cut early actions off from being linked with the rewards that come after the boundary.
Currently, the `imitation.data.rollout.TrajectoryAccumulator.add_steps_and_auto_finish` method only returns new trajectories when a `done` signal is received. Furthermore, `rollout.generate_trajectories` has a `while` loop that waits for sufficient samples to be collected and for current episodes to finish. This means that `rollout.generate_trajectories` (and the rest of the codebase) hangs forever when trying to collect trajectories from infinite-horizon environments.
To add this functionality, one idea is to have an `infinite_horizon_env` flag in each algorithm, in `rollout.generate_trajectories`, and in `TrajectoryAccumulator` that makes them return partial trajectories once those trajectories satisfy a length requirement.
If we decide this isn't super important, and that very long fixed-length episodes are sufficient, I think we should at least make a note in the documentation. I am OK with that solution as well.
Steps to reproduce
```python
import gym
import stable_baselines3 as sb3
import seals

from imitation.data import rollout

# Remove episode boundaries by auto-resetting the underlying game on done.
make_env = lambda: seals.util.AutoResetWrapper(gym.make("ALE/Qbert-v5"))
venv = sb3.common.vec_env.DummyVecEnv([make_env])
obs = venv.reset()

trajectories_accum = rollout.TrajectoryAccumulator()
# Seed the accumulator with the initial observation for env index 0.
trajectories_accum.add_step(dict(obs=obs[0]), 0)

trajs = []
for _ in range(10000):
    acts = [venv.action_space.sample()]
    obs, rews, dones, infos = venv.step(acts)
    new_trajs = trajectories_accum.add_steps_and_auto_finish(acts, obs, rews, dones, infos)
    trajs.extend(new_trajs)
```
At the end of this code, we still have `trajs == []`.
Thanks for opening the ticket! I don't think we ever said we supported infinite-horizon environments, so I don't view lack of support as a bug. I agree the limitation should be documented better, and that it'd be nice to support this.
It's definitely doable to change our rollout code to return after a sufficient number of timesteps is reached. We basically just need to have `generate_trajectories` check the number of timesteps (including those in unfinished trajectories) and call `finish_trajectory` early once that threshold is crossed. I think that'd be a <5 line code change.
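To make the idea concrete, here is a rough user-side sketch built on the reproduction snippet above: collect steps up to a timestep cap, then flush the unfinished episode from the accumulator. Treat it as pseudocode for the approach rather than the actual change to `generate_trajectories`; in particular, the `terminal=False` keyword on `finish_trajectory` is an assumption about the current API and may differ between `imitation` versions.

```python
# Sketch only: cap collection at a fixed number of timesteps, then flush
# the partial episode for env index 0.
MAX_TIMESTEPS = 1_000
trajs = []
for _ in range(MAX_TIMESTEPS):
    acts = [venv.action_space.sample()]
    obs, rews, dones, infos = venv.step(acts)
    trajs.extend(
        trajectories_accum.add_steps_and_auto_finish(acts, obs, rews, dones, infos)
    )

# Force out whatever steps remain in the accumulator as a partial trajectory.
# (The `terminal` keyword may not exist in all versions of imitation.)
trajs.append(trajectories_accum.finish_trajectory(0, terminal=False))
```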
However, I'm not sure this change in isolation solves much. In particular, most of our reward-learning algorithms (DRLHP, AIRL, GAIL) use reinforcement learning on the inside. And how do you make RL work in an infinite-horizon environment? Theoretically, an infinite-horizon MDP is still a well-defined problem, but it might cause problems for most algorithms. Certainly the library we use, Stable-Baselines3, was written with finite-horizon environments in mind. I can say for sure that a bunch of the logging and evaluation (which is based on episode reward) would break. Some of the algorithms might still learn, but I wouldn't count on it.
So I think I'd want to see either a use case for this which is OK with RL not working (maybe rollouts to do something like BC?), or a proposal for how to make this play nicely with RL.
Thanks for the response! I'm relatively new to RL and didn't know that infinite horizon environments could cause issues with learning and with the standard logger and eval setup. My reasoning for opening the issue was to enable us to reproduce the setup in the original RLHF paper.
If a more concrete use-case than this comes up, I will mention it. But in light of what you said, maybe it's best to leave things as they are.
A quick look at https://github.com/nottombrown/rl-teacher/blob/master/agents/parallel-trpo/parallel_trpo/rollouts.py#L87 suggests they just cap rollouts at a certain # of timesteps and effectively treat that as an episode. I've not dug into all the code paths though so might be misunderstanding it.
Assuming my interpretation is right, you could probably do something janky to replicate this, like a TimeLimit wrapper on top of AutoReset, with an additional wrapper in between that discards `reset()` calls. That way you send the `done` signal after a fixed time period to end the rollout, but you don't end up resetting the underlying environment, except at loss-of-life (which doesn't propagate to the RL agent).
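A minimal sketch of that wrapper stack, assuming the old Gym step API (4-tuple returns); `IgnoreResetWrapper` is a hypothetical name, not something provided by imitation or seals:

```python
import gym
import seals


class IgnoreResetWrapper(gym.Wrapper):
    """Hypothetical wrapper: swallows reset() calls after the first one,
    so the auto-resetting environment underneath keeps playing."""

    def __init__(self, env):
        super().__init__(env)
        self._last_obs = None

    def reset(self, **kwargs):
        if self._last_obs is None:
            self._last_obs = self.env.reset(**kwargs)
        return self._last_obs

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        self._last_obs = obs
        return obs, rew, done, info


def make_env():
    # AutoResetWrapper removes episode boundaries; TimeLimit emits `done`
    # every max_episode_steps; IgnoreResetWrapper stops the subsequent
    # reset() from restarting the underlying game.
    env = seals.util.AutoResetWrapper(gym.make("ALE/Qbert-v5"))
    env = IgnoreResetWrapper(env)
    return gym.wrappers.TimeLimit(env, max_episode_steps=5_000)
```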
Personally, I favor just turning everything into fixed-length episodes, since that seems easier to reason about; but for the purpose of replication the above would work, and I'll admit I don't have a principled reason to favor one over the other.
Yes, the rl-teacher repo is in line with the RLHF paper in that they explicitly turn Mujoco environments into fixed-length episodes. As far as I can tell, the original paper only uses infinite-horizon environments on Atari (presumably because there might be more long-term planning behavior than in Mujoco envs).
But I understand your reasoning. Also, I implemented your suggestion in my code and it's working well! Once I finish replicating the original results, I'd be happy to add them to the benchmarking suite and/or the examples.
Sure, do let me know how you get on.
Otherwise, I agree we should document that we only support finite-horizon environments -- I'll assign this to someone.
Closing as documented in #603