rlpyt
Handling Early Resets in Procgen Envs
I've been trying to set up a multi-worker Rainbow DQN baseline for procgen, similar to what's described in Leveraging Procedural Generation to Benchmark Reinforcement Learning. This is roughly how I'm handling the setup (based on example 7):
import gym
import numpy as np

from rlpyt.agents.dqn.catdqn_agent import CatDqnAgent
from rlpyt.algos.dqn.cat_dqn import CategoricalDQN
from rlpyt.envs.gym import GymEnvWrapper
from rlpyt.models.dqn.atari_catdqn_model import AtariCatDqnModel
from rlpyt.runners.sync_rl import SyncRlEval
from rlpyt.samplers.collections import TrajInfo
from rlpyt.samplers.parallel.gpu.collectors import GpuResetCollector
from rlpyt.samplers.parallel.gpu.sampler import GpuSampler
from rlpyt.utils.launching.affinity import make_affinity
from rlpyt.utils.logging.context import logger_context


def make_env(game, num_levels, distribution_mode):

    class RlpytProcgenWrapper(gym.Wrapper):
        """
        Handle issues with procgen seeding and image axis order
        """
        def step(self, *args):
            o, r, d, i = self.env.step(*args)
            return np.transpose(o, (2, 0, 1)), r, d, i  # HWC -> CHW

        def reset(self):
            return np.transpose(self.env.reset(), (2, 0, 1))

        def seed(self, seed):
            return  # Ignore seed calls; procgen seeding is set at construction.

    env = gym.make(f"procgen:procgen-{game}-v0", num_levels=num_levels,
                   distribution_mode=distribution_mode)
    env = RlpytProcgenWrapper(env)
    return env


args = parse_args()
affinity = make_affinity(
    run_slot=args.run_slot,
    n_gpu=args.n_gpu,
    n_cpu_core=args.n_cpu_core,
    gpu_per_run=args.gpu_per_run,
)
sampler = GpuSampler(
    EnvCls=GymEnvWrapper,
    env_kwargs=dict(env=make_env(args.game, args.num_levels, args.distribution_mode)),
    TrajInfoCls=TrajInfo,
    batch_T=args.batch_T,
    batch_B=args.batch_B,
    CollectorCls=GpuResetCollector,
    max_decorrelation_steps=args.max_decorrelation_steps,
    eval_n_envs=10,
    eval_env_kwargs=dict(env=make_env(args.game, args.eval_num_levels, args.eval_distribution_mode)),
    eval_max_steps=int(10e5),
    eval_max_trajectories=20,
)
algo = CategoricalDQN(...)
num_actions = make_env(args.game, args.num_levels, args.distribution_mode).action_space.n
agent = CatDqnAgent(
    n_atoms=args.n_atoms,
    eps_final=args.eps_final,
    ModelCls=AtariCatDqnModel,
    model_kwargs={'image_shape': (3, 64, 64), 'output_size': num_actions, 'dueling': True},
)
runner = SyncRlEval(
    algo=algo,
    agent=agent,
    sampler=sampler,
    n_steps=args.n_steps,
    log_interval_steps=1e4,
    affinity=affinity,
)
config = vars(args)
name = f"rainbow_{args.game}_{args.distribution_mode}"
log_dir = f"{args.game}"
with logger_context(log_dir, args.run_ID, name, config, snapshot_mode="last",
                    override_prefix=True, use_summary_writer=True):
    runner.train()
Everything seems to run fine, but I get a lot of "Warning: Early Reset Ignored" messages from the procgen env, because procgen envs don't normally allow resets before the trajectory is finished. What is the best way to handle that with rlpyt? I've tried using different samplers and Gpu/CpuWaitResetCollector, but no luck.
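For reference, switching collectors was just the CollectorCls argument in the GpuSampler call above; if I understand it, the wait-reset variant defers env resets to the end of each sampling batch:

from rlpyt.samplers.parallel.gpu.collectors import GpuWaitResetCollector

# Identical GpuSampler(...) call as above, except:
#     CollectorCls=GpuWaitResetCollector,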
OK interesting... the environment should only be reset when the done signal comes out True. Does this happen for a procgen env before the trajectory is finished?
Possibly related, the Atari env has some logic to do with episodic lives, where for RL we want to consider the episode done, but we don't reset the environment. First, the AtariEnv can output done=True but also env_info(traj_done=False):
https://github.com/astooke/rlpyt/blob/85d4e018a919118c6e42fac3e897aa346d84b9c5/rlpyt/envs/atari/atari_env.py#L127-L129
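In other words, the env's step() can mark the RL episode as done while telling the sampler not to reset the environment, by putting a traj_done entry in the info (rlpyt's GymEnvWrapper turns the gym info dict into the env_info namedtuple). A minimal sketch of that pattern as a gym wrapper; the lost_life condition is a hypothetical stand-in, not AtariEnv's actual code:

import gym

class EpisodicDoneWrapper(gym.Wrapper):
    """Sketch: signal done=True for the RL episode while keeping
    traj_done=False so the collector does not call env.reset()."""

    def step(self, action):
        o, r, d, info = self.env.step(action)
        # traj_done mirrors the true trajectory boundary.
        info["traj_done"] = d
        # Hypothetical intra-trajectory boundary (e.g. a lost life).
        if info.get("lost_life", False) and not d:
            d = True  # End the RL episode; the env itself keeps running.
        return o, r, d, info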
Second, inside the collector, the environment doesn't get reset if done=True but env_info["traj_done"] is found and is False (it defaults to the value of done if traj_done is not found). If done=True, the agent still gets reset regardless, in case it is recurrent, based on RL episodes:
https://github.com/astooke/rlpyt/blob/85d4e018a919118c6e42fac3e897aa346d84b9c5/rlpyt/samplers/parallel/cpu/collectors.py#L45-L50
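Paraphrasing those linked lines, the per-environment decision in the collector's stepping loop is roughly this (simplified sketch, not the exact rlpyt source):

# For environment index b, after env.step() returns (o, r, d, env_info):
traj_done = getattr(env_info, "traj_done", d)  # falls back to done
if traj_done:
    # True trajectory boundary: finish logging and reset the environment.
    completed_infos.append(traj_infos[b].terminate(o))
    traj_infos[b] = self.TrajInfoCls()
    o = env.reset()
if d:
    # RL-episode boundary: reset agent state (e.g. recurrent hidden state)
    # even if the environment itself was not reset.
    self.agent.reset_one(idx=b)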
Of course you can change this logic any way needed for procgen.
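For example, since procgen is the one complaining about resets arriving mid-trajectory, one possible (untested) adjustment is a small wrapper that only forwards reset() when the previous step actually ended the episode and otherwise returns the last observation; whether that plays well with procgen's own episode handling would need checking:

import gym


class GuardedResetWrapper(gym.Wrapper):
    """Untested sketch: swallow reset() calls that arrive mid-episode,
    forwarding them only after the underlying env has reported done."""

    def __init__(self, env):
        super().__init__(env)
        self._needs_reset = True  # First reset always goes through.
        self._last_obs = None

    def step(self, action):
        o, r, d, info = self.env.step(action)
        self._last_obs = o
        self._needs_reset = d
        return o, r, d, info

    def reset(self, **kwargs):
        if self._needs_reset or self._last_obs is None:
            self._last_obs = self.env.reset(**kwargs)
            self._needs_reset = False
        return self._last_obs

If you go that route, it would sit underneath the RlpytProcgenWrapper above, so the transposed observations still come out on top.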
Let us know if any of this helps, and where the early resets are coming from?