rlpyt
Can custom environments return different namedtuples at different steps for the last element of the tuple returned by Env.step?
Hi Alex,
Question is in the title. Basically, I have an environment where I need to record a variable number of additional pieces of information about the environment, and on most steps, I want to record nothing.
The most convenient way to do this seems to be to return None when I don't want to record any additional info, and to return a namedtuple containing only the fields I want to record at other steps.
Backing up, I think my main problem is that I'm having a lot of trouble figuring out where and when this info is logged.
At a higher level, the problem is that the infrastructure you built for logging additional information about a trajectory doesn't align with my environment particularly well. The reason is that there's a variable amount of stuff that can occur between queries for agent actions. For example, the opponent may take many turns or none.
Ideally, I would be able to store this info in the environment as it's generated and write it to a file when the environment process is closed. Do you have any suggestions for how to implement something like this?
Hi, good questions! Unfortunately, yes, the env_info infrastructure is not well-suited to your case right now. The reason is that all the memory for the parallel sampler is pre-allocated and keeps a fixed shape for every time-step of the environment.
A few options:
1) Use something like the EnvInfoWrapper around your environment to make it output the same keys at every step (maybe you don't need all the machinery at the link and can just hard-code it): https://github.com/astooke/rlpyt/blob/a0f1c3045eac1b12d6305b35200139f9ee2a63cd/rlpyt/envs/gym.py#L131
You basically set a pre-defined default value which matches the dtype and shape of the value when it appears.
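A minimal sketch of the hard-coded version of this idea, not the EnvInfoWrapper code at the link. The field names (opponent_turns, last_offer), their defaults, and the helper fill_env_info are all hypothetical, and plain Python scalars stand in for whatever arrays your environment would really emit:

```python
from collections import namedtuple

# Hypothetical fields. Each default has the same type the real value has
# when it appears, so every step yields an identically shaped env_info.
ENV_INFO_DEFAULTS = {"opponent_turns": 0, "last_offer": float("nan")}
EnvInfo = namedtuple("EnvInfo", list(ENV_INFO_DEFAULTS))


def fill_env_info(partial=None):
    """Return an EnvInfo with every key present, using defaults for missing ones."""
    partial = partial or {}
    unknown = set(partial) - set(ENV_INFO_DEFAULTS)
    if unknown:
        raise KeyError(f"unexpected env_info keys: {sorted(unknown)}")
    values = {}
    for key, default in ENV_INFO_DEFAULTS.items():
        # Cast to the default's type so the field type never changes between steps.
        values[key] = type(default)(partial[key]) if key in partial else default
    return EnvInfo(**values)
```

Under these assumptions, the environment's step() would end with something like `return obs, reward, done, fill_env_info(extras)`, where extras is None or a dict holding only the fields that changed on that step.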
But if you have wildly variable types of data coming out through env_info, then...

2) Another thing you could use is the trajectory info, which stores whatever you want on a per-environment, per-trajectory basis. This does not use pre-allocated memory, and is sent to the master process whenever a trajectory ends. The trajectory info is grabbed immediately after the environment steps, for example here: https://github.com/astooke/rlpyt/blob/a0f1c3045eac1b12d6305b35200139f9ee2a63cd/rlpyt/samplers/parallel/cpu/collectors.py#L43
So you could write a custom TrajInfo class that pops out of env_info any fields which aren't appropriate for the pre-allocated buffer. (The hacky bit is that you need to make sure that the first time your env.step() is called, it outputs only the appropriate keys and values to use as examples for the pre-allocation of the buffer.)
You might want to do something other than dump all the resulting traj_info into tabular form, which happens here:
https://github.com/astooke/rlpyt/blob/a0f1c3045eac1b12d6305b35200139f9ee2a63cd/rlpyt/runners/minibatch_rl.py#L212
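A rough sketch of the popping idea, written as a standalone stand-in rather than a subclass of the real TrajInfo. It assumes a step() hook that receives the same arguments the collector passes at the line linked above, and it assumes env_info arrives as a mutable dict (a namedtuple cannot be popped from, so a real environment would need some adapter). FIXED_KEYS, EventTrajInfo, and the events attribute are hypothetical names:

```python
# Hypothetical: the only keys that should reach the pre-allocated buffer.
FIXED_KEYS = ("opponent_turns",)


class EventTrajInfo:
    """Stand-in for a custom TrajInfo; collects variable-sized extras per trajectory."""

    def __init__(self):
        self.Length = 0
        self.Return = 0.0
        self.events = []  # one (step_index, extras) entry per step that had extras

    def step(self, observation, action, reward, done, agent_info, env_info):
        self.Length += 1
        self.Return += reward
        # Remove everything that does not belong in the fixed-shape buffer,
        # leaving env_info with only FIXED_KEYS for the caller to store.
        extras = {k: env_info.pop(k) for k in list(env_info) if k not in FIXED_KEYS}
        if extras:
            self.events.append((self.Length, extras))
```

Because events is a plain Python list, it would not fit the tabular dump mentioned above; writing it to a file when the trajectory ends is one alternative.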
Let us know if either of these helps, or if you have other questions while chasing down these bits!
p.s. my name is Adam ;) appreciate the non-username touch tho :)
@bpiv400 Curious if any of this ended up working?
Another option I forgot to mention: if the SerialSampler is fast enough for you, you don't actually need pre-allocated buffers, and you could have variable-sized arrays. It would just take a bit of rewriting of a Collector to construct new arrays at each iteration.
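To illustrate what "construct new arrays at each iteration" could look like, here is a toy collection loop that builds a fresh list on every call, so env_info can be None or any size. This is not rlpyt's Collector interface; collect_batch, ToyEnv, and the sample dict layout are all made up for the example:

```python
def collect_batch(env, policy, batch_T):
    """Step one env for batch_T steps, building a new list of samples each call."""
    samples = []
    obs = env.reset()
    for _ in range(batch_T):
        action = policy(obs)
        next_obs, reward, done, env_info = env.step(action)
        # Nothing is pre-allocated, so env_info may differ from step to step.
        samples.append({"observation": obs, "action": action, "reward": reward,
                        "done": done, "env_info": env_info})
        obs = env.reset() if done else next_obs
    return samples


class ToyEnv:
    """Hypothetical env whose env_info is None on most steps."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        # Every third step reports a variable-length list; other steps report nothing.
        env_info = list(range(self.t)) if self.t % 3 == 0 else None
        return self.t, 1.0, self.t >= 5, env_info


batch = collect_batch(ToyEnv(), policy=lambda obs: 0, batch_T=7)
```

The trade-off is that allocation happens on every iteration and nothing lives in shared memory, which is why this only makes sense in the serial case.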
@astooke I'm not sure yet. I tabled this while working on a few other parts of our project. I'll let you know (and may have more questions) when I revisit this in the next couple weeks.