Understanding batch_T
Hi,
> batch_T (int) – number of time-steps per sample batch

I don't understand the effect of `batch_T` in the samplers. I see another `batch_T` in R2D1 too. What is the difference between them, how are they related, and how should these two values be set? The same question applies to the `batch_B` values of R2D1 and its sampler.

I want to understand the effect of this parameter, `batch_T`, especially in recurrent algorithms such as R2D1 and PPO_LSTM. Does it affect how much memory/history information the LSTM can learn/memorize? Based on the code, the agent trains the LSTM on trajectories of length `batch_T`, so it could limit the time horizon over which the network can memorize information. Should it therefore be set to the average episode length of each environment?
Thank you
My understanding of `batch_T`: "how many time-steps to walk in an environment when collecting data". I got this understanding from `CpuResetCollector`, function `collect_batch()`:
```python
for t in range(self.batch_T): ...
```
There are two nested `for` loops in that section of code: the first walks forward in time within an environment, and the second iterates over all the environments.
Not sure whether my understanding is right; just for your reference.
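To make the loop structure concrete, here is a minimal sketch of what `collect_batch()` roughly does (simplified pseudocode with a gym-style `env.step()` and a hypothetical `agent.step()`; the real `CpuResetCollector` does more bookkeeping):

```python
def collect_batch(envs, agent, batch_T, last_obs):
    """Walk batch_T time-steps in every environment and return the transitions."""
    samples = []
    for t in range(batch_T):              # 1st loop: advance time by batch_T steps
        for b, env in enumerate(envs):    # 2nd loop: step each of the batch_B envs once
            action = agent.step(last_obs[b])            # hypothetical agent call
            obs, reward, done, info = env.step(action)  # gym-style environment step
            samples.append((t, b, obs, action, reward, done))
            last_obs[b] = obs
    return samples, last_obs
```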
@codelast yes, you are right. It is important for recurrent agents. What I understood is that:
- In PPO_LSTM, the sampler collects `batch_B` trajectories of length `batch_T`, and all `batch_B` trajectories are used for training.
- In R2D1, the sampler creates `batch_B` environments and samples `batch_T` steps from each one (I'm not sure about the RNN state; based on the documentation it should store `rnn_state` at some interval, starting from the beginning??). The algorithm's `batch_T` and `batch_B` determine the replay batch size, batch_size = (batch_T + warmup_T) * batch_B. But in the training loop, when it samples from the buffer, I see that the shape of the observation differs from that of `done`: for batch_T=32, batch_B=3, warmup_T=40, the observation has shape [77, 3, x] while `done` has shape [72, 3], and batch_size = (40 + 32) * 3 = 216. I don't understand the difference between 77 and 72.

I also have a question about the `done` values in the sampled batch. For some trajectories it is True for some steps and then False. For PPO_LSTM it was always False at the beginning and then True, which is understandable. For R2D1, based on the documentation, it seems sampling should start from a time-step with a stored `rnn_state`.
So will the steps with done=True not be used for training?
How does the recurrent replay buffer save data? It seems it should store transitions sequentially, then randomly sample a starting point and take a trajectory of length batch_T + warmup_T. So why is the done flag True for several consecutive steps? Does it save batch_T (from the sampler) steps without caring whether the episode is done or not?
@astooke Any chance of a short explanation of the discrepancy between batch_T + warmup_T and the actual sample dimension? For batch_T=2, warmup_T=40, I get sample dimensions of [49, batch_B, 3, 84, 84].
Hi, good questions!
To clarify some earlier questions... in the policy gradient algorithms, like PPO, there is only the sampler's `batch_T` and `batch_B`, and whatever the sampler returns in one iteration forms the minibatch for the algorithm. In replay-based algorithms like DQN, the sampler's `batch_T` and `batch_B` keep the same meaning, the amount of data collected per iteration, but these algorithms also have their own `batch_size` (or, in the case of R2D1, their own `batch_T` and `batch_B`) to determine how much data is replayed from the buffer for each training minibatch.
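To put numbers on that distinction, a small sketch in plain Python (just arithmetic with the values quoted in this thread, not rlpyt API calls):

```python
# Sampler side: how much new data is collected per iteration (same meaning for PPO and R2D1).
sampler_batch_T, sampler_batch_B = 32, 3
collected_per_iter = sampler_batch_T * sampler_batch_B        # 96 new transitions into the buffer

# Algorithm side (replay-based, e.g. R2D1): how much is replayed per training minibatch.
algo_batch_T, algo_batch_B, warmup_T = 32, 3, 40
replay_minibatch = (algo_batch_T + warmup_T) * algo_batch_B   # (32 + 40) * 3 = 216 time-steps

# In PPO there is no second set of values: the sampler's batch IS the training minibatch.
print(collected_per_iter, replay_minibatch)                   # 96 216
```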
Regarding done=True for multiple timesteps: yes, that is because when an environment episode ends during sampling, the environment might not reset until the beginning of the following sampling batch, so that the start of an episode aligns with the interval for storing the RNN state. In the meantime, all the (dummy) data from the inactive environment still gets written to the replay buffer. Populating done=True for all those steps makes it obvious where the new episode actually begins in the buffer, which is the first new step where done=False. And if you look at the `valid_from_done()` function, which generates the mask for the RNN, it masks out all data after the first done=True, so it's OK to have more done=True after that. Kind of a long explanation, but does that make sense?
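For illustration, a simplified re-implementation of that masking idea (a sketch of the behavior, not rlpyt's actual `valid_from_done()` code):

```python
import torch

def valid_from_done_sketch(done):
    """Return a [T, B] float mask: 1.0 up to and including the first done=True
    in each column, 0.0 for every step after it."""
    done = done.float()
    valid = torch.ones_like(done)
    valid[1:] = 1.0 - torch.clamp(torch.cumsum(done[:-1], dim=0), max=1.0)
    return valid

# One env column where done=True repeats while the env waits to reset:
done = torch.tensor([[0.], [0.], [1.], [1.], [1.], [0.]])
print(valid_from_done_sketch(done).squeeze(-1))   # tensor([1., 1., 1., 0., 0., 0.])
```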
@bmazoure The discrepancy between the length of the observations returned is because it also includes the target observations, which extend out to n steps past the agent observations, for n-step returns: https://github.com/astooke/rlpyt/blob/668290d1ca94e9d193388a599d4f719bc3a23fba/rlpyt/replays/sequence/n_step.py#L88
Then inside the R2D1 algorithm it moves one copy of the whole observation set to the GPU once, and then creates sliced views into this data for the agent inputs and target inputs. R2D1's default n_step_return is 5, so that should add up. Sorry, that's a tricky one!
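So the extra length is just the n-step look-ahead, and the agent/target inputs are offset views into one observation tensor. A rough sketch with the shapes from this thread (dummy tensors, hypothetical variable names):

```python
import torch

warmup_T, batch_T, batch_B, n_step = 40, 32, 3, 5
T = warmup_T + batch_T                                 # 72: length of done, reward, etc.

all_observation = torch.zeros(T + n_step, batch_B, 4)  # [77, 3, ...] replayed frames
agent_obs = all_observation[:T]                        # steps [0 : 72] for the online network
target_obs = all_observation[n_step:]                  # steps [5 : 77] for the n-step target
print(all_observation.shape[0], agent_obs.shape[0], target_obs.shape[0])   # 77 72 72
```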
@astooke thank you for your explanation. I have some other questions.

1. Regarding the effect of `batch_T` when sampling from the environment in CpuSampler versus sampling from the replay buffer in R2D1, I still don't understand the concept clearly. When the sampler starts to interact with the env, it starts from the beginning, and if `batch_T` is set to something small, for example 32, it cannot see the states after that time-step, i.e. the final states of an episode in a game with long episodes. For sampling from the replay buffer this is fine, because the sampled sequence can start anywhere with a stored rnn_state, but for sampling from the env my understanding is that `batch_T` can affect training. Is that correct? If so, how should we set it? The average episode length, or something else?

2. When sampling from the env, we have several instances of the environment and the sampler samples until `batch_T`. What happens in the next round of sampling? Does every env start again from scratch, or continue from the final state of the previous sampling round? If it starts from scratch, then the LSTM is initialized with zeros every time, right? I'm comparing this with R2D1's initialization of the RNN hidden state, which uses a stored rnn_state plus warmup. How is this initialization done in PPO_LSTM? From what I understood from the PPO code, it is not initialized with non-zero values.
Update: I explored the code more and found the `reset_if_needed()` method. So it seems that the collector will reset the env only if it is done. So if, for example, batch_T=10 and the env is not done, then in the next sampling period it will continue from the previous state. That's my understanding.
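Roughly, the pattern is something like this (a simplified sketch of the idea, not the actual collector code; `reset_one()` is a hypothetical hook for zeroing one env's RNN state):

```python
def reset_if_needed_sketch(envs, agent, done_flags, last_obs):
    """Reset only the environments whose episode finished; everything else
    (env state and the agent's recurrent state) carries over to the next batch."""
    for b, env in enumerate(envs):
        if done_flags[b]:
            last_obs[b] = env.reset()   # fresh episode for this env only
            agent.reset_one(b)          # hypothetical: zero this env's RNN state
            done_flags[b] = False
    return last_obs
```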
Hi! That is correct, the environment state carries forward to the next sampling batch. The environment only resets when an episode finishes, even if the sampler's `batch_T` is much shorter than that. So the sampler's `batch_T` should have small to no effect on training, whereas the algorithm's `batch_T` can have a large effect, because it is the length of LSTM backprop-through-time during training.
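For intuition, the algorithm's `batch_T` is the unroll length of the LSTM in the training loss, so it bounds how far back in time gradients can reach. A toy sketch in plain PyTorch (not rlpyt code, with a stand-in loss):

```python
import torch
import torch.nn as nn

algo_batch_T, batch_B, obs_dim, hidden = 32, 3, 8, 16
lstm = nn.LSTM(obs_dim, hidden)

obs_seq = torch.randn(algo_batch_T, batch_B, obs_dim)   # replayed sequence [T, B, obs]
init_state = None                                       # or a stored / warmed-up rnn_state
out, _ = lstm(obs_seq, init_state)                      # unrolled over algo_batch_T steps

loss = out.pow(2).mean()    # stand-in for the real TD loss
loss.backward()             # gradients reach back at most algo_batch_T steps
```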
Hope that helps!