rlpyt
How to adjust epsilon (in epsilon greedy) on a per-episode basis, in parallel?
Hi Adam,
Thanks again for getting rlpyt set up.
I am wondering if it is possible to do this when running RL in parallel: within each parallel environment, at the beginning of each new episode, we draw epsilon from a desired distribution (e.g., a uniform distribution between 0.0 and 1.0).
There seem to be two main problems to tackle:
- How to get independent epsilons set up per parallel environment.
- How to enforce a change in epsilons after each episode, within each parallel environment.
I am looking at the epsilon greedy code:
https://github.com/astooke/rlpyt/blob/master/rlpyt/agents/dqn/epsilon_greedy.py
and it seems to expose some functionality that could be used. In particular, it looks like there are vector-valued epsilons, which are of length equal to the number of parallel environments, and I may be able to use this method for setting epsilons:
https://github.com/astooke/rlpyt/blob/a865f6712a049f9fd26500e924114e9582a6a5c2/rlpyt/agents/dqn/epsilon_greedy.py#L82-L83
(It's unclear, though, whether that epsilon can actually be a vector, or whether I need to pass a scalar there and let the code abstract away the parallelism?)
In addition, I am wondering how to detect finished episodes and change the epsilon at that point. It seems like I need to dig fairly deep into the code. The main high-level abstraction for training happens in code like this:
https://github.com/astooke/rlpyt/blob/a865f6712a049f9fd26500e924114e9582a6a5c2/rlpyt/runners/minibatch_rl.py#L252-L263
However, since a single itr produces a batch of samples that may contain finished episodes, I probably need to look at the sampler code instead. That leads to this in the base class, which the subclasses (GPU, etc.) inherit:
https://github.com/astooke/rlpyt/blob/a865f6712a049f9fd26500e924114e9582a6a5c2/rlpyt/samplers/parallel/base.py#L101-L113
I am not quite sure how to proceed from here. It looks like each worker does its own thing, but I don't know where to detect that an episode has finished (an episode, not a life, for games with multiple lives).
Do you have thoughts on how to implement this? Perhaps it is easier to work directly in the epsilon greedy class and detect a finished episode there?
Hi! Sorry for the slow response, but this is an interesting idea!
You are right that the epsilon can be either a scalar or a vector of length equal to the number of parallel environments. Since the epsilon is handled in the agent, I think the easiest way to implement your idea would be to customize the agent's `reset_one()` method: https://github.com/astooke/rlpyt/blob/a865f6712a049f9fd26500e924114e9582a6a5c2/rlpyt/agents/base.py#L269
That method is called within the collector, e.g. https://github.com/astooke/rlpyt/blob/a865f6712a049f9fd26500e924114e9582a6a5c2/rlpyt/samplers/parallel/cpu/collectors.py#L50, whenever an environment is reset (end of episode), and it includes an argument that tells the agent which environment it was. So you could get the current epsilon vector from the agent's distribution: https://github.com/astooke/rlpyt/blob/a865f6712a049f9fd26500e924114e9582a6a5c2/rlpyt/distributions/epsilon_greedy.py#L25 and then change the value at the corresponding index.
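A rough sketch of what that override could look like, as a subclass (the class name is made up; `distribution.epsilon` and `set_epsilon()` are the pieces linked above, and I'm assuming the base `DqnAgent`):

```python
import numpy as np

from rlpyt.agents.dqn.dqn_agent import DqnAgent


class PerEpisodeEpsAgent(DqnAgent):  # hypothetical subclass name
    """Sketch: redraw this env's epsilon each time that env resets."""

    def reset_one(self, idx):
        super().reset_one(idx)  # keep any base-class per-env reset behavior
        # Treat epsilon as a vector; a scalar becomes a length-1 array, in
        # which case idx should be 0 (i.e. one env per worker).
        eps = np.array(self.distribution.epsilon, dtype=np.float32, ndmin=1)
        eps[idx] = np.random.uniform(0.0, 1.0)  # new epsilon for env `idx` only
        self.distribution.set_epsilon(eps if eps.size > 1 else float(eps[0]))
```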
Seems like that will work?
Hi Daniel and Adam! @DanielTakeshi @astooke (I'm the undergrad working on Daniel's project.) I'm following up on this issue and hoping to find an easy fix. I read both of your comments, and I think Adam pointed out a good place to check whether an episode is done: https://github.com/astooke/rlpyt/blob/a865f6712a049f9fd26500e924114e9582a6a5c2/rlpyt/samplers/parallel/cpu/collectors.py#L50
However, this `agent.reset_one(b)` call can't be found in the GPU sampler's collector, even though the nearby code looks pretty much the same. I'm wondering if there's a reason for that, or is it just an unused detail? See https://github.com/astooke/rlpyt/blob/a865f6712a049f9fd26500e924114e9582a6a5c2/rlpyt/samplers/parallel/gpu/collectors.py#L34 and
https://github.com/astooke/rlpyt/blob/a865f6712a049f9fd26500e924114e9582a6a5c2/rlpyt/samplers/parallel/cpu/collectors.py#L45
Another question for both @DanielTakeshi and @astooke is about the end of a life within one game vs. the end of a game that leads to an env reset. Daniel wanted to "redraw epsilon at the start of every new episode"; does that mean a new epsilon is needed after every life, even without an env reset? Adam's docstring here
https://github.com/astooke/rlpyt/blob/a865f6712a049f9fd26500e924114e9582a6a5c2/rlpyt/samplers/parallel/cpu/collectors.py#L15
is a little confusing to me: from the code logic at
https://github.com/astooke/rlpyt/blob/a865f6712a049f9fd26500e924114e9582a6a5c2/rlpyt/samplers/parallel/cpu/collectors.py#L45
it looks like the value `d` only indicates whether a life is lost, and you need `traj_done` to reset the env?
Finally, I think it's easier to implement this epsilon-redraw process on the agent side, but detect episode-done in the sampler's collector and call `collector.agent.redraw()` there. I'll keep working on this and update my progress. Thanks!!
Ah yes, in the GPU case, the agent lives in the master process, not in the sampler worker. So you have to dig elsewhere for the `agent.reset_one()` call: https://github.com/astooke/rlpyt/blob/668290d1ca94e9d193388a599d4f719bc3a23fba/rlpyt/samplers/parallel/gpu/action_server.py#L71
From there, the logic should be the same.
As for the `if getattr(env_info, "traj_done", d):`, that's a correct reading. If the environment provides a `traj_done` signal, the collector defers to that; otherwise it just uses the normal done signal, which is what most environments will use.
Hope that helps!
Thanks @MandiZhao and @astooke. Indeed, if `getattr(env_info, "traj_done", d)` is True, then for games that have lives (e.g., Breakout) that will trigger when the agent loses all lives and has to do a "real" reset. But, since Breakout has lives, each time the agent loses a life that is NOT the final life, `getattr(env_info, "traj_done", d)` is False while we get `d=True` from `env.step(action)`, because there is usually still a partial reset; in Breakout it's when the ball resets to the center. For a game like Pong with one "life" there is no distinction between the two cases.
@astooke I am thinking of modifying the `self.agent.reset_one(idx=b)` method in a subclass so that it will take an extra argument, `truly_done`. That way we could call the method with pseudocode like this:
truly_done = getattr(env_info, "traj_done", d)
self.agent.reset_one(idx=b, truly_done=truly_done)
Then we can reset our epsilon only if `truly_done=True`. Alternatively, if our agents/algos don't implement anything in `agent.reset_one()`, we could avoid triggering the call to this method entirely unless we're "truly" done.
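For concreteness, a sketch of the agent-side counterpart of that proposal (the `truly_done` kwarg and the uniform redraw come from this discussion, not from existing rlpyt API):

```python
import numpy as np

def reset_one(self, idx, truly_done=True):
    """Hypothetical override: only redraw epsilon at a true episode boundary
    (all lives lost), not on the per-life done signal."""
    if truly_done:
        self.distribution.set_epsilon(np.random.uniform(0.0, 1.0))
```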
Hi @astooke I think I am able to get this implemented assuming the CPU sampler. I'm currently doing this after every "done" but it should be easy to change it to every "true episode completion." For now let's just re-draw epsilon from a uniform distribution after each done signal.
I cloned the repository today, installed it, and made these git diff changes, which I put in a pastebin: https://pastebin.com/a1uJYACR (It looks like there are more changes than there really are, but that's partly because my vim strips away excess whitespace.) To summarize, here is how I'm testing:
- I'm using `examples/example_5.py`, which runs DQN, except: (a) I use a CPU sampler for now, (b) I change to MinibatchRl to reduce the need to consider an "eval" epsilon, and (c) I adjust T and B. For example, in the git diff I have T=4 and B=2. There is also n_parallel=2. This means that each worker will be allocated one of the B=2 parallel environments that I requested.
- Inside the DQN agent class: https://github.com/astooke/rlpyt/blob/668290d1ca94e9d193388a599d4f719bc3a23fba/rlpyt/agents/dqn/dqn_agent.py#L18-L22 I'm defining a new method:
+ def reset_one(self, idx):
+ """Testing if we can change the epsilon.
+
+ The `idx` represents the parallel env, within this worker process.
+ Therefore, if there are k parallel workers, each with one env (for k
+ parallel envs), then idx must be equal to 0. But there are cases where
+ if the num of parallel envs is higher than available workers, we have
+ multiple envs per worker.
+ """
+ eps = self.distribution.epsilon
+ import numpy as np
+ eps_targ = np.random.uniform(0, 1)
+ print('\n\nInside reset_one(idx={}), current eps {:.3f}, set to {:.3f}\n'.format(
+ idx, eps, eps_targ))
+ self.distribution.set_epsilon(eps_targ)
to override the parent's `reset_one` method.
- The last step is to stop the epsilon greedy agent from setting the sampling epsilon on its own. That happens at the last line here:
https://github.com/astooke/rlpyt/blob/668290d1ca94e9d193388a599d4f719bc3a23fba/rlpyt/agents/dqn/epsilon_greedy.py#L100-L111
In other words, that method linearly interpolates within the desired epsilon decay range to get the value it wants, so I just comment out its last line. That way, only the `reset_one` method is responsible for setting an epsilon value. I also added a debug print there to report which itr it is.
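Schematically, the edit to that method looks like this (the decay computation is paraphrased rather than the verbatim source; the debug print and the commented-out last line are the actual changes):

```python
# In the epsilon greedy agent's sample_mode() (linked above) -- schematic:
def sample_mode(self, itr):
    super().sample_mode(itr)
    # ... original code linearly interpolates self.eps_sample over the decay range ...
    print("sample_mode(itr={}), eps_sample is: {}".format(itr, self.eps_sample))  # debug
    # self.distribution.set_epsilon(self.eps_sample)  # commented out, so that
    # reset_one() is now the only place that sets the sampling epsilon.
```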
To test, I run `python examples/example_5.py --game breakout`, because Breakout loses lives quickly in many cases, which makes it easy to test my print statements (as you can see in the git diff). I quickly get values like the following, where I copy and paste a consecutive part of the logging output starting from when the current eps is first re-drawn.
Inside reset_one(idx=0), current eps 1.000, set to 0.605
0% [# ] 100% | ETA: 00:00:01sample_mode(itr=5), eps_sample is: 1.0
sample_mode(itr=5), eps_sample is: 1.0
sample_mode(itr=5), eps_sample is: 1.0
Inside reset_one(idx=0), current eps 1.000, set to 0.612
sample_mode(itr=6), eps_sample is: 1.0
sample_mode(itr=6), eps_sample is: 1.0
sample_mode(itr=6), eps_sample is: 1.0
sample_mode(itr=7), eps_sample is: 1.0
sample_mode(itr=7), eps_sample is: 1.0
sample_mode(itr=7), eps_sample is: 1.0
sample_mode(itr=8), eps_sample is: 1.0
sample_mode(itr=8), eps_sample is: 1.0
sample_mode(itr=8), eps_sample is: 1.0
0% [## ] 100% | ETA: 00:00:01sample_mode(itr=9), eps_sample is: 1.0
sample_mode(itr=9), eps_sample is: 1.0
sample_mode(itr=9), eps_sample is: 1.0
sample_mode(itr=10), eps_sample is: 1.0
sample_mode(itr=10), eps_sample is: 1.0
sample_mode(itr=10), eps_sample is: 1.0
Inside reset_one(idx=0), current eps 0.605, set to 0.203
sample_mode(itr=11), eps_sample is: 1.0
sample_mode(itr=11), eps_sample is: 1.0
sample_mode(itr=11), eps_sample is: 1.0
Inside reset_one(idx=0), current eps 0.612, set to 0.881
It seems like this is working successfully, in that one of the parallel envs draws 0.605 as its first epsilon and later draws 0.203, while the other parallel env first draws 0.612 and then 0.881, etc. You can see this happens very quickly (just 11-ish itrs) because Breakout quickly loses lives when the ball goes right past the agent.
I believe this is basically intended behavior.
To keep it simple, I think I'm going to keep the number of workers (`args.n_parallel`) equal to the number of parallel envs `B`, so that there is exactly one env per worker. This way I avoid potentially changing epsilon in the middle of an episode: if there are two environments in one worker and one of them finishes, resetting the (scalar) epsilon would also change it for the second env before it has finished. This could probably be fixed with an epsilon vector, but I'm getting scalars by default, so I might as well stick with those. Does this make sense?
@astooke After looking at the code a bit more carefully, the GPU sampler is a bit more complex, for two reasons. As you pointed out, this calls the agent's reset_one.
https://github.com/astooke/rlpyt/blob/668290d1ca94e9d193388a599d4f719bc3a23fba/rlpyt/samplers/parallel/gpu/action_server.py#L12-L18
Two things:
- Unlike in the CPU case, it does not seem to expose the `env_info`, which I could use to call `getattr(env_info, 'traj_done', d)`. [Of course, this only matters because I'd like to reset epsilon after each "true" episode, not after each lost life.]
- In the CPU sampler, if I ask for 2 parallel workers (`args.n_parallel`) and 2 parallel envs (`B`), then each worker gets one env. Therefore, each of the `act_pyt, rew_pyt, obs_pyt`, etc., in the collector has one item in it, because there's only one env. (If we did 2 workers for 4 envs, then each of those would have two items, and so on.) Suppose I use the same number of parallel workers and parallel envs in the GPU case. The logic is different: in the method above, we have `step_np.action, step_np.done, step_np.reward` and so on, but these cover both envs, i.e., `step_np.done` will be `[True, False]`, `[False, True]`, `[False, False]`, or `[True, True]`, depending on the circumstances, and the other items will likewise have dimension 2 = num envs. Therefore, we must use vector-valued epsilons in the GPU case, and each `reset_one` call should explicitly change only the corresponding index in the epsilon vector.
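To illustrate the vector idea (hypothetical sketch; `step_np.done` covers all B envs as described above, and epsilon is assumed to already be a length-B vector):

```python
import numpy as np

# Done flags for all B envs at this sampling step.
done = np.asarray(step_np.done, dtype=bool)
if done.any():
    # Redraw epsilon only at the indices of the envs that just finished.
    eps = np.array(agent.distribution.epsilon, dtype=np.float32, ndmin=1)
    eps[done] = np.random.uniform(0.0, 1.0, size=int(done.sum()))
    agent.distribution.set_epsilon(eps)
```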
Is there an example of using vector-valued epsilons? Right now it seems like the default is just a single scalar. In that case it may be easier to use the CPU sampler, because if we set the number of parallel envs to be the same as the number of workers, we only have to deal with scalars, as discussed in my prior post.
So it probably is easier to use the CPU sampler. We can close this unless someone else wants to chime in about GPU sampling.
Hi! Yes, this all makes sense. And good catch about changing the agent's `sample_mode()` method.
Indeed, in the GPU sampler, if you want a different epsilon for each environment, then you must use a vector epsilon, because all the environments have their actions sampled together.
As for an example, the setting is kind of buried in one of the config files for R2D1: https://github.com/astooke/rlpyt/blob/668290d1ca94e9d193388a599d4f719bc3a23fba/rlpyt/experiments/configs/atari/dqn/atari_r2d1.py#L67
But it's all in the `EpsilonGreedyAgent`. It always needs an `eps_final` kwarg, and if you also provide an `eps_final_min`, it will make a log-spaced vector of epsilon values according to the number of environments. If you're running the regular GPU sampler, then the `global_B` is just the `batch_B` of the sampler, and the `env_ranks` will just be `list(range(batch_B))`. (But it's set up to support multiple sampler instances in parallel, each with a different assignment of epsilon values.)
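For example, something like this when constructing the agent (the kwarg names are the ones described above; the agent class path and the numeric values are just illustrative):

```python
from rlpyt.agents.dqn.atari.atari_dqn_agent import AtariDqnAgent  # path assumed, as in the examples

# Providing eps_final_min in addition to eps_final makes the agent build a
# log-spaced vector of final epsilon values, one entry per environment.
agent = AtariDqnAgent(
    eps_final=0.1,
    eps_final_min=0.0005,  # illustrative values
)
```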
The vector epsilon should be pretty easy to use, just using the same print/debug statements you already have. If the dimensions don't line up correctly, it'll throw an error anyway (i.e. if you have the wrong-sized vector).
As for the `env_info` not coming through to the GPU sampler, hmm, you could make the step_buffer include an extra field for that if you need it. But I would guess that any time you are resetting the agent (imagine resetting the LSTM state), you would want to change the epsilon then?