
[Bug] DQN Exploration divides by 0 when learn steps are small

PartiallyUntyped opened this issue on Mar 21 '20 · 10 comments

Training a DQN agent for a few steps fails because of a divide by zero bug here:

class LinearSchedule(Schedule):
    # 
    # .... other code ....
    # 
    def value(self, step):
        # fails with ZeroDivisionError when self.schedule_timesteps == 0
        fraction = min(float(step) / self.schedule_timesteps, 1.0)
        return self.initial_p + fraction * (self.final_p - self.initial_p)

This is a consequence of the following code in DQN's learn function:

def learn(self, total_timesteps, ....):
    self.exploration = LinearSchedule(
          schedule_timesteps=int(self.exploration_fraction * total_timesteps),
          initial_p=self.exploration_initial_eps,
          final_p=self.exploration_final_eps)

The bug occurs when self.exploration_fraction * total_timesteps is less than 1, so int() truncates schedule_timesteps to 0.
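
For instance, with the default exploration_fraction of 0.1 (assuming the defaults, just for illustration):

schedule_timesteps = int(0.1 * 1)   # == 0 for learn(1)
# LinearSchedule.value() then divides by this zero-length schedule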

Reproduce

import gym
import stable_baselines as sb

env = gym.make('CartPole-v1')
agent = sb.DQN('MlpPolicy', env)
agent.learn(1)

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/stelios/anaconda3/envs/thesis37/lib/python3.7/site-packages/stable_baselines/deepq/dqn.py", line 201, in learn
    update_eps = self.exploration.value(self.num_timesteps)
  File "/Users/stelios/anaconda3/envs/thesis37/lib/python3.7/site-packages/stable_baselines/common/schedules.py", line 107, in value
    fraction = min(float(step) / self.schedule_timesteps, 1.0)
ZeroDivisionError: float division by zero

Secondary issue

Even if this is fixed, tying the schedule to total_timesteps assumes that the agent will run .learn only one time, when, ideally, the exploration rate should be independent of the number of .learn calls.
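
For example, continuing the reproduce snippet above (a sketch; whether the step counter also resets depends on reset_num_timesteps):

agent.learn(10000)   # epsilon is annealed over int(exploration_fraction * 10000) steps
agent.learn(10000)   # a brand new LinearSchedule is built for this call, so the
                     # exploration rate does not continue where the first call left off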

Proposed solution

class LinearSchedule(Schedule):

    def __init__(self, schedule_timesteps, final_p, initial_p=1.0):
        self.schedule_timesteps = max(schedule_timesteps, 1)
        self.final_p = final_p
        self.initial_p = initial_p
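
With the clamp in place, the schedule behaves roughly like this (a sketch reusing the value() method quoted at the top of the issue):

sched = LinearSchedule(schedule_timesteps=0, final_p=0.02, initial_p=1.0)
sched.value(0)   # 1.0  -- no crash; full exploration on the very first step
sched.value(1)   # 0.02 -- the schedule collapses to final_p immediately afterwards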

PartiallyUntyped · Mar 21 '20

Such a low number of training steps does not really make sense, but indeed there should be a more informative catch for this. I would rather go with an assertion that the total number of exploration steps is more than one, i.e.

assert (self.exploration_fraction * total_timesteps) > 1
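
Spelled out with an error message, the check could look something like this (hypothetical wording; it would sit at the top of DQN.learn(), where self.exploration_fraction and total_timesteps are in scope):

assert int(self.exploration_fraction * total_timesteps) >= 1, (
    "exploration_fraction * total_timesteps must cover at least one timestep, "
    "got int({} * {})".format(self.exploration_fraction, total_timesteps))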

Miffyli · Mar 21 '20

It does make sense when you are implementing hierarchical agents.

To be exact, I am using many individual agents in a structure like 'feudal learning', and I need a way to stop agent.learn when the episode is terminated and propagate the termination to higher-up agents.

If I use the callback.on_step method to return early from .learn, the agent skips the network optimisation.

There isn't a callback or an option for this, so instead I run the agent one step at a time and keep track of the observation, the reward, and whether the environment terminated.

PartiallyUntyped · Mar 21 '20

Ah, right. I see the use there, but I believe this should be fixed with a new algorithm rather than with patches like this to DQN. I do not think the learning would work with a learn of 1 step, as DQN learning is done every n_step steps, right?

Regarding callbacks: You can use the new callbacks (v2.10) to do checks on each step. You could use these to detect when episodes end.
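
For reference, the v2.10 callback API looks roughly like this (a minimal sketch; detecting the episode end from inside _on_step is exactly what the rest of this thread is about):

from stable_baselines.common.callbacks import BaseCallback

class StopOnEpisodeEnd(BaseCallback):
    """Minimal sketch of a v2.10-style callback."""
    def _on_step(self) -> bool:
        # Returning False asks learn() to stop early; returning True keeps training.
        return True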

Miffyli · Mar 21 '20

Indeed, DQN learns every n_step; however, it compares against self.num_timesteps, so it does work:

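# stable_baselines/deepq/dqn.py -- the optimisation step is gated on the global
# self.num_timesteps counter, not on the step index within a single learn() call: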
if can_sample and self.num_timesteps > self.learning_starts \
                        and self.num_timesteps % self.train_freq == 0:

The only callback that exits the loop (which is a condition for my algorithm to work) is on_step. Exiting on on_step means skipping the network update, not storing the transition, and no logging.

PartiallyUntyped · Mar 21 '20

Right, thanks for clarifying this :). I still think this should not be patched around like this; rather, it should be a sanity check with an assertion. If we clip the exploration steps to a minimum of 1, then you will end up with full exploration over that one step. This may be desirable at times, but not always (i.e. sometimes you want zero exploration). You can achieve that by setting the exploration epsilon to zero, but for clarity I do not think we should do quiet clipping like this, and should instead inform the user of the invalid input.

As for your case: you could try an exploration_fraction of zero. That should do the trick. If that raises an issue (i.e. if exploration_fraction cannot be zero), then that should indeed be fixed.

Miffyli · Mar 21 '20

I agree with you. Perhaps an extra argument in learn? For example, explore_over_timesteps, set to None by default. If it is None, it takes exploration_fraction * total_timesteps, keeping the default behavior; if it has a value, that value is used instead. Then assert that explore_over_timesteps >= 1.
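
Roughly, the proposed logic (explore_over_timesteps is hypothetical and not an existing parameter; LinearSchedule is the class quoted above):

def build_exploration_schedule(exploration_fraction, total_timesteps,
                               initial_eps, final_eps,
                               explore_over_timesteps=None):
    # Hypothetical helper sketching the proposal.
    if explore_over_timesteps is None:
        # default: keep the current behavior
        explore_over_timesteps = int(exploration_fraction * total_timesteps)
    assert explore_over_timesteps >= 1, "exploration must span at least one timestep"
    return LinearSchedule(schedule_timesteps=explore_over_timesteps,
                          initial_p=initial_eps,
                          final_p=final_eps)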

PartiallyUntyped · Mar 21 '20

Hello,

This seems to be an issue very specific to a given problem. I would advise you to derive a CustomDQN from DQN to fit your needs (you could also do the update in the on_step() of the callback, as you have access to the model object). I did that here where I needed to train only after each episode.
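
A minimal skeleton of that suggestion (a sketch; CustomDQN is just an illustrative name and the actual override depends on what you need):

from stable_baselines import DQN

class CustomDQN(DQN):
    """Illustrative subclass; override learn() (or other methods) to fit the use case."""
    def learn(self, total_timesteps, **kwargs):
        # e.g. set up your own exploration schedule here before delegating
        return super().learn(total_timesteps, **kwargs)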

I'm against adding an extra argument to learn() (we should keep it consistent) and adding an extra default parameter to the constructor looks confusing.

The second issue is also present in the other algorithms, so I would keep it as is to be consistent (maybe something to look at for v3, #576).

araffin · Mar 22 '20

Thanks @araffin. Due to some other issues that I encountered, I will derive the classes. For future reference, and for anyone else who might encounter this: the callback doesn't have access to done, observation, etc. To access them, one needs to wrap the environment in something that keeps track of everything and access it through BaseCallback.training_env.
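
A rough sketch of that wrapper-plus-callback pattern (hypothetical names throughout; exactly how training_env ends up wrapped depends on the model, so the callback probes it defensively):

import gym
from stable_baselines import DQN
from stable_baselines.common.callbacks import BaseCallback

class TransitionRecorder(gym.Wrapper):
    """Remembers the last transition so a callback can read it back."""
    def __init__(self, env):
        super().__init__(env)
        self.last_obs = None
        self.last_reward = None
        self.last_done = False

    def reset(self, **kwargs):
        self.last_done = False
        self.last_obs = self.env.reset(**kwargs)
        return self.last_obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.last_obs, self.last_reward, self.last_done = obs, reward, done
        return obs, reward, done, info

class StopWhenDone(BaseCallback):
    """Stops learn() when the tracked episode terminates."""
    def _on_step(self) -> bool:
        env = self.training_env
        # If the model wrapped the env in a VecEnv, go through get_attr;
        # otherwise gym.Wrapper forwards the attribute lookup directly.
        if hasattr(env, 'get_attr'):
            return not env.get_attr('last_done')[0]
        return not env.last_done

env = TransitionRecorder(gym.make('CartPole-v1'))
agent = DQN('MlpPolicy', env)
agent.learn(10000, callback=StopWhenDone())   # stops early once the episode terminates

Note the caveat discussed above: stopping via on_step means the step that triggers the stop skips the network update and logging.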

PartiallyUntyped · Mar 22 '20

the callback doesn't have access to

For DQN, it does, through self.locals; for others, you need to wrap it.

araffin · Mar 22 '20

It doesn't. I opened an issue on this.

PartiallyUntyped · Mar 22 '20