
[question] Using keras in Custom Policy

[Open] batu opened this issue 6 years ago • 10 comments

I am trying to use Keras to define my own custom policy; unfortunately, after several hours of trying, I couldn't get it to train on CartPole.

Here is the CustomPolicy example I have modified to work with CartPole, and this trains properly.

import tensorflow as tf
from stable_baselines.common.policies import ActorCriticPolicy

class CustomPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(CustomPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=False)

        with tf.variable_scope("model", reuse=reuse):
            activ = tf.nn.tanh

            extracted_features = tf.layers.flatten(self.processed_obs)

            pi_h = extracted_features
            for i, layer_size in enumerate([64, 64]):
                pi_h = activ(tf.layers.dense(pi_h, layer_size, name='pi_fc' + str(i)))
            pi_latent = pi_h

            vf_h = extracted_features
            for i, layer_size in enumerate([64, 64]):
                vf_h = activ(tf.layers.dense(vf_h, layer_size, name='vf_fc' + str(i)))
            value_fn = tf.layers.dense(vf_h, 1, name='vf')
            vf_latent = vf_h

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None        
        self._setup_init()

Here is the Keras version of my implementation, which runs but does NOT train. Using tf.keras.layers vs. keras.layers doesn't make a difference.

class KerasPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(KerasPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=False)

        with tf.variable_scope("model", reuse=reuse):
            flat = tf.keras.layers.Flatten()(self.processed_obs)

            x = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_0')(flat)
            pi_latent = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_1')(x)

            x1 = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_0')(flat)
            vf_latent = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_1')(x1)

            value_fn = tf.keras.layers.Dense(1, name='vf')(vf_latent)

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None
        self._setup_init()

I tried to ensure both implementations are as close to each other as possible. Any help at this point would be greatly appreciated.

Thank you in advance

Keras version: 2.2.2 · TensorFlow version: 1.12.0 · Stable Baselines version: 2.4.0a

Attached is the minimal code to reproduce the current issue with tensorboard graphs for comparison. custom_model.py.zip

batu · Mar 04 '19 04:03

Hello, I tested your code and ... it worked fine.

See below for minimal code to reproduce (I got reward > 100)

import tensorflow as tf

from stable_baselines import PPO2
from stable_baselines.common.policies import ActorCriticPolicy


class KerasPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(KerasPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=False)

        with tf.variable_scope("model", reuse=reuse):
            flat = tf.keras.layers.Flatten()(self.processed_obs)

            x = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_0')(flat)
            pi_latent = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_1')(x)

            x1 = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_0')(flat)
            vf_latent = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_1')(x1)

            value_fn = tf.keras.layers.Dense(1, name='vf')(vf_latent)

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None
        self._setup_init()

    def step(self, obs, state=None, mask=None, deterministic=False):
        if deterministic:
            action, value, neglogp = self.sess.run([self.deterministic_action, self._value, self.neglogp],
                                                   {self.obs_ph: obs})
        else:
            action, value, neglogp = self.sess.run([self.action, self._value, self.neglogp],
                                                   {self.obs_ph: obs})
        return action, value, self.initial_state, neglogp

    def proba_step(self, obs, state=None, mask=None):
        return self.sess.run(self.policy_proba, {self.obs_ph: obs})

    def value(self, obs, state=None, mask=None):
        return self.sess.run(self._value, {self.obs_ph: obs})

model = PPO2(KerasPolicy, "CartPole-v1", verbose=1)
model.learn(25000)

env = model.get_env()
obs = env.reset()

reward_sum = 0.0
for _ in range(1000):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    reward_sum += reward
    env.render()
    if done:
        print("Reward: ", reward_sum)
        reward_sum = 0.0
        obs = env.reset()

env.close()

I'm using tf-gpu (1.8.0) and the latest version of stable-baselines (2.5.0a0, the gail branch, but this should not affect the results).

araffin · Mar 06 '19 11:03

Hey,

After trying the code, I am getting the same problem.

It seems that under TF 1.12.0 Keras is ignoring the reuse=True of the scope, meaning that the training model does not share all the parameters with the main model and ends up recreating a new, independent model (this is visible under TensorBoard, as the main model only shares 4 tensors with the training model, rather than the 14 with pure TF code).

There isn't much of a fix unfortunately, as Keras seems to be using tf.Variable rather than tf.get_variable (some reading here and here).
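
If you want to check this on your own setup, listing the trainable variables of the built model should make the duplication visible. This is only a quick diagnostic sketch; it assumes the full KerasPolicy class from the snippet above (the one with step/proba_step/value) is in scope.

import tensorflow as tf
from stable_baselines import PPO2

# Build the model with the Keras-based policy, then inspect its graph.
model = PPO2(KerasPolicy, "CartPole-v1", verbose=0)
with model.graph.as_default():
    for var in tf.trainable_variables():
        print(var.name, var.shape)
# With the pure-TF CustomPolicy, each pi_fc/vf_fc kernel and bias appears once.
# When the Keras layers ignore reuse, a second independent set shows up, and the
# copy used for acting never receives the training updates, which would explain
# the "runs but does not train" behaviour.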

hill-a · Mar 06 '19 12:03

@araffin, @hill-a thank you very much for looking into this! This problem has been haunting me for a while. I think the best-case scenario, as a band-aid, is to downgrade to TF 1.8.0.

The difference between tf.get_variable and tf.Variable is very unfortunate... Do you have an intuition as to how stable-baselines might change as the years go on, given that TF 2.0 is placing heavy bets on Keras as the future-facing way of doing things?

batu · Mar 07 '19 18:03

If TF 2.0 were to be Keras-like, in my opinion the fix would be to have policies where the layers are created first, then the observation is passed through in a function like this:

class CustomPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(CustomPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=True)
        
        self._build_kwargs = kwargs

        with tf.variable_scope("model", reuse=self.reuse):
            activ = tf.nn.relu
            self.extracted_features = lambda obs: nature_cnn(obs, **self._build_kwargs)  # applied later, in build()

            self.pi_layers = []
            for i, layer_size in enumerate([128, 128, 128]):
                self.pi_layers.append(tf.layers.Dense(layer_size, activation=activ, name='pi_fc' + str(i)))

            self.vf_layers = []
            for i, layer_size in enumerate([32, 32]):
                self.vf_layers.append(tf.layers.Dense(layer_size, activation=activ, name='vf_fc' + str(i)))

            self.value_fn = tf.layers.Dense(1, name='vf')
        self._setup_init()

    def build(self, obs):
        with tf.variable_scope("model", reuse=self.reuse):
            pi_h = vf_h = self.extracted_features(obs)

            for layer in self.pi_layers:
                pi_h = layer(pi_h)
            pi_latent = pi_h

            for layer in self.vf_layers:
                vf_h = layer(vf_h)
            value_fn = self.value_fn(vf_h)
            vf_latent = vf_h

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None

Of course, quite a bit of the backend would have to change (the init functions of the base policies, and how the models build the policies).
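
As a standalone illustration (independent of stable-baselines) of why this style sidesteps the reuse problem: a layer object created once and called on two different inputs shares its weights, so no variable_scope reuse is needed. A minimal sketch:

import tensorflow as tf

# Create the layer objects once ...
pi_fc0 = tf.keras.layers.Dense(64, activation="tanh", name="pi_fc_0")
pi_fc1 = tf.keras.layers.Dense(64, activation="tanh", name="pi_fc_1")

# ... then "build" the act and train graphs by calling the same instances
# on different inputs (the shape [None, 4] is just CartPole as an example).
obs_act = tf.placeholder(tf.float32, [None, 4], name="obs_act")
obs_train = tf.placeholder(tf.float32, [None, 4], name="obs_train")
act_latent = pi_fc1(pi_fc0(obs_act))
train_latent = pi_fc1(pi_fc0(obs_train))

# Only one kernel and one bias exist per layer:
# 4 trainable variables in the graph, not 8.
print([v.name for v in tf.trainable_variables()])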

hill-a · Mar 08 '19 10:03

Are there any further plans regarding this? Now that we know TF 2.0 is going to drop tf.variable_scope and even handle sessions differently, will everything pretty much have to be rewritten?

michalgregor · May 14 '19 18:05

When I test the code from @araffin using tensorflow-gpu 1.8 and the latest pip install of stable-baselines on Ubuntu 16.04, I get the following error:

python3 test_custom_policy.py 
Creating environment from the given name, wrapped in a DummyVecEnv.
Traceback (most recent call last):
  File "test_custom_policy.py", line 46, in <module>
    model = PPO2(KerasPolicy, "CartPole-v1", verbose=1)
  File "/usr/local/lib/python3.5/dist-packages/stable_baselines/ppo2/ppo2.py", line 100, in __init__
    self.setup_model()
  File "/usr/local/lib/python3.5/dist-packages/stable_baselines/ppo2/ppo2.py", line 133, in setup_model
    n_batch_step, reuse=False, **self.policy_kwargs)
  File "test_custom_policy.py", line 25, in __init__
    self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)
AttributeError: can't set attribute

pirobot · Sep 24 '19 14:09

I would like to add my vote here as well. Will this get fixed at some point, or will we have to wait for the TF 2.0-compatible version? Not being able to use predefined Keras layers means that a ton of really useful model and layer libraries are unusable with stable-baselines, and that model code will be less future-proof and much more difficult to read and maintain. This is a very unfortunate limitation of an otherwise really nice deep RL library.

jckastel · Dec 06 '19 15:12

> When I test the code from @araffin using tensorflow-gpu 1.8 and the latest pip install of stable-baselines on Ubuntu 16.04, I get the following error: AttributeError: can't set attribute

I made some changes to the code, as shown below (mainly switching to the renamed attributes _policy, _proba_distribution, _value_fn and value_flat), and it seems to be working on stable-baselines 2.9.0 with tf-gpu==1.14.x:


import tensorflow as tf
from stable_baselines import PPO2
from stable_baselines.common.policies import ActorCriticPolicy

class KerasPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(KerasPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=False)

        with tf.variable_scope("model", reuse=reuse):
            flat = tf.keras.layers.Flatten()(self.processed_obs)

            x = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_0')(flat)
            pi_latent = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_1')(x)

            x1 = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_0')(flat)
            vf_latent = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_1')(x1)

            value_fn = tf.keras.layers.Dense(1, name='vf')(vf_latent)

            self._proba_distribution, self._policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self._value_fn = value_fn
        self._setup_init()

    def step(self, obs, state=None, mask=None, deterministic=False):
        if deterministic:
            action, value, neglogp = self.sess.run([self.deterministic_action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        else:
            action, value, neglogp = self.sess.run([self.action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        return action, value, self.initial_state, neglogp

    def proba_step(self, obs, state=None, mask=None):
        return self.sess.run(self.policy_proba, {self.obs_ph: obs})

    def value(self, obs, state=None, mask=None):
        return self.sess.run(self.value_flat, {self.obs_ph: obs})

model = PPO2(KerasPolicy, "CartPole-v1", verbose=1, tensorboard_log='./log')

model.learn(25000)

env = model.get_env()
obs = env.reset()

reward_sum = 0.0
for _ in range(1000):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    reward_sum += reward
    env.render()
    if done:
        print("Reward: ", reward_sum)
        reward_sum = 0.0
        obs = env.reset()

env.close()

AvisekNaug · Feb 25 '20 21:02

Running Ubuntu 18.04.2 LTS, Docker 19.03.6 running tensorflow/tensorflow:1.14.0-gpu-py3-jupyter w/ stable_baselines '2.10.0'

FWIW, I cannot get the PPO2 agent to learn CartPole using this Keras policy 'as is', whereas training works fine when I use the default MlpPolicy. Discounted reward chart shown here:

[image: discounted reward chart]

@AvisekNaug, using your code presented above, I would have expected a like-for-like match with the default MlpPolicy, using two dense layers of 64 neurons each. Are you able to get training to occur successfully?
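
For reference, the built-in policy can be configured with two separate 64-unit tanh branches through policy_kwargs, which gives a closer like-for-like baseline (net_arch and act_fun are existing FeedForwardPolicy arguments in stable-baselines 2.x). A minimal sketch:

import tensorflow as tf
from stable_baselines import PPO2

# Default policy configured to mirror the Keras version above:
# no shared layers, two 64-unit tanh layers each for pi and vf.
model = PPO2("MlpPolicy", "CartPole-v1", verbose=1,
             policy_kwargs=dict(net_arch=[dict(pi=[64, 64], vf=[64, 64])],
                                act_fun=tf.nn.tanh))
model.learn(25000)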

jtromans · Apr 08 '20 10:04

> FWIW, I cannot get the PPO2 agent to learn CartPole using this Keras policy 'as is', whereas training works fine when I use the default MlpPolicy. [...] Are you able to get training to occur successfully?

Yeah, it does not, for the reasons discussed by @hill-a: it is an issue with Keras, where the reuse flag of the variable scope does not seem to work as intended. See his response above. I merely tried to answer @pirobot's issue for Stable Baselines 2.10. But yeah, it does not train properly with Keras layers.
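
One untested idea, building on @hill-a's deferred-build suggestion above, would be to create the Keras layer objects once and call the same instances from both policy instantiations (act and train), so the weights are shared without relying on variable_scope reuse. A rough, unverified sketch, with all names illustrative:

import tensorflow as tf
from stable_baselines.common.policies import ActorCriticPolicy

_SHARED_LAYERS = {}  # simple cache; would need resetting before building a new model/graph

def _get_shared_layers():
    # Create the layer objects only once; later instantiations reuse them.
    if not _SHARED_LAYERS:
        _SHARED_LAYERS["flat"] = tf.keras.layers.Flatten()
        _SHARED_LAYERS["pi"] = [tf.keras.layers.Dense(64, activation="tanh", name="pi_fc_%d" % i)
                                for i in range(2)]
        _SHARED_LAYERS["vf"] = [tf.keras.layers.Dense(64, activation="tanh", name="vf_fc_%d" % i)
                                for i in range(2)]
        _SHARED_LAYERS["vf_out"] = tf.keras.layers.Dense(1, name="vf")
    return _SHARED_LAYERS

class SharedKerasPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(SharedKerasPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps,
                                                n_batch, reuse=reuse, scale=False)
        layers = _get_shared_layers()
        with tf.variable_scope("model", reuse=reuse):
            pi_h = vf_h = layers["flat"](self.processed_obs)
            for layer in layers["pi"]:
                pi_h = layer(pi_h)
            for layer in layers["vf"]:
                vf_h = layer(vf_h)
            value_fn = layers["vf_out"](vf_h)

            self._proba_distribution, self._policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_h, vf_h, init_scale=0.01)

        self._value_fn = value_fn
        self._setup_init()

    # step(), proba_step() and value() omitted: they would be identical to the
    # ones in the snippet above.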

AvisekNaug · Apr 14 '20 04:04