
ActorDistributionNetwork with bounded array_specs

Open basvanopheusden opened this issue 4 years ago • 7 comments

When building an ActorDistributionNetwork with bounded array_specs, the network occasionally produces actions that violate the bounds. This seems to be caused by scale_distribution=False on line 48 of actor_distribution_network.py:

  return normal_projection_network.NormalProjectionNetwork(
      action_spec,
      init_means_output_factor=init_means_output_factor,
      std_bias_initializer_value=std_bias_initializer_value,
      scale_distribution=False)

I was able to work around the problem by copying this function, changing scale_distribution to True, and passing it as an argument to the ActorDistributionNetwork initializer, but perhaps we should consider changing the default to True.
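
For reference, here is a minimal sketch of that workaround (the helper name is mine; it assumes ActorDistributionNetwork's continuous_projection_net argument accepts such a factory, and it reuses the keyword arguments shown in the snippet above):

import numpy as np

from tf_agents.networks import actor_distribution_network
from tf_agents.networks import normal_projection_network
from tf_agents.specs import tensor_spec

def squashed_normal_projection_net(action_spec,
                                    init_means_output_factor=0.1,
                                    std_bias_initializer_value=0.0):
    # Same factory as in actor_distribution_network.py, except that
    # scale_distribution=True squashes the distribution into the spec bounds.
    return normal_projection_network.NormalProjectionNetwork(
        action_spec,
        init_means_output_factor=init_means_output_factor,
        std_bias_initializer_value=std_bias_initializer_value,
        scale_distribution=True)

# Example bounded specs, just for illustration.
observation_spec = tensor_spec.BoundedTensorSpec((10,), np.float32, 0.0, 1.0)
action_spec = tensor_spec.BoundedTensorSpec((10,), np.float32, 0.0, 1.0)

actor_net = actor_distribution_network.ActorDistributionNetwork(
    observation_spec,
    action_spec,
    continuous_projection_net=squashed_normal_projection_net)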

basvanopheusden · Oct 09 '19

What NormalProjectionNetwork does is squash the actions with tanh, so they shouldn't go out of bounds even with scale_distribution=False. I am not sure why this can happen. Could you provide more context?

kuanghuei · Oct 09 '19

Here's a minimal example. I create a very simple custom environment with one observation and one action, both 10-element vectors with each entry in [0, 1]. The dynamics are such that no matter what action is taken, the episode terminates with reward 0 after a single step (this is intentionally a very dumb environment).

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import numpy as np

from tf_agents.environments import py_environment
from tf_agents.environments import tf_py_environment
from tf_agents.networks import actor_distribution_network
from tf_agents.trajectories import time_step as ts
from tf_agents.specs import array_spec

tf.compat.v1.enable_v2_behavior()

class TestEnv(py_environment.PyEnvironment):

    def __init__(self):
        self._action_spec = array_spec.BoundedArraySpec(
                shape=(10,), dtype=np.float32, minimum=0, maximum=1, name='action')
        self._observation_spec = array_spec.BoundedArraySpec(
                shape=(10,), dtype=np.float32, minimum=0, maximum=1, name='observation')
        self._reset()

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def _reset(self):
        self._state = np.zeros((10,), dtype=np.float32)
        # The observation must match the (10,)-shaped observation spec.
        return ts.restart(self._state)

    def _step(self, action):
        # Regardless of the action, terminate immediately with reward 0.
        return ts.termination(self._state, reward=0.0)

I then wrap this environment in a TFPyEnvironment, create an ActorDistributionNetwork, and sample an action (without any training, just the initial network weights):

env = tf_py_environment.TFPyEnvironment(TestEnv())

actor_net = actor_distribution_network.ActorDistributionNetwork(
    env.observation_spec(),
    env.action_spec(),
    fc_layer_params=(10, 10, 10)
)

time_step = env.reset()
action_dist, network_state = actor_net(time_step.observation, time_step.step_type, ())
print(action_dist.sample().numpy().flatten())

This yields random outputs that are not always constrained to [0, 1]; for example, one run yields:

[-0.00887287  0.10404605  0.70652753  0.684862   -0.00132495  0.1625691
  1.0059502   0.6492924   1.1910927   0.29230005]

basvanopheusden · Oct 10 '19

Is there any update on this issue? I have an environment that defines its actions as

        self._action_spec = (
            array_spec.BoundedArraySpec(
                shape=(1,), dtype=np.int32, minimum=0, maximum=2, name='action'),
            array_spec.BoundedArraySpec(
                shape=(1,), dtype=np.float32, minimum=0, maximum=1, name='action_pct')
        )

The boundaries of action_pct are also violated (mostly going negative) in the TFPyEnvironment, even though the original PyEnvironment passes validate_py_environment. Is setting scale_distribution to True a valid workaround?

LucCADORET · Feb 08 '20

PPO does not respect action boundaries: https://github.com/openai/baselines/issues/121. The environment is expected to clip action values. DDPG/D4PG clip action values in their policies. SAC handles this nicely with a tanh-squashed action distribution.

If you set scale_distribution to True, it will do tanh squashing. We are adding action clipping to an environment wrapper; until that lands, you can handle it in your own environment or environment wrapper.
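
For example, a minimal sketch of handling it in your own environment, subclassing the TestEnv from the earlier comment (the np.clip approach here is just an illustration, not part of TF-Agents):

import numpy as np

class ClippedTestEnv(TestEnv):

    def _step(self, action):
        # Clip any out-of-bounds action to the action spec before stepping.
        spec = self.action_spec()
        action = np.clip(action, spec.minimum, spec.maximum).astype(spec.dtype)
        return super(ClippedTestEnv, self)._step(action)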

kuanghuei · Mar 02 '20

See: https://github.com/tensorflow/agents/blob/master/tf_agents/environments/wrappers.py#L442
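
A short usage sketch, assuming the class at that line is ActionClipWrapper and reusing the TestEnv from the earlier comment:

from tf_agents.environments import tf_py_environment
from tf_agents.environments import wrappers

# Clip out-of-spec actions before they reach the underlying PyEnvironment.
env = tf_py_environment.TFPyEnvironment(wrappers.ActionClipWrapper(TestEnv()))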

oars · Mar 02 '20

TF-Agents DDPG does clipping in its policy: https://github.com/tensorflow/agents/blob/master/tf_agents/agents/ddpg/ddpg_agent.py#L166. If you are using DDPG, you should be good.

If you are using TF-Agents PPO, you should use the ActionClipWrapper that @oars mentioned above.

kuanghuei · Mar 05 '20

I have encountered the following problem; how can I solve it? TypeError: __init__() got an unexpected keyword argument 'outer_rank'. In call to configurable 'NormalProjectionNetwork' (<class 'tf_agents.networks.normal_projection_network.NormalProjectionNetwork'>)

lqchl · Apr 08 '24