ActorDistributionNetwork with bounded array_specs
When building an ActorDistributionNetwork with bounded array_specs, the network occasionally produces actions that violate the bounds. This seems to be caused by scale_distribution=False at line 48 of actor_distribution_network.py:
return normal_projection_network.NormalProjectionNetwork(
    action_spec,
    init_means_output_factor=init_means_output_factor,
    std_bias_initializer_value=std_bias_initializer_value,
    scale_distribution=False)
I was able to work around the problem by copying this function, changing scale_distribution to True, and passing it as an argument to the ActorDistributionNetwork initializer, but perhaps we should consider changing the default to True.
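For reference, here is a minimal sketch of that workaround; the factory keyword names mirror the snippet quoted above, the default values and the continuous_projection_net argument are assumptions about the TF-Agents version in use, and env stands for the wrapped environment from the repro later in this thread:
from tf_agents.networks import actor_distribution_network
from tf_agents.networks import normal_projection_network

def scaled_normal_projection_net(action_spec,
                                 init_means_output_factor=0.1,
                                 std_bias_initializer_value=0.0):
  # Same call as the library helper quoted above, but with the whole
  # distribution squashed to the action spec instead of only the mean.
  return normal_projection_network.NormalProjectionNetwork(
      action_spec,
      init_means_output_factor=init_means_output_factor,
      std_bias_initializer_value=std_bias_initializer_value,
      scale_distribution=True)

actor_net = actor_distribution_network.ActorDistributionNetwork(
    env.observation_spec(),
    env.action_spec(),
    fc_layer_params=(10, 10, 10),
    continuous_projection_net=scaled_normal_projection_net)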
What NormalProjectionNetwork does is squash actions with tanh, so they shouldn't go out of bounds even with scale_distribution=False. I am not sure why this can happen. Could you provide more context?
Here's a minimal example. I make a very simple custom environment with 1 observation and 1 action, both 10-element vectors with each entry in [0, 1]. The dynamics are such that no matter what action is taken, the episode terminates with reward 0 after a single step (this is intentionally a very dumb environment):
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf

from tf_agents.environments import py_environment
from tf_agents.environments import tf_py_environment
from tf_agents.networks import actor_distribution_network
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts

tf.compat.v1.enable_v2_behavior()


class TestEnv(py_environment.PyEnvironment):
  """Environment whose episodes end after one step, whatever the action."""

  def __init__(self):
    self._action_spec = array_spec.BoundedArraySpec(
        shape=(10,), dtype=np.float32, minimum=0, maximum=1, name='action')
    self._observation_spec = array_spec.BoundedArraySpec(
        shape=(10,), dtype=np.float32, minimum=0, maximum=1, name='observation')
    self._reset()

  def action_spec(self):
    return self._action_spec

  def observation_spec(self):
    return self._observation_spec

  def _reset(self):
    # Observations must match the unbatched observation_spec shape (10,);
    # TFPyEnvironment adds the batch dimension itself.
    self._state = np.zeros((10,), dtype=np.float32)
    return ts.restart(self._state)

  def _step(self, action):
    # Terminate with reward 0 regardless of the action taken.
    return ts.termination(self._state, reward=0.0)
I then wrap this environment in a TFPyEnvironment, create an ActorDistributionNetwork, and sample an action (without any training, just the initial network weights):
env = tf_py_environment.TFPyEnvironment(TestEnv())

actor_net = actor_distribution_network.ActorDistributionNetwork(
    env.observation_spec(),
    env.action_spec(),
    fc_layer_params=(10, 10, 10))

time_step = env.reset()
action_dist, network_state = actor_net(
    time_step.observation, time_step.step_type, ())
print(action_dist.sample().numpy().flatten())
This yields random outputs that are not always constrained to [0, 1]; for example, one run yields:
[-0.00887287 0.10404605 0.70652753 0.684862 -0.00132495 0.1625691
1.0059502 0.6492924 1.1910927 0.29230005]
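A quick follow-up check (a sketch, assuming the variables from the snippet above are still in scope) makes the violation easy to quantify over many samples:
samples = action_dist.sample(1000).numpy()
# Fraction of sampled entries that fall outside the [0, 1] action bounds.
print(((samples < 0.0) | (samples > 1.0)).mean())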
Is there any update on this issue? I have an environment that defines its actions as:
self._action_spec = (
    array_spec.BoundedArraySpec(
        shape=(1,), dtype=np.int32, minimum=0, maximum=2, name='action'),
    array_spec.BoundedArraySpec(
        shape=(1,), dtype=np.float32, minimum=0, maximum=1, name='action_pct'))
The boundaries of action_pct are also violated (mostly going negative) in the TFPyEnvironment, even though the original PyEnvironment passes validate_py_environment. Is setting scale_distribution to True a valid workaround?
PPO does not respect action boundaries: https://github.com/openai/baselines/issues/121. The environment is expected to clip action values. DDPG/D4PG clip action values in their policies. SAC handles this nicely with a tanh-squashed action distribution.
If you set scale_distribution to True, it will do tanh squashing. We are adding action clipping to an environment wrapper; until that lands, you can handle it in your own environment or environment wrapper.
See: https://github.com/tensorflow/agents/blob/master/tf_agents/environments/wrappers.py#L442
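A minimal usage sketch of that wrapper, assuming the ActionClipWrapper defined at the link above and the TestEnv from earlier in this thread:
from tf_agents.environments import tf_py_environment, wrappers

# Clip actions to the action_spec bounds before the underlying env sees them.
clipped_env = tf_py_environment.TFPyEnvironment(
    wrappers.ActionClipWrapper(TestEnv()))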
TF-Agents DDPG does clipping in its policy: https://github.com/tensorflow/agents/blob/master/tf_agents/agents/ddpg/ddpg_agent.py#L166. If you are using DDPG, you should be fine.
If you are using TF-Agents PPO, you should use the ActionClipWrapper that @oars mentioned above.
I have encountered the following error; how can I solve it?
TypeError: __init__() got an unexpected keyword argument 'outer_rank'
  In call to configurable 'NormalProjectionNetwork' (<class 'tf_agents.networks.normal_projection_network.NormalProjectionNetwork'>)