ActorDistributionNetwork with bounded array_specs
When building an ActorDistributionNetwork with bounded array_specs, the network occasionally produces actions that violate the bounds. This seems to be caused by scale_distribution=False at line 48 of actor_distribution_network.py:
return normal_projection_network.NormalProjectionNetwork(
    action_spec,
    init_means_output_factor=init_means_output_factor,
    std_bias_initializer_value=std_bias_initializer_value,
    scale_distribution=False)
I was able to work around the problem by copying this function, changing scale_distribution to True, and passing it as an argument to the ActorDistributionNetwork initializer, but perhaps we should consider changing the default to True.
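For reference, here is a minimal sketch of that workaround; the factory keyword names mirror the snippet quoted above, the default values and the continuous_projection_net argument are assumptions about the TF-Agents version in use, and env stands for the wrapped environment from the repro later in this thread:
from tf_agents.networks import actor_distribution_network
from tf_agents.networks import normal_projection_network

def scaled_normal_projection_net(action_spec,
                                 init_means_output_factor=0.1,
                                 std_bias_initializer_value=0.0):
  # Same call as the library helper quoted above, but with the whole
  # distribution squashed to the action spec instead of only the mean.
  return normal_projection_network.NormalProjectionNetwork(
      action_spec,
      init_means_output_factor=init_means_output_factor,
      std_bias_initializer_value=std_bias_initializer_value,
      scale_distribution=True)

actor_net = actor_distribution_network.ActorDistributionNetwork(
    env.observation_spec(),
    env.action_spec(),
    fc_layer_params=(10, 10, 10),
    continuous_projection_net=scaled_normal_projection_net)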
What NormalProjectionNetwork does is squash actions with tanh, so they shouldn't go out of bounds even with scale_distribution=False. I am not sure why this can happen. Could you provide more context?
Here's a minimal example. I make a very simple custom environment with 1 observation and 1 action, both 10-element vectors with each entry in [0, 1]. The dynamics are such that no matter what action is taken, the episode terminates with reward 0 after a single step (this is intentionally a very dumb environment):
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf

from tf_agents.environments import py_environment
from tf_agents.environments import tf_py_environment
from tf_agents.networks import actor_distribution_network
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts

tf.compat.v1.enable_v2_behavior()


class TestEnv(py_environment.PyEnvironment):
  """Environment whose episodes end after one step, whatever the action."""

  def __init__(self):
    self._action_spec = array_spec.BoundedArraySpec(
        shape=(10,), dtype=np.float32, minimum=0, maximum=1, name='action')
    self._observation_spec = array_spec.BoundedArraySpec(
        shape=(10,), dtype=np.float32, minimum=0, maximum=1, name='observation')
    self._reset()

  def action_spec(self):
    return self._action_spec

  def observation_spec(self):
    return self._observation_spec

  def _reset(self):
    # Observations must match the unbatched observation_spec shape (10,);
    # TFPyEnvironment adds the batch dimension itself.
    self._state = np.zeros((10,), dtype=np.float32)
    return ts.restart(self._state)

  def _step(self, action):
    # Terminate with reward 0 regardless of the action taken.
    return ts.termination(self._state, reward=0.0)
I then wrap this environment in a TFPyEnvironment, create an ActorDistributionNetwork, and sample an action (without any training, just the initial network weights):
env = tf_py_environment.TFPyEnvironment(TestEnv())

actor_net = actor_distribution_network.ActorDistributionNetwork(
    env.observation_spec(),
    env.action_spec(),
    fc_layer_params=(10, 10, 10))

time_step = env.reset()
action_dist, network_state = actor_net(
    time_step.observation, time_step.step_type, ())
print(action_dist.sample().numpy().flatten())
This yields random outputs that are not always constrained to [0, 1]; for example, one run yields:
[-0.00887287 0.10404605 0.70652753 0.684862 -0.00132495 0.1625691
1.0059502 0.6492924 1.1910927 0.29230005]
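A quick follow-up check (a sketch, assuming the variables from the snippet above are still in scope) makes the violation easy to quantify over many samples:
samples = action_dist.sample(1000).numpy()
# Fraction of sampled entries that fall outside the [0, 1] action bounds.
print(((samples < 0.0) | (samples > 1.0)).mean())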
Is there any update on this issue? I have an environment that defines its actions as:
self._action_spec = (
    array_spec.BoundedArraySpec(
        shape=(1,), dtype=np.int32, minimum=0, maximum=2, name='action'),
    array_spec.BoundedArraySpec(
        shape=(1,), dtype=np.float32, minimum=0, maximum=1, name='action_pct'))
The boundaries of action_pct are also violated (mostly going negative) in the TFPyEnvironment, even though the original PyEnvironment passes validate_py_environment. Is setting scale_distribution to True a valid workaround?
PPO does not respect action boundaries: https://github.com/openai/baselines/issues/121. The environment is expected to clip action values. DDPG/D4PG clip action values in their policies. SAC handles this nicely with a tanh-squashed action distribution.
If you set scale_distribution to True, it will do tanh squashing. We are adding action clipping to an environment wrapper; until that lands, you can handle it in your own environment or environment wrapper.
See: https://github.com/tensorflow/agents/blob/master/tf_agents/environments/wrappers.py#L442
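A minimal usage sketch of that wrapper, assuming the ActionClipWrapper defined at the link above and the TestEnv from earlier in this thread:
from tf_agents.environments import tf_py_environment, wrappers

# Clip actions to the action_spec bounds before the underlying env sees them.
clipped_env = tf_py_environment.TFPyEnvironment(
    wrappers.ActionClipWrapper(TestEnv()))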
TF-Agents DDPG does clipping in its policy: https://github.com/tensorflow/agents/blob/master/tf_agents/agents/ddpg/ddpg_agent.py#L166. If you are using DDPG, you should be fine.
If you are using TF-Agents PPO, you should use the ActionClipWrapper that @oars mentioned above.
I have encountered the following error; how can I solve it?
TypeError: __init__() got an unexpected keyword argument 'outer_rank'
  In call to configurable 'NormalProjectionNetwork' (<class 'tf_agents.networks.normal_projection_network.NormalProjectionNetwork'>)