[PPO] Trajectory action out of range
Hello,
I'm trying to implement a PPO agent on a custom environment whose action space is a Discrete spaces object with bounds [0, 4), but the agent's policy is choosing a number out of range.
action_space = spaces.Discrete(4)
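For reference, once this Gym space is wrapped for tf-agents (e.g. via suite_gym), the action spec should be equivalent to something like the following sketch, assuming the standard GymWrapper conversion:

import numpy as np
from tf_agents.specs import array_spec

# Discrete(4) corresponds to the integers 0..3, inclusive on both ends.
action_spec = array_spec.BoundedArraySpec(
    shape=(), dtype=np.int64, minimum=0, maximum=3, name='action')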
I created two new networks
actor = actor_distribution_network.ActorDistributionNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=fc_layer_params,
    preprocessing_layers=preprocess_layers,
    activation_fn=customReLU,
    preprocessing_combiner=tf.keras.layers.Concatenate()
)
value_ = value_network.ValueNetwork(
    train_env.observation_spec(),
    activation_fn=customReLU,
    preprocessing_layers=preprocess_layers,
    preprocessing_combiner=tf.keras.layers.Concatenate()
)
agent = ppo_agent.PPOAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    actor_net=actor,
    value_net=value_,
    optimizer=optimizer,
    use_gae=True,
    use_td_lambda_return=True,
)
and verified that the bounds of the action_spec on the resulting agent are [0, 4).
However, during my training loop:
collect_driver = dynamic_episode_driver.DynamicEpisodeDriver(
    train_env,
    agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_episodes=10
)

for step in range(num_iterations):
    collect_driver.run()
    trajectories = replay_buffer.gather_all()
    train_loss = agent.train(experience=trajectories).loss
    replay_buffer.clear()
The loop runs anywhere from 0 to 400 iterations with no problems, but eventually I end up getting an InvalidArgumentError:
InvalidArgumentError: Received a label value of 4 which is outside the valid range of [0, 4). Label values: 4 4...[Op:SparseSoftmaxCrossEntropyWithLogits]
Upon further inspection of the trajectories, it seems that the agent policy is outputting values outside of the action bounds.
action=<tf.Tensor: shape=(1, 60), dtype=int64, numpy= array([[4, 4, ...], dtype=int64)>
I initially thought it was a problem with the networks' activation function and created a bounded ReLU to limit the output, but I still get the same problem.
Would this be an issue with my environment or the networks I set up?
I'm getting the same exact problem with my environment. Were you able to find a solution?
@ZakSingh Not yet. The weird thing is that I don't run into the same problem with the DQN.
@summer-yue PTAL?
I can confirm this issue for continuous values as well, on tf-agents v0.7.1. This seems linked to #121 and #216.
@egonina current rotation; ptal?
Perhaps ActorDistributionNetwork does not respect boundary values?
I think the problem is in the construction of the Discrete output distribution here. Can you link to a gist with the full traceback of the error?
Don't think this will be helpful - Stack Trace.
Also, the output of print(self.action_spec()):
BoundedArraySpec(shape=(2,), dtype=dtype('float32'), name='action', minimum=-1.0, maximum=1.0)
So I expected the action to be in the range [-1, 1], but got (-1.2548504, 0.55205715).
As mentioned here, PPO does not handle action bound clipping.
Have you tried the workaround in https://github.com/tensorflow/agents/issues/216 ?
@kuanghuei can you PTAL as well since you're more familiar with PPO and have context on previous issues. Thanks!
For my case, I just clipped action values in my env.
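A minimal sketch of that workaround, assuming a custom PyEnvironment (_apply_action here is a hypothetical helper standing in for the real transition logic):

import numpy as np

# Inside the custom environment: clip incoming actions to the spec bounds in
# _step so that occasional out-of-range samples from the policy are tolerated.
def _step(self, action):
    clipped = np.clip(action,
                      self.action_spec().minimum,
                      self.action_spec().maximum)
    return self._apply_action(clipped)  # hypothetical helper applying the action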
You can alternatively pass a discrete_projection_net or continuous_projection_net argument to ActorDistributionNetwork: a function that builds a distribution that properly respects your action spec.
For example, if you are using discrete actions, the default is:
def _categorical_projection_net(action_spec, logits_init_output_factor=0.1):
    return categorical_projection_network.CategoricalProjectionNetwork(
        action_spec, logits_init_output_factor=logits_init_output_factor)
But instead you can use something like:
def create_projection_net(action_spec):
    # Number of discrete actions; maximum is inclusive, hence the + 1.
    num_actions = action_spec.maximum - action_spec.minimum + 1
    return tfa.networks.Sequential([
        tf.keras.layers.Dense(num_actions),
        tf.keras.layers.Lambda(
            lambda logits: tfp.distributions.Categorical(
                logits=logits, dtype=action_spec.dtype))
    ])
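Wiring that in would look roughly like the sketch below, reusing the names from the snippets above; discrete_projection_net is the relevant constructor argument:

# Sketch: pass the custom projection factory so the output distribution is
# built over exactly num_actions categories.
actor = actor_distribution_network.ActorDistributionNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=fc_layer_params,
    discrete_projection_net=create_projection_net)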
For a continuous network you could instead emit a TruncatedNormal.
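A rough sketch of such a continuous projection, assuming a bounded action spec; the names, layer sizes, and scale parameterization are illustrative, not from the code above:

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
from tf_agents.networks import sequential

def create_continuous_projection_net(action_spec):
    num_dims = int(np.prod(action_spec.shape))
    return sequential.Sequential([
        # One location and one (pre-softplus) scale per action dimension.
        tf.keras.layers.Dense(2 * num_dims),
        tf.keras.layers.Lambda(lambda params: tfp.distributions.TruncatedNormal(
            loc=params[..., :num_dims],
            scale=tf.nn.softplus(params[..., num_dims:]) + 1e-6,
            low=action_spec.minimum,
            high=action_spec.maximum)),
    ])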
You could also just build a complete Sequential that emits a Lambda creating a Distribution as the full action network, instead of relying on ActorDistributionNetwork. This has been the recommended approach since ~late 2020.
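For the discrete case in this thread, that complete actor network could look roughly like the sketch below (layer sizes are placeholders); the result would be passed as actor_net to PPOAgent:

import tensorflow as tf
import tensorflow_probability as tfp
from tf_agents.networks import sequential

def create_actor_net(action_spec, fc_layer_params=(100, 50)):
    num_actions = action_spec.maximum - action_spec.minimum + 1
    layers = [tf.keras.layers.Dense(n, activation='relu') for n in fc_layer_params]
    layers.append(tf.keras.layers.Dense(num_actions))
    # The final Lambda turns logits into a Categorical over exactly num_actions.
    layers.append(tf.keras.layers.Lambda(
        lambda logits: tfp.distributions.Categorical(
            logits=logits, dtype=action_spec.dtype)))
    return sequential.Sequential(layers)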
@ebrevdo Could you elaborate on how to do that? I was running into this issue in #216, and it continues to pop up occasionally despite the workaround.