
[PPO] Trajectory action out of range

Open pbautista-apt opened this issue 4 years ago • 13 comments

Hello,

I'm trying to implement the PPO agent using a custom environment with a Discrete space bounded to [0, 4), but the agent policy is choosing a number out of range.

action_space = spaces.Discrete(4)
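(For reference, a minimal sketch of the equivalent TF-Agents spec, assuming a py_environment-based custom env; note that the maximum is inclusive, so Discrete(4) corresponds to the values 0 through 3:)

import numpy as np
from tf_agents.specs import array_spec

# Equivalent action spec in TF-Agents terms: four valid actions, 0..3.
action_spec = array_spec.BoundedArraySpec(
    shape=(), dtype=np.int64, minimum=0, maximum=3, name='action')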

I created two new networks:

actor = actor_distribution_network.ActorDistributionNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=fc_layer_params,
    preprocessing_layers=preprocess_layers,
    activation_fn=customReLU,
    preprocessing_combiner=tf.keras.layers.Concatenate(),
)
value_ = value_network.ValueNetwork(
    train_env.observation_spec(),
    activation_fn=customReLU,
    preprocessing_layers=preprocess_layers,
    preprocessing_combiner=tf.keras.layers.Concatenate(),
)
agent = ppo_agent.PPOAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    actor_net=actor,
    value_net=value_,
    optimizer=optimizer,
    use_gae=True,
    use_td_lambda_return=True,
)

and verified that the bounds of the action_spec from the resulting agent are [0, 4).

However, during my training loop:

collect_driver = dynamic_episode_driver.DynamicEpisodeDriver(
    train_env,
    agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_episodes=10
)

for step in range(num_iterations):
    collect_driver.run()
    trajectories = replay_buffer.gather_all()
    train_loss = agent.train(experience=trajectories).loss
    replay_buffer.clear()

The loop runs for anywhere from 0 to 400 iterations with no problems, but eventually I get an InvalidArgumentError:

InvalidArgumentError: Received a label value of 4 which is outside the valid range of [0, 4). Label values: 4 4...[Op:SparseSoftmaxCrossEntropyWithLogits]

Upon further inspection of the trajectories, it seems that the agent policy is outputting values outside the action bounds: action=<tf.Tensor: shape=(1, 60), dtype=int64, numpy=array([[4, 4, ...], dtype=int64)>

I initially thought it was a problem with the networks' activation function and created a bounded ReLU to limit the output, but I still get the same problem.

Would this be an issue with my environment or the networks I set up?

pbautista-apt avatar Feb 18 '21 00:02 pbautista-apt

I'm getting the exact same problem with my environment. Were you able to find a solution?

ZakSingh avatar Mar 04 '21 22:03 ZakSingh

@ZakSingh Not yet. The weird thing is that I don't run into the same problem with the DQN.

pbautista-apt avatar Mar 05 '21 23:03 pbautista-apt

@summer-yue PTAL?

ebrevdo avatar Mar 18 '21 05:03 ebrevdo

I can confirm this issue for continuous values as well, on tf-agents v0.7.1. This seems linked to #121 and #216.

RachithP avatar Jun 23 '21 21:06 RachithP

@egonina current rotation; ptal?

ebrevdo avatar Jun 23 '21 22:06 ebrevdo

Perhaps ActorDistributionNetwork does not respect boundary values?

ebrevdo avatar Jun 23 '21 22:06 ebrevdo

I think the problem is in the construction of the Discrete output distribution here. Can you link to a gist with the full traceback of the error?

ebrevdo avatar Jun 23 '21 22:06 ebrevdo

I don't think this will be helpful, but here it is - Stack Trace.

Also, the output of print(self.action_spec()):

BoundedArraySpec(shape=(2,), dtype=dtype('float32'), name='action', minimum=-1.0, maximum=1.0)

So the action was expected to be in the range [-1, 1], but I got (-1.2548504, 0.55205715).

As mentioned here, PPO does not handle action bound clipping.

RachithP avatar Jun 24 '21 00:06 RachithP

Have you tried the workaround in https://github.com/tensorflow/agents/issues/216 ?

@kuanghuei can you PTAL as well since you're more familiar with PPO and have context on previous issues. Thanks!

egonina avatar Jun 24 '21 14:06 egonina

In my case, I just clipped the action values in my env.
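(A minimal sketch of that workaround, assuming a custom py_environment; clip_to_spec is a hypothetical helper, not a TF-Agents API:)

import numpy as np

def clip_to_spec(action, spec):
    # Clip the action into the inclusive [minimum, maximum] range of the
    # BoundedArraySpec before the environment applies it in _step().
    return np.clip(action, spec.minimum, spec.maximum)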

RachithP avatar Jun 24 '21 21:06 RachithP

You can alternatively pass a discrete_projection_net or continuous_projection_net argument to ActorDistributionNetwork: a function that builds a distribution that properly respects your action spec.

For example, if you are using discrete actions, the default is:

from tf_agents.networks import categorical_projection_network

def _categorical_projection_net(action_spec, logits_init_output_factor=0.1):
  return categorical_projection_network.CategoricalProjectionNetwork(
      action_spec, logits_init_output_factor=logits_init_output_factor)

But instead you can use something like:

import tensorflow as tf
import tensorflow_probability as tfp
from tf_agents.networks import sequential

def create_projection_net(action_spec):
  # Bounds are inclusive: Discrete(4) has minimum=0, maximum=3, hence the +1.
  num_actions = action_spec.maximum - action_spec.minimum + 1
  return sequential.Sequential([
      tf.keras.layers.Dense(num_actions),
      tf.keras.layers.Lambda(lambda logits: tfp.distributions.Categorical(
          logits=logits, dtype=action_spec.dtype))
  ])
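For example, wiring it in might look like this (a hedged sketch reusing the names from the original post; create_projection_net is the function above):

actor = actor_distribution_network.ActorDistributionNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=fc_layer_params,
    discrete_projection_net=create_projection_net)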

For a continuous network you could instead emit a TruncatedNormal.
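(A minimal sketch of that idea, assuming a flat bounded continuous spec like the (2,)-shaped one above; the parameterization and softplus scale are illustrative choices, not a library default:)

import tensorflow as tf
import tensorflow_probability as tfp
from tf_agents.networks import sequential

def create_continuous_projection_net(action_spec):
    num_dims = action_spec.shape[0]  # e.g. 2 for the (2,) spec above
    return sequential.Sequential([
        # One loc and one (positive) scale parameter per action dimension.
        tf.keras.layers.Dense(2 * num_dims),
        tf.keras.layers.Lambda(lambda params: tfp.distributions.TruncatedNormal(
            loc=params[..., :num_dims],
            scale=tf.nn.softplus(params[..., num_dims:]),
            low=action_spec.minimum,
            high=action_spec.maximum)),
    ])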

ebrevdo avatar Jun 24 '21 22:06 ebrevdo

You could also just build a complete Sequential whose final layer is a Lambda creating a Distribution, and use that as the full actor network instead of relying on ActorDistributionNetwork. This has been the recommended approach since ~late 2020.
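(A minimal sketch of that approach, assuming the 4-action discrete spec from the original post; the hidden layer size is an illustrative choice:)

import tensorflow as tf
import tensorflow_probability as tfp
from tf_agents.networks import sequential

# The whole actor network: the last layer builds the distribution itself,
# so the action bounds are respected by construction.
actor_net = sequential.Sequential([
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(4),  # one logit per valid action, 0..3
    tf.keras.layers.Lambda(lambda logits: tfp.distributions.Categorical(
        logits=logits, dtype=tf.int64)),
])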

ebrevdo avatar Jun 24 '21 22:06 ebrevdo

@ebrevdo Could you elaborate on how to do that? I was running into this issue in #216, and it continues to pop up occasionally despite the workaround.

basvanopheusden avatar Jul 12 '21 21:07 basvanopheusden