Loaded policy eval runs 4 times faster than original policy eval

Open · maxima120 opened this issue 2 years ago · 0 comments

I have built an RL solution largely based on: https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial

After training finished, I ran eval once. Then I saved the policy, loaded it back, and ran eval again. The loaded policy runs several times faster (per step, ~1.53 s vs ~0.23 s). Is this expected? Is there a reasonable explanation?

This is the original eval:

from timeit import default_timer as timer
from datetime import timedelta

start = timer()
avg_return, total_rewards = policy_eval(eval_env, agent.policy, total_eval_episodes)
end = timer()
print('{2} | steps = {0:6}: Average Return = {1:<+9e}, per step: {3}'.format(total_eval_episodes, avg_return, timedelta(seconds=end - start), timedelta(seconds=(end - start) / total_eval_episodes)))

Out:
0:25:27.958856 | steps =   1000: Average Return = -4.910102e-07, per step: 0:00:01.527959

This is the save/load:

import tensorflow as tf
from tf_agents.policies import policy_saver

tf_policy_saver = policy_saver.PolicySaver(agent.policy)
tf_policy_saver.save(policy_dir)
. . .
saved_policy = tf.saved_model.load(policy_dir)
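
As a quick sanity check (my own snippet, not part of the timed runs), I also want to confirm that the loaded policy picks the same action as the in-memory one for the same time step, since agent.policy is the greedy DQN policy and should be deterministic:

sample_step = eval_env.reset()
# Both should print the same action if the two policies are equivalent.
in_memory_action = agent.policy.action(
    sample_step, agent.policy.get_initial_state(eval_env.batch_size)).action
loaded_action = saved_policy.action(
    sample_step, saved_policy.get_initial_state(eval_env.batch_size)).action
print(in_memory_action.numpy(), loaded_action.numpy())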

Eval using loaded policy:

start = timer()
avg_return2, total_rewards2 = policy_eval(eval_env, saved_policy, total_eval_episodes)
end = timer()
print('{2} | Saved policy: steps = {0:6}: Average Return = {1:<+9e}, per step: {3}'.format(total_eval_episodes, avg_return2, timedelta(seconds=end-start), timedelta(seconds=(end-start)/total_eval_episodes)))

Out:
0:03:47.331221 | Saved policy: steps =   1000: Average Return = -7.780847e-07, per step: 0:00:00.227331

Eval function (almost the same as in the tutorial):

def policy_eval(environment, policy, num_episodes=10):
  """Runs the policy for num_episodes and returns the average return."""
  total_return = 0.0
  episode_returns = []
  # The DQN policy is stateless, so a single initial state is fine here.
  policy_state = policy.get_initial_state(environment.batch_size)

  for _ in range(num_episodes):
    time_step = environment.reset()

    while not time_step.is_last():
      action_step = policy.action(time_step, policy_state)
      policy_state = action_step.state
      time_step = environment.step(action_step.action)
      total_return += time_step.reward
      # Note: this actually collects per-step rewards, not per-episode returns.
      episode_returns.append(time_step.reward)

  avg_return = total_return / num_episodes
  return avg_return.numpy()[0], episode_returns
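
My current guess (not verified) is that the difference is eager vs. graph execution: PolicySaver exports the policy's action as a tf.function, so the loaded SavedModel runs a traced graph, while agent.policy.action in the first eval executes eagerly, step by step. Below is a minimal sketch of the check I plan to run, wrapping the in-memory policy's action in a tf.function before timing it again (the GraphPolicy wrapper is my own, not a TF-Agents class):

import tensorflow as tf

class GraphPolicy:
  # Thin wrapper (my own) exposing the interface policy_eval expects,
  # with action traced into a graph via tf.function.
  def __init__(self, policy):
    self.action = tf.function(policy.action)
    self.get_initial_state = policy.get_initial_state

start = timer()
avg_return3, _ = policy_eval(eval_env, GraphPolicy(agent.policy), total_eval_episodes)
end = timer()
print('wrapped policy, per step:', timedelta(seconds=(end - start) / total_eval_episodes))

If the wrapped policy is about as fast as the loaded one, the gap is just eager-execution overhead rather than anything about saving and loading itself.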

maxima120 · Sep 20, 2022