Provided checkpoint files not sufficient to restore agent?

Open csherstan opened this issue 7 years ago • 5 comments

For a given game and run (say Qbert/1/) the following files are included:

  • tf_ckpt-199.data-00000-of-00001
  • tf_ckpt-199.index
  • tf_ckpt-199.meta

However, the function dopamine.common.checkpointer.get_latest_checkpoint_number looks for files with sentinel_checkpoint_complete.* to determine the largest checkpoint file to load.

def get_latest_checkpoint_number(base_directory):
  """Returns the version number of the latest completed checkpoint.

  Args:
    base_directory: str, directory in which to look for checkpoint files.

  Returns:
    int, the iteration number of the latest checkpoint, or -1 if none was found.
  """
  glob = os.path.join(base_directory, 'sentinel_checkpoint_complete.*')
  def extract_iteration(x):
    return int(x[x.rfind('.') + 1:])
  try:
    checkpoint_files = tf.gfile.Glob(glob)
  except tf.errors.NotFoundError:
    return -1
  try:
    latest_iteration = max(extract_iteration(x) for x in checkpoint_files)
    return latest_iteration
  except ValueError:
    return -1

The list checkpoint_files is empty.

As a result, the check at dopamine.atari.run_experiment.py:204 fails, and unbundle, which would read in the provided checkpoint files, never gets called on the agent.
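The behaviour described above can be reproduced without TensorFlow. The following is a minimal sketch, assuming the standard-library glob module is an acceptable stand-in for tf.gfile.Glob; the names latest_checkpoint_number and demo are hypothetical and not part of dopamine. It creates a run directory holding only the three provided tf_ckpt-199.* files, then adds a sentinel file to show what the lookup is actually keyed on.

```python
import glob
import os
import tempfile


def latest_checkpoint_number(base_directory):
    """Hypothetical stand-in for get_latest_checkpoint_number (glob, not tf.gfile)."""
    pattern = os.path.join(base_directory, 'sentinel_checkpoint_complete.*')
    # The iteration number is whatever follows the last '.' in the sentinel name.
    iterations = [int(path[path.rfind('.') + 1:]) for path in glob.glob(pattern)]
    return max(iterations) if iterations else -1


def demo():
    """Returns the detected checkpoint number before and after adding a sentinel."""
    provided_files = (
        'tf_ckpt-199.data-00000-of-00001',
        'tf_ckpt-199.index',
        'tf_ckpt-199.meta',
    )
    with tempfile.TemporaryDirectory() as run_dir:
        # Only the TensorFlow checkpoint files, as in the provided downloads.
        for name in provided_files:
            open(os.path.join(run_dir, name), 'w').close()
        without_sentinel = latest_checkpoint_number(run_dir)

        # An empty sentinel file is enough for the lookup to report iteration 199.
        open(os.path.join(run_dir, 'sentinel_checkpoint_complete.199'), 'w').close()
        with_sentinel = latest_checkpoint_number(run_dir)
    return without_sentinel, with_sentinel


if __name__ == '__main__':
    print(demo())
```

This only illustrates the lookup. Adding a sentinel file by hand changes the reported number, but it does not supply the other files that the agent's unbundle expects.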

csherstan avatar Nov 09 '18 03:11 csherstan

Further, in the agent's unbundle function the memory buffer first tries to restore itself, but those files are missing.

Perhaps I'm not understanding how these checkpoint files are intended to be used.

csherstan avatar Nov 09 '18 03:11 csherstan

hi craig, yes, you are correct, this is a bug. unfortunately we are not providing the saved checkpoints for the replay buffer and other non-tensorflow objects. there are a number of reasons, size is one of them. but i agree the code should support reloading only the graph without requiring reloading all of the other objects. i'll try to get a fix out there in the next few days.

psc-g avatar Nov 09 '18 14:11 psc-g

Hi @psc-g, has this issue been fixed? I'm equally stuck trying to reload a saved agent.

harnix avatar May 03 '19 16:05 harnix

thanks for reminding me of this. working on fix now.

psc-g avatar May 03 '19 17:05 psc-g

this is fixed in https://github.com/google/dopamine/commit/76cdae1f858233a8501e2b61095cde54c6f8a214

you should be able to force a specific checkpoint to be reloaded without requiring all the other files by using

--gin_bindings="DQNAgent.allow_partial_reload=True" \
--gin_bindings="checkpointer.get_latest_checkpoint_number=199"

let me know if this fixes your issue or not.

psc-g avatar May 05 '19 14:05 psc-g