Provided checkpoint files not sufficient to restore agent?
For a given game and run (say Qbert/1/) the following files are included:
- tf_ckpt-199.data-00000-of-00001
- tf_ckpt-199.index
- tf_ckpt-199.meta
However, the function dopamine.common.checkpointer.get_latest_checkpoint_number looks for files matching sentinel_checkpoint_complete.* to determine the latest checkpoint to load:
def get_latest_checkpoint_number(base_directory):
  """Returns the version number of the latest completed checkpoint.

  Args:
    base_directory: str, directory in which to look for checkpoint files.

  Returns:
    int, the iteration number of the latest checkpoint, or -1 if none was found.
  """
  glob = os.path.join(base_directory, 'sentinel_checkpoint_complete.*')
  def extract_iteration(x):
    return int(x[x.rfind('.') + 1:])
  try:
    checkpoint_files = tf.gfile.Glob(glob)
  except tf.errors.NotFoundError:
    return -1
  try:
    latest_iteration = max(extract_iteration(x) for x in checkpoint_files)
    return latest_iteration
  except ValueError:
    return -1
Since no sentinel files are included, the list checkpoint_files is empty and the function returns -1.
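For anyone who wants to reproduce this without TensorFlow installed, here is a minimal sketch. It assumes the standard-library glob module matches this pattern the same way tf.gfile.Glob does. The names latest_sentinel_number and make_run_dir are hypothetical helpers written for this illustration, not part of Dopamine.

```python
import glob
import os
import tempfile


def latest_sentinel_number(base_directory):
    """Stand-in for get_latest_checkpoint_number, using stdlib glob (assumption)."""
    pattern = os.path.join(base_directory, 'sentinel_checkpoint_complete.*')
    # Same extraction rule as the original: the integer after the last '.'.
    iterations = [int(path[path.rfind('.') + 1:]) for path in glob.glob(pattern)]
    return max(iterations) if iterations else -1


def make_run_dir(with_sentinel):
    """Creates a temp directory holding empty files named like the provided ones."""
    run_dir = tempfile.mkdtemp()
    names = [
        'tf_ckpt-199.data-00000-of-00001',
        'tf_ckpt-199.index',
        'tf_ckpt-199.meta',
    ]
    if with_sentinel:
        names.append('sentinel_checkpoint_complete.199')
    for name in names:
        open(os.path.join(run_dir, name), 'w').close()
    return run_dir


# Only the tf_ckpt-* files, as in the provided checkpoints: nothing is found.
print(latest_sentinel_number(make_run_dir(with_sentinel=False)))
# With a sentinel file present, the iteration number is picked up.
print(latest_sentinel_number(make_run_dir(with_sentinel=True)))
```

The first call mirrors the provided Qbert/1/ layout, the second shows what the function expects to find.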
As a result, the check on dopamine.atari.run_experiment.py:204 fails and unbundle, which would read in the provided checkpoint files, never gets called on the agent.
Further, in the agent's unbundle function the memory buffer first tries to restore itself, but those files are missing.
Perhaps I'm not understanding how these checkpoint files are intended to be used.
hi craig, yes, you are correct, this is a bug. unfortunately we are not providing the saved checkpoints for the replay buffer and other non-tensorflow objects. there are a number of reasons, size is one of them. but i agree the code should support reloading only the graph without requiring reloading all of the other objects. i'll try to get a fix out there in the next few days.
Hi @psc-g, has this issue been fixed? I'm equally stuck trying to reload a saved agent.
thanks for reminding me of this. working on fix now.
this is fixed in https://github.com/google/dopamine/commit/76cdae1f858233a8501e2b61095cde54c6f8a214
you should be able to force a specific checkpoint to be reloaded without requiring all the other files by using
--gin_bindings="DQNAgent.allow_partial_reload=True" \
--gin_bindings="checkpointer.get_latest_checkpoint_number=199"
let me know if this fixes your issue or not.
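If it is unclear which number to pass in the second binding, a rough sketch like the one below could list the iteration numbers present in a run directory. It assumes the files follow the tf_ckpt-N.index naming shown at the top of this thread. available_tf_checkpoint_numbers is a hypothetical helper, not a Dopamine function, and the temporary directory only stands in for something like Qbert/1/.

```python
import os
import re
import tempfile


def available_tf_checkpoint_numbers(run_directory):
    """Returns the sorted iteration numbers N for which tf_ckpt-N.index exists."""
    pattern = re.compile(r'^tf_ckpt-(\d+)\.index$')
    numbers = []
    for name in os.listdir(run_directory):
        match = pattern.match(name)
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)


# Demo directory with empty files named like the provided checkpoint files.
demo_dir = tempfile.mkdtemp()
for name in ('tf_ckpt-199.data-00000-of-00001',
             'tf_ckpt-199.index',
             'tf_ckpt-199.meta'):
    open(os.path.join(demo_dir, name), 'w').close()

print(available_tf_checkpoint_numbers(demo_dir))
```

The largest number in the returned list is the one to use as the forced checkpoint number.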