acme
proper way to save and load
Hello!
A tutorial on the proper setup of experiments, saving, logging, and loading would be much appreciated! I run into problems when restoring checkpoints.
Currently I am using the following setup:
- working directory is ~/acme
- experiment "fun" happens in ~/acme/fun, where the main.py for the experiment lies:
from absl import app
from absl import flags
import acme
from acme import wrappers
from acme.agents.tf import dqn
import tensorflow as tf
from acme import specs
import sonnet as snt
from acme.testing import fakes
import numpy as np
import acme.tf.networks as networks
import acme.agents.tf.r2d2 as r2d2

flags.DEFINE_integer('n_episodes', 1000, 'number of games')
FLAGS = flags.FLAGS


def main(_):
  class SimpleNetwork(networks.RNNCore):

    def __init__(self, action_spec: specs.DiscreteArray):
      super().__init__(name='r2d2_test_network')
      self._net = snt.DeepRNN([
          snt.Flatten(),
          snt.LSTM(20),
          snt.nets.MLP([50, 50, action_spec.num_values])
      ])

    def __call__(self, inputs, state):
      return self._net(inputs, state)

    def initial_state(self, batch_size: int, **kwargs):
      return self._net.initial_state(batch_size)

    def unroll(self, inputs, state, sequence_length):
      return snt.static_unroll(self._net, inputs, state, sequence_length)

  # Create a fake environment to test with.
  environment = fakes.DiscreteEnvironment(
      num_actions=5,
      num_observations=10,
      obs_shape=(10, 4),
      obs_dtype=np.float32,
      episode_length=10)
  environment_spec = specs.make_environment_spec(environment)

  # Construct the agent.
  agent = r2d2.R2D2(
      environment_spec=environment_spec,
      network=SimpleNetwork(environment_spec.actions),
      batch_size=64,  # smaller possible but bad, bigger (256) super bad
      samples_per_insert=64,
      min_replay_size=1000,
      store_lstm_state=False,
      burn_in_length=4,  # super sensible
      trace_length=5,  # sensible smaller bad bigger too
      replay_period=4,
      checkpoint=True,
      # learning rate has to be lowered to avoid jumping
      learning_rate=1e-4,
  )
  agent._checkpointer._time_delta_minutes = 1.
  # agent._learner._network = tf.saved_model.load("snapshots/network")

  # Run the environment loop.
  loop = acme.EnvironmentLoop(environment, agent)
  loop.run(num_episodes=FLAGS.n_episodes)


if __name__ == '__main__':
  app.run(main)
- the experiment is executed by:
cd ~/acme/fun
python main.py -acme_id=fun
- after that we have snapshots, checkpoints and environment_loops in fun
- restarting the experiment loads the checkpoints mostly without errors, but not always!
- console output for an experiment named rnn-buffer:
python main.py -acme_id=rnn-buffer
2021-03-22 15:17:13.564136: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0322 15:17:19.799459 139818441774912 csv.py:45] Logging to learner/rnn-buffer/logs/logs.csv
2021-03-22 15:17:19.802444: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-22 15:17:19.803318: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-03-22 15:17:19.845075: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-03-22 15:17:19.845119: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (philipp-HP-ZBook-x2-G4): /proc/driver/nvidia/version does not exist
2021-03-22 15:17:19.845558: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-22 15:17:19.846011: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
[reverb/cc/platform/tfrecord_checkpointer.cc:144] Initializing TFRecordCheckpointer in /tmp/tmpovile0aq
[reverb/cc/platform/tfrecord_checkpointer.cc:338] Loading latest checkpoint from /tmp/tmpovile0aq
[reverb/cc/platform/default/server.cc:55] Started replay server on port 19581
WARNING:tensorflow:Entity <function _yield_value at 0x7f29b91c0510> appears to be a generator function. It will not be converted by AutoGraph.
W0322 15:17:20.917869 139818441774912 ag_logging.py:146] Entity <function _yield_value at 0x7f29b91c0510> appears to be a generator function. It will not be converted by AutoGraph.
2021-03-22 15:17:21.381026: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-03-22 15:17:21.399016: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 1999965000 Hz
2021-03-22 15:17:21.423113: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
I0322 15:17:21.447039 139818441774912 savers.py:166] Attempting to restore checkpoint: /home/philipp/acme/rnn-buffer/checkpoints/r2d2_learner/ckpt-4
I0322 15:17:22.869575 139818441774912 csv.py:45] Logging to environment_loop/rnn-buffer/logs/logs.csv
INFO:tensorflow:Assets written to: /home/philipp/acme/rnn-buffer/snapshots/network/assets
I0322 15:17:23.409248 139818441774912 builder_impl.py:775] Assets written to: /home/philipp/acme/rnn-buffer/snapshots/network/assets
I0322 15:17:23.414844 139818441774912 savers.py:156] Saving checkpoint: /home/philipp/acme/rnn-buffer/checkpoints/r2d2_learner
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 168 | Steps = 170 | Steps Per Second = 409.001
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 556 | Steps = 561 | Steps Per Second = 371.802
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 956 | Steps = 961 | Steps Per Second = 438.964
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 1387 | Steps = 1392 | Steps Per Second = 419.263
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 1814 | Steps = 1824 | Steps Per Second = 396.063
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 2244 | Steps = 2256 | Steps Per Second = 472.704
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 2675 | Steps = 2691 | Steps Per Second = 428.340
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 3100 | Steps = 3118 | Steps Per Second = 458.795
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 3530 | Steps = 3550 | Steps Per Second = 453.733
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 3942 | Steps = 3963 | Steps Per Second = 477.548
2021-03-22 15:17:36.672807: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_tensor.cc:175 : Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /home/philipp/acme/rnn-buffer/checkpoints/r2d2_learner/ckpt-4
Traceback (most recent call last):
File "main.py", line 119, in <module>
app.run(main)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "main.py", line 102, in main
loop.run(num_episodes=FLAGS.n_episodes)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/acme/environment_loop.py", line 153, in run
result = self.run_episode()
File "/home/philipp/acme/venv/lib/python3.6/site-packages/acme/environment_loop.py", line 101, in run_episode
self._actor.update()
File "/home/philipp/acme/venv/lib/python3.6/site-packages/acme/agents/tf/r2d2/agent.py", line 148, in update
super().update()
File "/home/philipp/acme/venv/lib/python3.6/site-packages/acme/agents/agent.py", line 87, in update
self._learner.step()
File "/home/philipp/acme/venv/lib/python3.6/site-packages/acme/agents/tf/r2d2/learning.py", line 205, in step
results = self._step()
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 871, in _call
self._initialize(args, kwds, add_initializers_to=initializers)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
*args, **kwds))
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3887, in bound_method_wrapper
return wrapped_fn(*args, **kwargs)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.NotFoundError: in user code:
/home/philipp/acme/venv/lib/python3.6/site-packages/acme/agents/tf/r2d2/learning.py:183 _step *
self._optimizer.apply(gradients, self._network.trainable_variables)
/home/philipp/acme/venv/lib/python3.6/site-packages/sonnet/src/utils.py:64 _decorate_unbound_method *
return decorator_fn(bound_method, self, args, kwargs)
/home/philipp/acme/venv/lib/python3.6/site-packages/sonnet/src/base.py:272 wrap_with_name_scope *
return method(*args, **kwargs)
/home/philipp/acme/venv/lib/python3.6/site-packages/sonnet/src/optimizers/adam.py:118 apply *
self._initialize(parameters)
/home/philipp/acme/venv/lib/python3.6/site-packages/sonnet/src/utils.py:64 _decorate_unbound_method *
return decorator_fn(bound_method, self, args, kwargs)
/home/philipp/acme/venv/lib/python3.6/site-packages/sonnet/src/once.py:93 wrapper *
_check_no_output(wrapped(*args, **kwargs))
/home/philipp/acme/venv/lib/python3.6/site-packages/sonnet/src/utils.py:64 _decorate_unbound_method *
return decorator_fn(bound_method, self, args, kwargs)
/home/philipp/acme/venv/lib/python3.6/site-packages/sonnet/src/base.py:272 wrap_with_name_scope *
return method(*args, **kwargs)
/home/philipp/acme/venv/lib/python3.6/site-packages/sonnet/src/optimizers/adam.py:92 _initialize *
self.m.extend(zero_var(p) for p in parameters)
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/data_structures.py:575 extend **
super(ListWrapper, self).extend(values)
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/data_structures.py:348 extend
self.append(value)
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/data_structures.py:569 append
super(ListWrapper, self).append(value)
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/data_structures.py:342 append
value = self._track_value(value, self._name_element(len(self._storage)))
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/data_structures.py:640 _track_value
value = super(ListWrapper, self)._track_value(value=value, name=name)
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/data_structures.py:178 _track_value
trackable=self, value=value, name=name)
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/data_structures.py:136 sticky_attribute_assignment
overwrite=True)
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/base.py:909 _track_trackable
self._handle_deferred_dependencies(name=name, trackable=trackable)
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/base.py:942 _handle_deferred_dependencies
checkpoint_position.restore(trackable)
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/base.py:253 restore
restore_ops = trackable._restore_from_checkpoint_position(self) # pylint: disable=protected-access
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/base.py:973 _restore_from_checkpoint_position
tensor_saveables, python_saveables))
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/util.py:308 restore_saveables
validated_saveables).restore(self.save_path_tensor, self.options)
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/saving/functional_saver.py:345 restore
restore_ops = restore_fn()
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/saving/functional_saver.py:321 restore_fn
restore_ops.update(saver.restore(file_prefix, options))
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/training/saving/functional_saver.py:109 restore
file_prefix, tensor_names, tensor_slices, tensor_dtypes)
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py:1499 restore_v2
ctx=_ctx)
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py:1537 restore_v2_eager_fallback
attrs=_attrs, ctx=ctx, name=name)
/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/execute.py:60 quick_execute
inputs, attrs, num_outputs)
NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /home/philipp/acme/rnn-buffer/checkpoints/r2d2_learner/ckpt-4 [Op:RestoreV2]
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.1
W0322 15:17:36.994191 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.1
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.2
W0322 15:17:36.994407 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.2
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.3
W0322 15:17:36.994531 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.3
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.4
W0322 15:17:36.994619 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.4
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.5
W0322 15:17:36.994699 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.5
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.6
W0322 15:17:36.994777 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.6
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.7
W0322 15:17:36.994873 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.7
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.8
W0322 15:17:36.994951 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.8
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.0
W0322 15:17:36.995054 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.0
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.1
W0322 15:17:36.995142 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.1
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.2
W0322 15:17:36.995218 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.2
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.3
W0322 15:17:36.995293 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.3
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.4
W0322 15:17:36.995398 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.4
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.5
W0322 15:17:36.995479 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.5
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.6
W0322 15:17:36.995564 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.6
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.7
W0322 15:17:36.995656 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.7
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.8
W0322 15:17:36.995730 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.8
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0322 15:17:36.995830 139818441774912 util.py:169] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
[reverb/cc/platform/default/server.cc:64] Shutting down replay server
- however, when the experiment is restarted once more, it works. Is there some kind of synchronization problem with restoring?
- is this the right workflow in general, and which bugs and problems with restoring checkpoints are known?
- as said above, a tutorial would be much appreciated. Thanks in advance for your help and for the great work you are sharing with us!
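A note on reading the log above, offered as a hypothesis rather than a confirmed diagnosis: ckpt-4 is restored at 15:17:21, a new checkpoint is saved at 15:17:23, and the NotFoundError for ckpt-4 only appears at 15:17:36, inside _handle_deferred_dependencies while the Adam slot lists are being filled. If the checkpointer keeps only a limited number of recent checkpoints (an assumption about the configuration, not something the log shows), the files for ckpt-4 may already have been deleted by the time the optimizer variables are created and try to read their values. A small way to check this on disk is sketched below. It uses only the standard library; the helper names checkpoint_files and is_complete are made up for this example and are not part of Acme or TensorFlow.

```python
import glob
import os
import tempfile


def checkpoint_files(prefix):
    """Return the files on disk that belong to a TensorFlow checkpoint prefix.

    A TF2 checkpoint saved under the prefix ".../ckpt-4" is stored as
    ".../ckpt-4.index" plus one or more ".../ckpt-4.data-XXXXX-of-YYYYY" shards.
    """
    escaped = glob.escape(prefix)
    return sorted(glob.glob(escaped + ".index") + glob.glob(escaped + ".data-*"))


def is_complete(prefix):
    """True if the index file and at least one data shard are both present."""
    names = [os.path.basename(f) for f in checkpoint_files(prefix)]
    has_index = any(name.endswith(".index") for name in names)
    has_data = any(".data-" in name for name in names)
    return has_index and has_data


if __name__ == "__main__":
    # Demo on a throwaway directory with empty placeholder files that mimic
    # a directory in which only ckpt-5 is left.
    demo_dir = tempfile.mkdtemp()
    for name in ("ckpt-5.index", "ckpt-5.data-00000-of-00001"):
        open(os.path.join(demo_dir, name), "w").close()
    print(is_complete(os.path.join(demo_dir, "ckpt-4")))  # False
    print(is_complete(os.path.join(demo_dir, "ckpt-5")))  # True
```

Calling is_complete with the prefix from the "Attempting to restore checkpoint" log line, right after the crash, would show whether the files named in the error are still there.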
Yes, I would also appreciate an example of a saving/loading workflow. I also often get
NotFoundError: Unsuccessful TensorSliceReader constructor
when trying to load a checkpoint.
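When this error shows up, it can help to compare the checkpoint name in the message with what the checkpoint directory itself records as the latest checkpoint. TensorFlow writes a small text file named checkpoint next to the checkpoint files, with a model_checkpoint_path entry and one all_model_checkpoint_paths entry per retained checkpoint. The sketch below parses that file with the standard library only; read_checkpoint_state is a hypothetical helper name, and the sample contents in the demo are invented for illustration.

```python
import os
import re
import tempfile


def read_checkpoint_state(directory):
    """Parse the plain-text `checkpoint` file found in a checkpoint directory.

    Returns (latest, all_paths). `latest` is None and `all_paths` is empty
    if the file does not exist.
    """
    state_file = os.path.join(directory, "checkpoint")
    if not os.path.exists(state_file):
        return None, []
    with open(state_file) as f:
        text = f.read()
    latest = re.search(r'^model_checkpoint_path:\s*"(.*)"', text, re.MULTILINE)
    all_paths = re.findall(
        r'^all_model_checkpoint_paths:\s*"(.*)"', text, re.MULTILINE)
    return (latest.group(1) if latest else None), all_paths


if __name__ == "__main__":
    # Demo with an invented state file in a throwaway directory.
    demo_dir = tempfile.mkdtemp()
    with open(os.path.join(demo_dir, "checkpoint"), "w") as f:
        f.write('model_checkpoint_path: "ckpt-5"\n')
        f.write('all_model_checkpoint_paths: "ckpt-5"\n')
    latest, retained = read_checkpoint_state(demo_dir)
    print(latest, retained)
```

If the name in the NotFoundError is not among the retained paths, the checkpoint was most likely removed after it had been selected for restoring.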
Hello! Did anybody manage to find a tutorial or understand what is happening, and would you be willing to share? Checkpoints don't work very well for me either, and I don't know how to use them :)