[BUG] compress training cannot restart

Open njzjz opened this issue 3 years ago • 3 comments

Bug summary

Training that was initialized from a compressed model cannot be continued with --restart or --init_model.

DeePMD-kit Version

v2.1.1.dev48+g899d1020.d20220505, i.e. 899d1020

TensorFlow Version

2.7.0

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

dp train input.json -r model.ckpt
2022-05-10 19:09:39.919871: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_v2_ops.cc:207 : NOT_FOUND: Key descrpt_attr/t_avg not found in checkpoint
Traceback (most recent call last):
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1380, in _do_call
    return fn(*args)
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1363, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1456, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) NOT_FOUND: Key descrpt_attr/t_avg not found in checkpoint
	 [[{{node save/RestoreV2}}]]
	 [[save/RestoreV2/_259]]
  (1) NOT_FOUND: Key descrpt_attr/t_avg not found in checkpoint
	 [[{{node save/RestoreV2}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 1404, in restore
    sess.run(self.saver_def.restore_op_name,
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 970, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1193, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1373, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1399, in _do_call
    raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) NOT_FOUND: Key descrpt_attr/t_avg not found in checkpoint
	 [[node save/RestoreV2
 (defined at /home/jz748/codes/deepmd-kit/deepmd/train/trainer.py:401)
]]
	 [[save/RestoreV2/_259]]
  (1) NOT_FOUND: Key descrpt_attr/t_avg not found in checkpoint
	 [[node save/RestoreV2
 (defined at /home/jz748/codes/deepmd-kit/deepmd/train/trainer.py:401)
]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node save/RestoreV2:
In[0] save/Const:	
In[1] save/RestoreV2/tensor_names:	
In[2] save/RestoreV2/shape_and_slices:

Operation defined at: (most recent call last)
>>>   File "/home/jz748/anaconda3/envs/dpdev113/bin/dp", line 33, in <module>
>>>     sys.exit(load_entry_point('deepmd-kit', 'console_scripts', 'dp')())
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/main.py", line 472, in main
>>>     train_dp(**dict_args)
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/train.py", line 106, in train
>>>     _do_work(jdata, run_opt, is_compress)
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/train.py", line 167, in _do_work
>>>     model.train(train_data, valid_data)
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/train/trainer.py", line 445, in train
>>>     self._init_session()
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/train/trainer.py", line 401, in _init_session
>>>     self.saver = tf.train.Saver(save_relative_paths=True)
>>> 

Input Source operations connected to node save/RestoreV2:
In[0] save/Const:	
In[1] save/RestoreV2/tensor_names:	
In[2] save/RestoreV2/shape_and_slices:

Operation defined at: (most recent call last)
>>>   File "/home/jz748/anaconda3/envs/dpdev113/bin/dp", line 33, in <module>
>>>     sys.exit(load_entry_point('deepmd-kit', 'console_scripts', 'dp')())
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/main.py", line 472, in main
>>>     train_dp(**dict_args)
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/train.py", line 106, in train
>>>     _do_work(jdata, run_opt, is_compress)
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/train.py", line 167, in _do_work
>>>     model.train(train_data, valid_data)
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/train/trainer.py", line 445, in train
>>>     self._init_session()
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/train/trainer.py", line 401, in _init_session
>>>     self.saver = tf.train.Saver(save_relative_paths=True)
>>> 

Original stack trace for 'save/RestoreV2':
  File "/home/jz748/anaconda3/envs/dpdev113/bin/dp", line 33, in <module>
    sys.exit(load_entry_point('deepmd-kit', 'console_scripts', 'dp')())
  File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/main.py", line 472, in main
    train_dp(**dict_args)
  File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/train.py", line 106, in train
    _do_work(jdata, run_opt, is_compress)
  File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/train.py", line 167, in _do_work
    model.train(train_data, valid_data)
  File "/home/jz748/codes/deepmd-kit/deepmd/train/trainer.py", line 445, in train
    self._init_session()
  File "/home/jz748/codes/deepmd-kit/deepmd/train/trainer.py", line 401, in _init_session
    self.saver = tf.train.Saver(save_relative_paths=True)
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 923, in __init__
    self.build()
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 935, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 963, in _build
    self.saver_def = self._builder._build_internal(  # pylint: disable=protected-access
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 533, in _build_internal
    restore_op = self._AddRestoreOps(filename_tensor, saveables,
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 353, in _AddRestoreOps
    all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 601, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1501, in restore_v2
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 3698, in _create_op_internal
    ret = Operation(
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 2101, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 70, in get_tensor
    return CheckpointReader.CheckpointReader_GetTensor(
RuntimeError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 1415, in restore
    names_to_keys = object_graph_key_mapping(save_path)
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 1736, in object_graph_key_mapping
    object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 75, in get_tensor
    error_translator(e)
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 35, in error_translator
    raise errors_impl.NotFoundError(None, None, error_message)
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jz748/anaconda3/envs/dpdev113/bin/dp", line 33, in <module>
    sys.exit(load_entry_point('deepmd-kit', 'console_scripts', 'dp')())
  File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/main.py", line 472, in main
    train_dp(**dict_args)
  File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/train.py", line 106, in train
    _do_work(jdata, run_opt, is_compress)
  File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/train.py", line 167, in _do_work
    model.train(train_data, valid_data)
  File "/home/jz748/codes/deepmd-kit/deepmd/train/trainer.py", line 445, in train
    self._init_session()
  File "/home/jz748/codes/deepmd-kit/deepmd/train/trainer.py", line 418, in _init_session
    self.saver.restore (self.sess, self.run_opt.restart)
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 1420, in restore
    raise _wrap_restore_error_with_msg(
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

2 root error(s) found.
  (0) NOT_FOUND: Key descrpt_attr/t_avg not found in checkpoint
	 [[node save/RestoreV2
 (defined at /home/jz748/codes/deepmd-kit/deepmd/train/trainer.py:401)
]]
	 [[save/RestoreV2/_259]]
  (1) NOT_FOUND: Key descrpt_attr/t_avg not found in checkpoint
	 [[node save/RestoreV2
 (defined at /home/jz748/codes/deepmd-kit/deepmd/train/trainer.py:401)
]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node save/RestoreV2:
In[0] save/Const:	
In[1] save/RestoreV2/tensor_names:	
In[2] save/RestoreV2/shape_and_slices:

Operation defined at: (most recent call last)
>>>   File "/home/jz748/anaconda3/envs/dpdev113/bin/dp", line 33, in <module>
>>>     sys.exit(load_entry_point('deepmd-kit', 'console_scripts', 'dp')())
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/main.py", line 472, in main
>>>     train_dp(**dict_args)
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/train.py", line 106, in train
>>>     _do_work(jdata, run_opt, is_compress)
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/train.py", line 167, in _do_work
>>>     model.train(train_data, valid_data)
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/train/trainer.py", line 445, in train
>>>     self._init_session()
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/train/trainer.py", line 401, in _init_session
>>>     self.saver = tf.train.Saver(save_relative_paths=True)
>>> 

Input Source operations connected to node save/RestoreV2:
In[0] save/Const:	
In[1] save/RestoreV2/tensor_names:	
In[2] save/RestoreV2/shape_and_slices:

Operation defined at: (most recent call last)
>>>   File "/home/jz748/anaconda3/envs/dpdev113/bin/dp", line 33, in <module>
>>>     sys.exit(load_entry_point('deepmd-kit', 'console_scripts', 'dp')())
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/main.py", line 472, in main
>>>     train_dp(**dict_args)
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/train.py", line 106, in train
>>>     _do_work(jdata, run_opt, is_compress)
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/train.py", line 167, in _do_work
>>>     model.train(train_data, valid_data)
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/train/trainer.py", line 445, in train
>>>     self._init_session()
>>> 
>>>   File "/home/jz748/codes/deepmd-kit/deepmd/train/trainer.py", line 401, in _init_session
>>>     self.saver = tf.train.Saver(save_relative_paths=True)
>>> 

Original stack trace for 'save/RestoreV2':
  File "/home/jz748/anaconda3/envs/dpdev113/bin/dp", line 33, in <module>
    sys.exit(load_entry_point('deepmd-kit', 'console_scripts', 'dp')())
  File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/main.py", line 472, in main
    train_dp(**dict_args)
  File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/train.py", line 106, in train
    _do_work(jdata, run_opt, is_compress)
  File "/home/jz748/codes/deepmd-kit/deepmd/entrypoints/train.py", line 167, in _do_work
    model.train(train_data, valid_data)
  File "/home/jz748/codes/deepmd-kit/deepmd/train/trainer.py", line 445, in train
    self._init_session()
  File "/home/jz748/codes/deepmd-kit/deepmd/train/trainer.py", line 401, in _init_session
    self.saver = tf.train.Saver(save_relative_paths=True)
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 923, in __init__
    self.build()
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 935, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 963, in _build
    self.saver_def = self._builder._build_internal(  # pylint: disable=protected-access
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 533, in _build_internal
    restore_op = self._AddRestoreOps(filename_tensor, saveables,
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 353, in _AddRestoreOps
    all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 601, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1501, in restore_v2
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 3698, in _create_op_internal
    ret = Operation(
  File "/home/jz748/anaconda3/envs/dpdev113/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 2101, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)

Steps to Reproduce

Switch to examples/water/se_e2_a/:

cd examples/water/se_e2_a/
dp train input.json
# stop after 1000 steps
dp freeze
dp compress
dp train input.json -f frozen_model_compressed.pb
# stop after 1000 steps
dp train input.json -r model.ckpt

Further Information, Files, and Links

No response

njzjz avatar May 10 '22 23:05 njzjz

The model used in compressed training is changed (the embedding net is compressed). One cannot restart the training from a changed model.
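To see which variables differ, one could compare the names the restart graph expects with what the checkpoint actually stores. The following is only a minimal sketch: `missing_from_checkpoint` and `list_checkpoint` are hypothetical helper names, the sample checkpoint contents are made up, and the only TensorFlow call used is `tf.train.list_variables`.

```python
from typing import Iterable, List, Tuple


def missing_from_checkpoint(
    expected: Iterable[str],
    ckpt_vars: Iterable[Tuple[str, list]],
) -> List[str]:
    """Return the expected variable names that the checkpoint does not contain."""
    present = {name for name, _shape in ckpt_vars}
    return sorted(set(expected) - present)


def list_checkpoint(prefix: str):
    """List (name, shape) pairs stored in a checkpoint; requires TensorFlow."""
    import tensorflow as tf  # imported lazily so the helper above works without TF

    return tf.train.list_variables(prefix)


# Made-up checkpoint contents for illustration (not taken from a real run).
fake_ckpt = [("global_step", []), ("layer_0/bias", [4])]
print(missing_from_checkpoint(["descrpt_attr/t_avg", "global_step"], fake_ckpt))
```

With a real run, `list_checkpoint("model.ckpt")` would supply the second argument; the log above suggests `descrpt_attr/t_avg` would show up as missing.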

wanghan-iapcm avatar May 11 '22 01:05 wanghan-iapcm

I've changed the issue type to enhancement. In theory, it's possible to read the model from the frozen model and read the parameters from the checkpoint at the same time.
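A rough sketch of the checkpoint half of that idea, assuming the graph has already been built in the default graph: restore only the variables that the checkpoint actually contains. `restorable_subset` and `partial_restore` are hypothetical names, not DeePMD-kit functions, and reading the compressed parts from the frozen model is not shown here.

```python
from typing import Dict, Iterable, TypeVar

V = TypeVar("V")


def restorable_subset(graph_vars: Dict[str, V], ckpt_names: Iterable[str]) -> Dict[str, V]:
    """Keep only the graph variables whose names also exist in the checkpoint."""
    names = set(ckpt_names)
    return {k: v for k, v in graph_vars.items() if k in names}


def partial_restore(sess, ckpt_prefix: str) -> None:
    """Restore what the checkpoint can provide into the default graph (requires TensorFlow)."""
    import tensorflow as tf  # imported lazily; only needed when restoring

    # Map variable names as stored in checkpoints to the graph variables.
    graph_vars = {v.op.name: v for v in tf.compat.v1.global_variables()}
    ckpt_names = [name for name, _shape in tf.train.list_variables(ckpt_prefix)]
    # A Saver limited to the shared names never asks for keys such as
    # descrpt_attr/t_avg that the checkpoint lacks. Variables left out here
    # still have to be initialized by other means before training continues.
    saver = tf.compat.v1.train.Saver(var_list=restorable_subset(graph_vars, ckpt_names))
    saver.restore(sess, ckpt_prefix)
```

This only avoids the `NOT_FOUND` error; it assumes the shared variables have matching shapes and that the checkpoint and graph have at least one variable in common.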

njzjz avatar May 11 '22 20:05 njzjz

I agree with you

wanghan-iapcm avatar May 12 '22 01:05 wanghan-iapcm