duobert DataLossError when loading checkpoints

Open Ahmedn1 opened this issue 4 years ago • 0 comments
I'm following the steps in the "Replicating our MS MARCO results with duoBERT" section, but I get the error below when the script tries to load the checkpoint:
(doc_env) python run_duobert_msmarco.py \
  --data_dir=/home/ahmedn1/doc2query/data/tfrecords/ \
  --bert_config_file=/home/ahmedn1/doc2query/data/model/bert_config.json \
  --output_dir=/home/ahmedn1/doc2query/model2/ \
  --init_checkpoint=/home/ahmedn1/doc2query/data/model2/bert-large-msmarco-pretrained-only/model.ckpt-100000.data-00000-of-00001 \
  --max_seq_length=512 \
  --do_train=False \
  --do_eval=True \
  --eval_batch_size=128 \
  --num_eval_docs=30 \
  --use_tpu=False
/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f2502a41598>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_model_dir': '/home/ahmedn1/doc2query/model2/', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f24ffd87710>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
INFO:tensorflow:***** Running evaluation *****
INFO:tensorflow:    Batch size = 128
2020-03-20 21:24:41.500510: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open /home/ahmedn1/doc2query/data/model2/bert-large-msmarco-pretrained-only/model.ckpt-100000.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
  File "run_duobert_msmarco.py", line 496, in <module>
    tf.app.run()
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "run_duobert_msmarco.py", line 423, in main
    for item in result:
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2437, in predict
    rendezvous.raise_errors()
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2431, in predict
    yield_single_examples=yield_single_examples):
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 551, in predict
    features, None, model_fn_lib.ModeKeys.PREDICT, self.config)
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2186, in _call_model_fn
    features, labels, mode, config)
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1169, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2470, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1250, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1524, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "run_duobert_msmarco.py", line 188, in model_fn
    tvars, init_checkpoint)
  File "/home/ahmedn1/doc2query/data/duobert/modeling.py", line 331, in get_assignment_map_from_checkpoint
    init_vars = tf.train.list_variables(init_checkpoint)
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 95, in list_variables
    reader = load_checkpoint(ckpt_dir_or_file)
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 64, in load_checkpoint
    return pywrap_tensorflow.NewCheckpointReader(filename)
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 316, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/home/ahmedn1/doc2query/doc_env/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 526, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file /home/ahmedn1/doc2query/data/model2/bert-large-msmarco-pretrained-only/model.ckpt-100000.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
I tried using TensorFlow 1.11, 1.13.1, and 1.15.
Mar 20 '20 21:03 Ahmedn1
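
A possible cause, offered as an assumption rather than a confirmed diagnosis: the traceback ends in tf.train.list_variables, which expects a checkpoint directory or a checkpoint prefix (here that would be model.ckpt-100000), while --init_checkpoint above names the .data-00000-of-00001 shard itself. The minimal sketch below shows one way to derive the prefix from such a path. The helper name checkpoint_prefix is hypothetical and is not part of duoBERT or TensorFlow.

```python
import re

# Suffixes of the files TensorFlow 1.x typically writes next to a checkpoint
# prefix: the data shards, the index, and the meta graph.
_CHECKPOINT_SUFFIX = re.compile(r"\.(data-\d{5}-of-\d{5}|index|meta)$")


def checkpoint_prefix(path):
    """Return the checkpoint prefix for a path that may name one of its files.

    Hypothetical helper for illustration only. A path that is already a
    prefix is returned unchanged.
    """
    return _CHECKPOINT_SUFFIX.sub("", path)


if __name__ == "__main__":
    shard = (
        "/home/ahmedn1/doc2query/data/model2/"
        "bert-large-msmarco-pretrained-only/"
        "model.ckpt-100000.data-00000-of-00001"
    )
    # Prints the path with the shard suffix removed, ending in model.ckpt-100000.
    print(checkpoint_prefix(shard))
```

If this assumption holds, passing the printed prefix as --init_checkpoint would be the thing to try. The .index file (and any other shards) would also need to sit in the same directory as the data file.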