Dev branch fails with TF r1.4

Open David-Levinthal opened this issue 6 years ago • 25 comments

Environment: Ubuntu 16.04, Python 3.5, CUDA 9.0, cuDNN 7.0, NCCL 2. I followed the installation and all was fine, but python3 -m basic.cli --mode train --noload --debug fails. It fails first on _linear; the solution, per issue 41, is

from tensorflow.contrib.rnn.python.ops.rnn_cell import _Linear

Then there is a problem with flags: R1.4 handles argument parsing differently, and adding components to config causes errors in the flag parsing because they are not declared as flags. I then changed the call in nn.py, but then I hit:

  File "/home/levinth/bi-att-flow/basic/cli.py", line 112, in <module>
    tf.app.run()
  File "/home/levinth/tf_r1.4_c9_mpi_py3/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "/home/levinth/bi-att-flow/basic/cli.py", line 107, in main
    config.out_dir = os.path.join(config.out_base_dir, config.model_name, str(config.run_id).zfill(2))
  File "/home/levinth/tf_r1.4_c9_mpi_py3/tensorflow/python/platform/flags.py", line 88, in __setattr__
    return self.__dict__['__wrapped'].__setattr__(name, value)
  File "/home/levinth/tf_r1.4_c9_mpi_py3/absl/flags/_flagvalues.py", line 496, in __setattr__
    return self._set_unknown_flag(name, value)
  File "/home/levinth/tf_r1.4_c9_mpi_py3/absl/flags/_flagvalues.py", line 374, in _set_unknown_flag
    raise _exceptions.UnrecognizedFlagError(name, value)
absl.flags._exceptions.UnrecognizedFlagError: Unknown command line flag 'out_dir'

Ultimately I added a lot of flags, but was not sure what to do when a matrix was added to config. After adding all these:

# flags added for R1.4?
flags.DEFINE_string("out_dir", "out", "out dir [out]")
flags.DEFINE_string("save_dir", "save", "save dir [save]")
flags.DEFINE_string("log_dir", "log", "log dir [log]")
flags.DEFINE_string("eval_dir", "eval", "eval dir [eval]")
flags.DEFINE_string("answer_dir", "answer", "answer dir [answer]")
flags.DEFINE_integer("max_num_sents", 0, "max_num_sents")
flags.DEFINE_integer("max_sent_size", 0, "max_sent_size")
flags.DEFINE_integer("max_ques_size", 0, "max_ques_size")
flags.DEFINE_integer("max_word_size", 0, "max_word_size")
flags.DEFINE_integer("max_para_size", 0, "max_para_size")
flags.DEFINE_integer("char_vocab_size", 0, "char_vocab_size")
flags.DEFINE_integer("word_emb_size", 0, "word_emb_size")
flags.DEFINE_integer("word_vocab_size", 0, "word_vocab_size")

I hit the issue that emb_mat is not a flag:

emb_mat = np.array([idx2vec_dict[idx] if idx in idx2vec_dict
                    else np.random.multivariate_normal(np.zeros(config.word_emb_size), np.eye(config.word_emb_size))
                    for idx in range(config.word_vocab_size)])
config.emb_mat = emb_mat
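
One possible way around this, sketched under assumptions rather than taken from the repository: copy the parsed flag values into a plain namespace object and attach derived values (out_dir, emb_mat, and so on) to that object instead of to FLAGS. The helper name to_mutable_config and the sample values below are hypothetical. With absl, the input dict could presumably come from FLAGS.flag_values_dict(), but that has not been checked against R1.4.

```python
import os
from types import SimpleNamespace


def to_mutable_config(flag_values):
    """Copy parsed flag values (a plain dict) into a namespace that accepts new attributes."""
    return SimpleNamespace(**flag_values)


# Hypothetical stand-in for the parsed flags; the names mirror those used in basic/cli.py.
parsed = {"out_base_dir": "out", "model_name": "basic", "run_id": 0, "word_emb_size": 3}
config = to_mutable_config(parsed)

# Derived values can now be attached without being declared as flags first.
config.out_dir = os.path.join(config.out_base_dir, config.model_name,
                              str(config.run_id).zfill(2))

# A non-scalar value such as an embedding matrix is just another attribute.
# A list of lists is used here as a placeholder for the NumPy array.
config.emb_mat = [[0.0] * config.word_emb_size for _ in range(2)]
```

The design choice is to keep FLAGS read-only after parsing and treat the namespace as the mutable run configuration, which avoids declaring a flag for every derived value.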

David-Levinthal avatar Jan 05 '18 20:01 David-Levinthal

note: R1.4 is required for cuda9, cuda9.1 etc...

David-Levinthal avatar Jan 05 '18 20:01 David-Levinthal

Has there been any update on this issue? I'm trying to train this (multi-GPU) on a DGX-1 (8xV100), which requires CUDA 9, which in turn requires TF 1.4.

tanmayb123 avatar Mar 09 '18 00:03 tanmayb123

Hi, hoping to get a solution for this issue. Thanks

harirajeev avatar Apr 05 '18 14:04 harirajeev

Hi, +1 with this issue ... no resolution?

ewagner70 avatar Apr 07 '18 15:04 ewagner70

Same issue here as well.

ioana-blue avatar Apr 11 '18 20:04 ioana-blue

Got it to run on TF 1.7 and Python 3.6, in case someone finds it helpful: https://github.com/allenai/bi-att-flow/pull/89 (dev branch pull request)

klintan avatar Apr 16 '18 18:04 klintan

Andreas, when you ran this on R1.7, what did you do about the _linear imported into my/tensorflow/nn.py? There are 2 versions in tensorflow/tensorflow/contrib/rnn/python/ops/rnn_cell.py and one in tensorflow/tensorflow/contrib/rnn/python/ops/core_rnn_cell.py, and none in any file in tensorflow/tensorflow/python/ops/. Or did I miss something?

David-Levinthal avatar Apr 17 '18 17:04 David-Levinthal

If I understand your question: no, TF does not expose the _linear function anymore, see https://github.com/tensorflow/tensorflow/issues/561 .

I actually added the linear functions and the appropriate imports into my/tensorflow/nn.py, but you are right, _linear is still available in TF 1.7 (in contrib): https://github.com/tensorflow/tensorflow/blob/1d76d3e7c7eecddee960c20c9896ccc43d7ccd5c/tensorflow/contrib/rnn/python/ops/core_rnn_cell.py so I guess a better solution is probably to just import that instead :)
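
A rough sketch of how such an import could be made tolerant of different TF releases: try each location mentioned in this thread in turn and use the first one that resolves. The helper first_available and the LINEAR_CANDIDATES list are hypothetical names, and which location actually exists in a given release is an assumption that needs checking.

```python
import importlib


def first_available(candidates):
    """Return the first attribute that can be imported from a list of (module, name) pairs."""
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module is missing in this installation, try the next one
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    raise ImportError("none of the candidates could be imported: %r" % (candidates,))


# Locations mentioned in this thread; which one exists depends on the TF release.
LINEAR_CANDIDATES = [
    ("tensorflow.contrib.rnn.python.ops.core_rnn_cell", "_linear"),
    ("tensorflow.contrib.rnn.python.ops.rnn_cell", "_Linear"),
]

# Usage (requires TensorFlow to be installed):
#   linear = first_available(LINEAR_CANDIDATES)
```

Note that _linear and _Linear are not interchangeable: the names suggest a function and a class respectively, so the call site in nn.py would still need to match whichever one is found.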

klintan avatar Apr 17 '18 18:04 klintan

is that the right one?

David-Levinthal avatar Apr 17 '18 19:04 David-Levinthal

Using the _linear in core_rnn_cell gets the imports to work, but then I hit the original error in https://github.com/allenai/bi-att-flow/issues/69, caused by

def main(_):
    config = flags.FLAGS

    config.out_dir = os.path.join(config.out_base_dir, config.model_name,
                                  str(config.run_id).zfill(2))

which is upset that out_dir is not defined in flags. How did you work around that issue? d

David-Levinthal avatar Apr 17 '18 19:04 David-Levinthal

Disclaimer: I'm just a happy hacker. However, looking at the code now, the contrib package only contains one implementation of _linear, in core_rnn_cell.py, so there is probably just one.

As to the "FLAGS" issue, all my changes are reflected here: https://github.com/allenai/bi-att-flow/pull/89/files

In summary, I added all the missing flag definitions that you initially did, and also added flags.DEFINE_integer("emb_mat", 0, "embedding matrix"). However, the default value 0 for emb_mat is probably not correct :/ so I'm still hoping someone could comment on the value for that.

I trained on a train and dev set I created myself and got around 80% F1 score which seems plausible so it doesn't seem to mess up anything major. I will try to do a training on the SQuAD set as well for comparison.

klintan avatar Apr 17 '18 20:04 klintan

This is one of the few good public question-answering code bases, so it is an important component for the benchmark suite a lot of people (myself included) are discussing. Getting something that can be easily downloaded and run out of the box is extremely important. I wonder if it might be a good idea for you to fork the dev branch and apply your patches, so we have something that can be run?

David-Levinthal avatar Apr 17 '18 20:04 David-Levinthal

yeah, I will try to run it this week and compare to the old repo results.

I forked and committed here https://github.com/klintan/bi-att-flow/tree/dev , let me know if you try it out and if you have any problems.

klintan avatar Apr 17 '18 20:04 klintan

Hey, are you on LinkedIn, or is there some other way to email you? Are you interested in helping define a DNN benchmark suite? d

David-Levinthal avatar Apr 17 '18 20:04 David-Levinthal

Andreas, one more question. Using your forked distro out of the box, python -m basic.cli fails because it looks for out/basic/00/shared.json, which does not exist. On the other hand, python -m basic.cli --mode train --noload --debug seems to run:

num params: 2695851
2018-04-17 14:58:16.177176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:04:00.0
totalMemory: 15.77GiB freeMemory: 15.35GiB
2018-04-17 14:58:16.177214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-17 14:58:16.420130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-17 14:58:16.420176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-04-17 14:58:16.420182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-04-17 14:58:16.420490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14868 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
100%|█ | 2/2 [00:28<00:00, 14.66s/it]

Where was shared.json supposed to come from? d

David-Levinthal avatar Apr 17 '18 22:04 David-Levinthal

Sure! https://www.linkedin.com/in/andreas-klintberg-b7655710/ feel free to send me some more info :)

python -m basic.cli --mode train --noload --debug just does a "dry-run" of the training.

python -m basic.cli --mode train --noload --len_opt --cluster to train it, took about 20 hours on my dataset on a TitanX (Maxwell)

to answer your question: python -m basic.cli is for testing, so I guess shared.json is created as part of the training (which needs to be done before the testing)

klintan avatar Apr 17 '18 22:04 klintan

Actually, I figured out the python3 -m basic.cli thing: running either of the --mode train commands creates the file, i.e. the instructions are in the wrong order :-)
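
Since test mode depends on a file that only a training run writes, a small pre-flight check can make the ordering explicit. This is only a sketch: the directory layout is inferred from the out/basic/00/shared.json path and the out_dir expression quoted earlier in this thread, and the function names are hypothetical.

```python
import os


def shared_json_path(out_base_dir="out", model_name="basic", run_id=0):
    """Build the path where test mode looks for shared.json (layout inferred from this thread)."""
    out_dir = os.path.join(out_base_dir, model_name, str(run_id).zfill(2))
    return os.path.join(out_dir, "shared.json")


def ready_for_test(**kwargs):
    """Return True if a training run has already written shared.json."""
    return os.path.isfile(shared_json_path(**kwargs))


if not ready_for_test():
    print("shared.json not found; run one of the --mode train commands first")
```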

David-Levinthal avatar Apr 18 '18 15:04 David-Levinthal

hi, I have been getting this when i run your branch of the code: NotFoundError (see above for traceback): Tensor name "emb/char/conv/xx/conv1d_5/bias/ExponentialMovingAverage" not found in checkpoint files save/37/save

What can i do to fix it @klintan

vidhumalik avatar Apr 21 '18 18:04 vidhumalik

what branch are you running and what version of TF? the error you mention is part of the original error report

On Sat, Apr 21, 2018 at 11:10 AM, slothie26 wrote:

hi, I have been getting this when i run your branch of the code:
raise _exceptions.UnrecognizedFlagError(name, value)
absl.flags._exceptions.UnrecognizedFlagError: Unknown command line flag 'new_emb_mat'

What can i do to fix it

David-Levinthal avatar Apr 21 '18 19:04 David-Levinthal

@slothie26 it seems you edited your original question. However, I think this might happen if you try to use multiple GPUs, as mentioned here: https://github.com/allenai/bi-att-flow/issues/54 If you restrict it to one GPU, it should work.

klintan avatar Apr 21 '18 19:04 klintan

I was able to fix my original problem by including a flag definition line in the cli. But I again got the ExponentialMovingAverage error, which I can't seem to get rid of. I have tried so many versions of the project. Most recently I ran https://github.com/klintan/bi-att-flow/tree/dev using TensorFlow 1.7. I am not setting the number of GPUs; it is set to 1 by default. But the error still occurs. It would be great if you could help me with this. Sharing the complete error report:

Loading saved model from save/37/save
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Tensor name "emb/char/conv/xx/conv1d_5/bias/ExponentialMovingAverage" not found in checkpoint files save/37/save
  [[Node: save_1/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_1/Const_0_0, save_1/RestoreV2/tensor_names, save_1/RestoreV2/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/vmalik2/bi-att-flow-dev/basic/cli.py", line 129, in <module>
    tf.app.run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/Users/vmalik2/bi-att-flow-dev/basic/cli.py", line 126, in main
    m(config)
  File "/Users/vmalik2/bi-att-flow-dev/basic/main.py", line 29, in main
    _forward(config)
  File "/Users/vmalik2/bi-att-flow-dev/basic/main.py", line 200, in _forward
    graph_handler.initialize(sess)
  File "/Users/vmalik2/bi-att-flow-dev/basic/graph_handler.py", line 25, in initialize
    self._load(sess)
  File "/Users/vmalik2/bi-att-flow-dev/basic/graph_handler.py", line 54, in _load
    saver.restore(sess, save_path)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1775, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1140, in _run
    feed_dict_tensor, options, run_metadata)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    run_metadata)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Tensor name "emb/char/conv/xx/conv1d_5/bias/ExponentialMovingAverage" not found in checkpoint files save/37/save
  [[Node: save_1/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_1/Const_0_0, save_1/RestoreV2/tensor_names, save_1/RestoreV2/shape_and_slices)]]

Caused by op 'save_1/RestoreV2', defined at:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/vmalik2/bi-att-flow-dev/basic/cli.py", line 129, in <module>
    tf.app.run()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/Users/vmalik2/bi-att-flow-dev/basic/cli.py", line 126, in main
    m(config)
  File "/Users/vmalik2/bi-att-flow-dev/basic/main.py", line 29, in main
    _forward(config)
  File "/Users/vmalik2/bi-att-flow-dev/basic/main.py", line 200, in _forward
    graph_handler.initialize(sess)
  File "/Users/vmalik2/bi-att-flow-dev/basic/graph_handler.py", line 25, in initialize
    self._load(sess)
  File "/Users/vmalik2/bi-att-flow-dev/basic/graph_handler.py", line 42, in load
    saver = tf.train.Saver(vars, max_to_keep=config.max_to_keep)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1311, in __init__
    self.build()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1320, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1357, in _build
    build_save=build_save, build_restore=build_restore)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 809, in _build_internal
    restore_sequentially, reshape)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 448, in _AddRestoreOps
    restore_sequentially)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 860, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1458, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Tensor name "emb/char/conv/xx/conv1d_5/bias/ExponentialMovingAverage" not found in checkpoint files save/37/save
  [[Node: save_1/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_1/Const_0_0, save_1/RestoreV2/tensor_names, save_1/RestoreV2/shape_and_slices)]]

vidhumalik avatar Apr 22 '18 04:04 vidhumalik

Can it have something to do with the fact that when I print The tensors in checkpoint file, I get this: model_0/emb/char/conv/xx/conv1d_5/bias/ExponentialMovingAverage But, the trainable variable name is : emb/char/conv/xx/conv1d_5/bias/ExponentialMovingAverage
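
If the prefix mismatch is indeed the cause (an assumption, nobody in this thread has confirmed it), one option is to restore through an explicit name map from checkpoint names to graph names. The sketch below only does the name matching. build_restore_map is a hypothetical helper, and the model_0/ prefix is taken from the names quoted above.

```python
def build_restore_map(graph_names, checkpoint_names, prefix="model_0/"):
    """Map each checkpoint tensor name to the graph variable name it should restore into."""
    checkpoint_names = set(checkpoint_names)
    mapping, missing = {}, []
    for name in graph_names:
        if name in checkpoint_names:
            mapping[name] = name  # names already agree
        elif prefix + name in checkpoint_names:
            mapping[prefix + name] = name  # checkpoint carries the extra scope prefix
        else:
            missing.append(name)  # no counterpart in the checkpoint at all
    return mapping, missing


graph = ["emb/char/conv/xx/conv1d_5/bias/ExponentialMovingAverage"]
ckpt = ["model_0/emb/char/conv/xx/conv1d_5/bias/ExponentialMovingAverage"]
mapping, missing = build_restore_map(graph, ckpt)
print(mapping)
print(missing)
```

In TF 1.x, tf.train.Saver accepts a dict that maps checkpoint names to variable objects, so the values of this map would be replaced by the corresponding variables before building the saver. Any names left in missing point to a different problem than a scope prefix.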

vidhumalik avatar Apr 22 '18 05:04 vidhumalik

@slothie26 are you running this command: python -m basic.cli --mode train --noload --len_opt --cluster ?

edit: Looking further at the error message, it's using your CPU: _device="/job:localhost/replica:0/task:0/device:CPU:0". From the readme:

The model has ~2.5M parameters. The model was trained with NVidia Titan X (Pascal Architecture, 2016). The model requires at least 12GB of GPU RAM. If your GPU RAM is smaller than 12GB, you can either decrease batch size (performance might degrade), or you can use multi GPU (see below).

You need at least 12 GB of GPU memory to train it.

klintan avatar Apr 22 '18 05:04 klintan

No, I am trying to run the pretrained vectors. Running this command: basic/run_single.sh $HOME/data/squad/dev-v1.1.json single.json

vidhumalik avatar Apr 22 '18 05:04 vidhumalik

Are you running this with an MKL build of TF? https://github.com/allenai/bi-att-flow/issues/66 There seem to be a bunch of old issues about ExponentialMovingAverage associated with reading the checkpoint file. Both Andreas and I have been able to run this on single Nvidia GPUs without modification on R1.7 (I build TF from source with a checkout to force R1.7). What CUDA/cuDNN libraries and what OS are you using? This really should be a separate issue. d

David-Levinthal avatar Apr 22 '18 15:04 David-Levinthal