
Unable to run LM DenseLm128B8x8

Open chadwickleung opened this issue 4 years ago • 0 comments

Hi,

I'm interested in seeing the timing relationship between the layers and operations of the LM DenseLm128B8x8 model. I was using a v3-8 node and changed the hparams (num_device_per_split, mesh_shape) in Task(), and also reduced the number of training steps to save time.
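For context, here is a rough sketch of the consistency check I would expect those two hparams to satisfy. It assumes the product of mesh_shape has to equal the number of devices per split and that the split has to fit on the slice, which is how a GShard-style device mesh is usually laid out; I have not confirmed that Lingvo enforces exactly this. The helper `check_mesh`, its argument names, and the `[1, 8]` mesh are hypothetical and are not the exact values from my run.

```python
import math

V3_8_CORES = 8  # a v3-8 node exposes 8 TPU cores


def check_mesh(mesh_shape, num_devices_per_split, num_cores):
    """Hypothetical helper: True if the mesh covers one split and fits on the slice."""
    mesh_size = math.prod(mesh_shape)
    return mesh_size == num_devices_per_split and num_devices_per_split <= num_cores


# Assumption: the "8x8" in the model name refers to an 8x8 mesh (64 devices).
fits_original = check_mesh([8, 8], 64, V3_8_CORES)
# Hypothetical smaller mesh that uses all 8 cores of a v3-8.
fits_reduced = check_mesh([1, 8], 8, V3_8_CORES)

print("8x8 mesh fits on v3-8:", fits_original)
print("1x8 mesh fits on v3-8:", fits_reduced)
```

Under these assumptions the unmodified 8x8 mesh cannot fit on 8 cores, which is why both hparams have to change together.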

However, after initializing the vars, I got this error (here is the log):

I0810 20:39:44.510699 140688498882304 checkpointer.py:236] Initialized all vars.
I0810 20:39:44.513259 140688498882304 executor.py:400] Compiling 1 programs in parallel.
I0810 20:39:44.513607 140688056727296 base_runner.py:120] Init inputs TrainProgram
I0810 20:39:44.513836 140688056727296 base_runner.py:120] Init inputs TrainProgram done.
I0810 20:39:44.513968 140688056727296 base_runner.py:120] Compiling TrainProgram
I0810 20:41:20.058200 140688498882304 executor.py:422] Retrieve params.
I0810 20:41:20.058756 140688498882304 executor.py:424] Retrieve params done.
I0810 20:41:20.058939 140688498882304 checkpointer.py:181] Save checkpoint
2021-08-10 20:51:18.167557: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:159] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1628628678.167318634","description":"Error received from peer ipv4:10.70.9.74:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
Exception in thread SessionCloseThread:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/client/session.py", line 765, in close
    tf_session.TF_CloseSession(self._session)
tensorflow.python.framework.errors_impl.AbortedError: Session 0c4ba843fb27c733 is not found. Possibly, this master has restarted.
I0810 20:51:41.948596 140688498882304 base_runner.py:120] Job : Retrying as expected executor_tpu exception: Session 0c4ba843fb27c733 is not found.
I0810 20:51:42.952887 140688498882304 retry.py:62] Retry: caught exception: _RunLoop while running tensorflow.python.framework.errors_impl.AbortedError: Session 0c4ba843fb27c733 is not found. .
Call failed at (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
Traceback for above exception (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/main/bazel-out/k8-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 49, in Wrapper
    return func(*args, **kwargs)
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/main/bazel-out/k8-opt/bin/lingvo/trainer.runfiles/main/lingvo/base_runner.py", line 228, in _RunLoop
    loop_func(*loop_args)
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/main/bazel-out/k8-opt/bin/lingvo/trainer.runfiles/main/lingvo/executor.py", line 441, in _Loop
    RunSave(sess, global_step)
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/main/bazel-out/k8-opt/bin/lingvo/trainer.runfiles/main/lingvo/executor.py", line 430, in RunSave
    self.save_only_checkpointer.Save(sess, global_step)
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/main/bazel-out/k8-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/checkpointer.py", line 182, in Save
    path = self._saver.save(sess, self._save_path, gsteps)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/training/saver.py", line 1188, in save
    model_checkpoint_path = sess.run(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/client/session.py", line 967, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/client/session.py", line 1190, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/client/session.py", line 1368, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
Waiting for 1.53 seconds before retrying.
I0810 20:51:42.953114 140688498882304 base_runner.py:227] executor_tpu started.
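
To put a number on the stall in the log above, here is a small sketch using only the Python standard library. It measures the gap between the "Save checkpoint" line and the first "RPC failed" warning. The two timestamps are copied from the log; the year for the first one is an assumption taken from the 2021-08-10 warning line, and I'm also assuming both lines use the same clock. The variable names are just for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# "Save checkpoint" line (I0810 20:41:20.058939); year assumed to be 2021.
save_started = datetime.strptime("2021-08-10 20:41:20.058939", FMT)
# First "RPC failed ... Socket closed" warning.
rpc_failed = datetime.strptime("2021-08-10 20:51:18.167557", FMT)

gap_seconds = (rpc_failed - save_started).total_seconds()
print(f"Time between 'Save checkpoint' and the RPC failure: {gap_seconds:.1f} s")
```

So the checkpoint save sat for roughly ten minutes before the connection to the TPU worker dropped and the session was lost.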

chadwickleung, Aug 10 '21 21:08