
Error with GPipe

Open xsppp opened this issue 4 years ago • 0 comments

@bignamehyp

Hi, when I try to run the GPipe example one_billion_wds using the given command:

trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformerWPM --logdir=/tmp/lm/log --logtostderr --worker_split_size=4 --worker_gpus=4

I get the following error:

Traceback (most recent call last):
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1957, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1941, in main
    RunnerManager(FLAGS.model).Start()
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1937, in Start
    self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1666, in CreateRunners
    runner = self._CreateRunner(j, FLAGS.model_task_name, logdir, tf_master,
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1617, in _CreateRunner
    return self.Controller(cfg, *common_args)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 241, in __init__
    self._model.ConstructFPropBPropGraph()
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 1056, in ConstructFPropBPropGraph
    self._task.FPropDefaultTheta()
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 554, in FPropDefaultTheta
    return self.FProp(self.theta, input_batch)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 471, in FProp
    metrics, per_example = self._FPropSplitInputBatch(theta, input_batch)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 518, in _FPropSplitInputBatch
    metrics, per_example = self.FPropTower(theta_local, batch)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/lm/model.py", line 267, in FPropTower
    xent_output, _ = self.lm.FProp(theta.lm, ids, paddings, state0, labels)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/lm/layers.py", line 1308, in FProp
    per_example_xent, logits = self.stack.FProp(theta.stack, ids, paddings,
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers_with_gpipe.py", line 865, in FProp
    logits = super().FProp(theta, source_input, source_paddings, target_input,
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/gpipe.py", line 454, in FProp
    state_shapes = self._CalculateOutputShapes(input_shapes)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/gpipe.py", line 366, in _CalculateOutputShapes
    shapes = py_utils.Transform(_ToTShape, input_shapes)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/py_utils.py", line 810, in Transform
    return tf.nest.map_structure(fn, *v)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/nest.py", line 635, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/nest.py", line 635, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/gpipe.py", line 364, in _ToTShape
    return tshape.Shape(x.as_list())
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/tshape.py", line 43, in __init__
    assert x is not None, str(dims)
AssertionError: [1, None]

I have 4 Tesla V100-SXM2 GPUs. When I run the one_billion example without GPipe, it works. I don't know how to fix this problem; could you please give some advice? Thanks!

xsppp avatar Mar 24 '21 12:03 xsppp