mnist task fails on the GPU

Open al3chen opened this issue 6 years ago • 2 comments

I can run the mnist task on the CPU, but when I run it on the GPU it reports "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR".

I am using the latest docker and my GPU is an RTX 2070.

The command is:

bazel-out/k8-py2-opt/bin/lingvo/trainer --run_locally=gpu --mode=sync --model=image.mnist.LeNet5 --logdir=/tmp/mnist/log --logtostderr --enable_asserts=false 2>&1 | tee mnist.txt

The log is:

W0905 11:39:58.339700 139847289272064 meta_graph.py:448] Issue encountered when serializing __model_split_id_stack. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore. 'list' object has no attribute 'name'
I0905 11:39:58.457564 139847289272064 checkpointer.py:100] Save checkpoint done: /tmp/mnist/log/train/ckpt-00000000
2019-09-05 11:39:59.445700: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-09-05 11:39:59.690922: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-05 11:40:00.267407: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-09-05 11:40:00.269895: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E0905 11:40:00.310058 139847280879360 base_runner.py:212] trainer done (fatal error): <class 'tensorflow.python.framework.errors_impl.UnknownError'>
I0905 11:40:00.315107 139847280879360 base_runner.py:106] trainer exception: From /job:local/replica:0/task:0:
Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node fprop/lenet5/tower_0_0/conv0/convolution (defined at usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
.......

Original stack trace for u'fprop/lenet5/tower_0_0/conv0/convolution':
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1771, in
    tf.app.run(main)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1767, in main
    RunnerManager(FLAGS.model).Start()
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1760, in Start
    self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1526, in CreateRunners
    trial)
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1483, in _CreateRunner
    return self.Trainer(cfg, *common_args)
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 392, in init
    self._model.ConstructFPropBPropGraph()
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 1201, in ConstructFPropBPropGraph
    self._task.FPropDefaultTheta()
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 545, in FPropDefaultTheta
    return self.FProp(self.theta, input_batch)
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 464, in FProp
    metrics, per_example = self._FPropSplitInputBatch(theta, input_batch)
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 510, in _FPropSplitInputBatch
    metrics, per_example = self.FPropTower(theta_local, batch)
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/tasks/image/classifier.py", line 165, in FPropTower
    act, _ = self.conv[i].FProp(theta.conv[i], act)
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/layers.py", line 603, in FProp
    out = self._Compute(theta, inputs, paddings, conv_padding)
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/layers.py", line 633, in _Compute
    out = self._ApplyConv(theta, inputs, bn_padding_expanded)
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/layers.py", line 534, in _ApplyConv
    out = ComputeRawConvolution(filter_w)
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/layers.py", line 525, in ComputeRawConvolution
    padding_algorithm=padding_algorithm)
  File "tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/layers.py", line 694, in _EvaluateConvKernel
    padding=padding_algorithm)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 898, in convolution
    name=name)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1009, in convolution_internal
    name=name)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1071, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 793, in _apply_op_helper
    op_def=op_def)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in init
    self._traceback = tf_stack.extract_stack()

al3chen, Sep 05 '19 11:09

This seems to be a common problem with RTX cards:

https://github.com/tensorflow/tensorflow/issues/24496

According to that thread, setting allow_growth=True in the session's GPU options fixes it.
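
A minimal sketch of two ways that setting might be applied, assuming a TensorFlow 1.x build that exposes tensorflow.compat.v1 and honors the TF_FORCE_GPU_ALLOW_GROWTH environment variable (both are assumptions about the docker image, not verified here). make_session_config is a hypothetical helper and not part of lingvo; how the lingvo trainer builds its own session config is not covered in this thread.

```python
import os

# Option 1 (assumption: the installed TensorFlow honors this variable).
# It must be set before TensorFlow initializes the GPU, so do it before
# importing tensorflow or export it in the shell that launches the trainer.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


def make_session_config():
    """Hypothetical helper: build a session config with allow_growth on.

    With allow_growth=True, TensorFlow allocates GPU memory on demand
    instead of reserving nearly all of it when the session starts.
    """
    # Imported lazily so the environment variable above is already set.
    import tensorflow.compat.v1 as tf

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    return config


# Usage in code that creates its own session (not run here):
#   import tensorflow.compat.v1 as tf
#   with tf.Session(config=make_session_config()) as sess:
#       ...
```

The environment variable route needs no code changes, which may be the easier thing to try first with a prebuilt trainer binary.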

jonathanasdf, Sep 05 '19 21:09

thanks

al3chen, Sep 06 '19 11:09