I can run the mnist task on CPU, but when I run it on GPU it reports "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR".
I use the latest docker image and my GPU is an RTX 2070.
The command is:
bazel-out/k8-py2-opt/bin/lingvo/trainer --run_locally=gpu --mode=sync --model=image.mnist.LeNet5 --logdir=/tmp/mnist/log --logtostderr --enable_asserts=false 2>&1 | tee mnist.txt
The log is:
W0905 11:39:58.339700 139847289272064 meta_graph.py:448] Issue encountered when serializing __model_split_id_stack.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'list' object has no attribute 'name'
I0905 11:39:58.457564 139847289272064 checkpointer.py:100] Save checkpoint done: /tmp/mnist/log/train/ckpt-00000000
2019-09-05 11:39:59.445700: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-09-05 11:39:59.690922: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-05 11:40:00.267407: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-09-05 11:40:00.269895: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E0905 11:40:00.310058 139847280879360 base_runner.py:212] trainer done (fatal error): <class 'tensorflow.python.framework.errors_impl.UnknownError'>
I0905 11:40:00.315107 139847280879360 base_runner.py:106] trainer exception: From /job:local/replica:0/task:0:
Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node fprop/lenet5/tower_0_0/conv0/convolution (defined at /usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
.......
Original stack trace for u'fprop/lenet5/tower_0_0/conv0/convolution':
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1771, in <module>
tf.app.run(main)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1767, in main
RunnerManager(FLAGS.model).Start()
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1760, in Start
self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1526, in CreateRunners
trial)
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1483, in _CreateRunner
return self.Trainer(cfg, *common_args)
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 392, in __init__
self._model.ConstructFPropBPropGraph()
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 1201, in ConstructFPropBPropGraph
self._task.FPropDefaultTheta()
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 545, in FPropDefaultTheta
return self.FProp(self.theta, input_batch)
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 464, in FProp
metrics, per_example = self._FPropSplitInputBatch(theta, input_batch)
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 510, in _FPropSplitInputBatch
metrics, per_example = self.FPropTower(theta_local, batch)
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/tasks/image/classifier.py", line 165, in FPropTower
act, _ = self.conv[i].FProp(theta.conv[i], act)
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/layers.py", line 603, in FProp
out = self._Compute(theta, inputs, paddings, conv_padding)
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/layers.py", line 633, in _Compute
out = self._ApplyConv(theta, inputs, bn_padding_expanded)
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/layers.py", line 534, in _ApplyConv
out = ComputeRawConvolution(filter_w)
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/layers.py", line 525, in ComputeRawConvolution
padding_algorithm=padding_algorithm)
File "/tmp/lingvo/bazel-out/k8-py2-opt/bin/lingvo/trainer.runfiles/main/lingvo/core/layers.py", line 694, in _EvaluateConvKernel
padding=padding_algorithm)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 898, in convolution
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1009, in convolution_internal
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 793, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
This seems to be a common problem with RTX cards:
https://github.com/tensorflow/tensorflow/issues/24496
According to that thread, setting allow_growth=True fixes it.
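Since the lingvo trainer builds its own session, the easiest way to apply that fix without patching code is probably the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable, which TensorFlow >= 1.14 reads at GPU initialization and which has the same effect as `gpu_options.allow_growth = True` in a `ConfigProto`. A minimal sketch:

```python
import os

# Must be set before TensorFlow initializes the GPU (i.e. before the first
# session/device creation). With allow-growth on, TF allocates GPU memory
# on demand instead of grabbing nearly all of it up front, which leaves
# cuDNN enough memory to create its handle on RTX cards.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```

Equivalently, you could export the variable in the shell before launching the trainer command above. Whether this is sufficient on every driver/cuDNN combination is not guaranteed, but it is the fix reported in the linked issue.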