kwyk
What could have led to CUDNN_STATUS_INTERNAL_ERROR?
It used to work on my laptop, but no longer does. I suspect an interaction with the GPU also being used as the actual graphics card, so that Xorg consumes too much memory (although the requested ~1.3GB is less than the ~2GB available free), or something like that.
nvidia-smi
$> nvidia-smi
Mon Nov 11 09:55:21 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro T2000 Off | 00000000:01:00.0 Off | N/A |
| N/A 43C P8 3W / N/A | 2297MiB / 3911MiB | 19% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 21824 G /usr/lib/xorg/Xorg 141MiB |
| 0 25467 G /usr/lib/xorg/Xorg 1670MiB |
| 0 25596 G /usr/bin/gnome-shell 180MiB |
| 0 27333 G ...uest-channel-token=14439694130078186709 232MiB |
| 0 28802 G /usr/lib/xorg/Xorg 6MiB |
| 0 28899 G /usr/bin/gnome-shell 5MiB |
+-----------------------------------------------------------------------------+
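To gauge how much of the used memory belongs to Xorg, the per-process figures in the table above can be summed. This is a minimal sketch that parses the process rows as pasted here (the regular expression assumes this whitespace-collapsed layout, not every nvidia-smi format); the names `PROCESS_ROWS`, `ROW_RE` and `memory_by_process` are made up for illustration.

```python
import re
from collections import defaultdict

# Process rows copied from the nvidia-smi output above.
PROCESS_ROWS = """\
| 0 21824 G /usr/lib/xorg/Xorg 141MiB |
| 0 25467 G /usr/lib/xorg/Xorg 1670MiB |
| 0 25596 G /usr/bin/gnome-shell 180MiB |
| 0 27333 G ...uest-channel-token=14439694130078186709 232MiB |
| 0 28802 G /usr/lib/xorg/Xorg 6MiB |
| 0 28899 G /usr/bin/gnome-shell 5MiB |
"""

# Columns: GPU index, PID, type, process name, memory in MiB.
ROW_RE = re.compile(r"^\|\s*(\d+)\s+(\d+)\s+(\S+)\s+(.+?)\s+(\d+)MiB\s*\|$")


def memory_by_process(text):
    """Return a dict mapping process name to total GPU memory in MiB."""
    totals = defaultdict(int)
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            totals[match.group(4)] += int(match.group(5))
    return dict(totals)


usage = memory_by_process(PROCESS_ROWS)
# Print the largest consumers first, then the overall total.
for name, mib in sorted(usage.items(), key=lambda item: -item[1]):
    print(f"{name}: {mib} MiB")
print("total:", sum(usage.values()), "MiB")
```

On these rows the three Xorg entries add up to 1817 MiB of the 2234 MiB listed per process, so Xorg is by far the largest consumer of GPU memory at the time of the snapshot.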
the actual run via singularity
$> singularity run -e -B /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.430.50 -B /usr/lib/x86_64-linux-gnu/libcuda.so.1 neuronets-kwyk--version-0.4-gpu.sing raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz out
Bayesian dropout functions have been loaded.
Your version: v0.4 Latest version: 0.4
++ Conforming volume to 1mm^3 voxels and size 256x256x256.
/opt/kwyk/freesurfer/bin/mri_convert: line 2: /opt/kwyk/freesurfer/sources.sh: No such file or directory
mri_convert.bin --conform raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz /tmp/tmpwtickiw9.nii.gz
$Id: mri_convert.c,v 1.226 2016/02/26 16:15:24 mreuter Exp $
reading from raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz...
TR=10.00, TE=0.00, TI=0.00, flip angle=0.00
i_ras = (0, -1, 0)
j_ras = (0, 0, 1)
k_ras = (1, 0, 0)
changing data type from float to uchar (noscale = 0)...
MRIchangeType: Building histogram
Reslicing using trilinear interpolation
writing to /tmp/tmpwtickiw9.nii.gz...
++ Running forward pass of model.
2019-11-11 14:57:43.820728: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-11 14:57:43.916219: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-11 14:57:43.916394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Quadro T2000 major: 7 minor: 5 memoryClockRate(GHz): 1.5
pciBusID: 0000:01:00.0
totalMemory: 3.82GiB freeMemory: 1.41GiB
2019-11-11 14:57:43.916409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-11-11 14:57:44.267550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-11 14:57:44.267570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-11-11 14:57:44.267575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-11-11 14:57:44.267684: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1246 MB memory) -> physical GPU (device: 0, name: Quadro T2000, pci bus id: 0000:01:00.0, compute capability: 7.5)
Normalizer being used <function zscore at 0x7fe98eac4ea0>
-5.8382284e-08
1.0000015
0/64 [..............................] - ETA: 0s2019-11-11 14:57:46.303925: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-11-11 14:57:46.314172: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node layer_1/conv3d/Conv3D}} = Conv3D[T=DT_FLOAT, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_85, layer_1/conv3d/kernel_m)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/kwyk", line 11, in <module>
load_entry_point('kwyk', 'console_scripts', 'kwyk')()
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/opt/kwyk/kwyk/cli.py", line 92, in predict
normalizer=zscore)
File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 348, in predict_from_filepath
batch_size=batch_size)
File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 275, in predict_from_img
batch_size=batch_size)
File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 186, in predict_from_array
new_prediction = predictor( {'volume': features[j:j + batch_size]})
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/predictor.py", line 77, in __call__
return self._session.run(fetches=self.fetch_tensors, feed_dict=feed_dict)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node layer_1/conv3d/Conv3D (defined at /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/saved_model_predictor.py:153) = Conv3D[T=DT_FLOAT, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_85, layer_1/conv3d/kernel_m)]]
Caused by op 'layer_1/conv3d/Conv3D', defined at:
File "/usr/local/bin/kwyk", line 11, in <module>
load_entry_point('kwyk', 'console_scripts', 'kwyk')()
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/opt/kwyk/kwyk/cli.py", line 83, in predict
predictor = _get_predictor(savedmodel_path)
File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 406, in _get_predictor
predictor = tf.contrib.predictor.from_saved_model(str(path))
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/predictor_factories.py", line 153, in from_saved_model
config=config)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/saved_model_predictor.py", line 153, in __init__
loader.load(self._session, tags.split(','), export_dir)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/saved_model/loader_impl.py", line 197, in load
return loader.load(sess, tags, import_scope, **saver_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/saved_model/loader_impl.py", line 350, in load
**saver_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/saved_model/loader_impl.py", line 278, in load_graph
meta_graph_def, import_scope=import_scope, **saver_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1696, in _import_meta_graph_with_return_elements
**kwargs))
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/meta_graph.py", line 806, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
_ProcessNewOps(graph)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3440, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3440, in <listcomp>
for c_op in c_api_util.new_tf_operations(self)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3299, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node layer_1/conv3d/Conv3D (defined at /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/saved_model_predictor.py:153) = Conv3D[T=DT_FLOAT, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_85, layer_1/conv3d/kernel_m)]]
instead of this:
singularity run -e -B /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.430.50 \
-B /usr/lib/x86_64-linux-gnu/libcuda.so.1 neuronets-kwyk--version-0.4-gpu.sing \
raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz out
can you try:
singularity run -e --nv neuronets-kwyk--version-0.4-gpu.sing \
raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz out
With --nv it used to halt; now that there is a bit more free memory, it proceeds to the same crash.
I found http://tuxvoid.blogspot.com/2017/08/tensorflow-could-not-create-cudnn.html, referenced from
https://github.com/tensorflow/tensorflow/issues/14048, suggesting that configuring the TensorFlow session with allow_growth
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
might help, but I could not figure out where in kwyk or nobrainer to tune that.
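One candidate place is where nobrainer builds the predictor: the traceback above shows `_get_predictor` in nobrainer/predict.py calling `tf.contrib.predictor.from_saved_model(str(path))`, and `from_saved_model` accepts a `config` argument. The following is a minimal sketch of that idea, assuming TensorFlow 1.x as in the container; the function name `get_predictor_with_growth` is hypothetical, and this has not been tested against kwyk or nobrainer.

```python
def get_predictor_with_growth(savedmodel_path):
    """Load a SavedModel predictor whose session grows GPU memory on demand.

    Hypothetical stand-in for nobrainer's _get_predictor; assumes TensorFlow 1.x.
    """
    # Imported inside the function so this definition can be loaded
    # on a machine where TensorFlow is not installed.
    import tensorflow as tf

    # Ask TensorFlow to allocate GPU memory as needed rather than
    # reserving most of the free memory up front.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True

    # from_saved_model forwards `config` to the session it creates.
    return tf.contrib.predictor.from_saved_model(
        str(savedmodel_path), config=config
    )
```

Whether this avoids CUDNN_STATUS_INTERNAL_ERROR here is not established; it only shows where the option could be passed without touching TensorFlow itself.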