What could have led to CUDNN_STATUS_INTERNAL_ERROR?

Open yarikoptic opened this issue 5 years ago • 2 comments

It used to work on my laptop, but no longer does. I suspect the cause is that the GPU also serves as the actual graphics card, so Xorg consumes too much of its memory (although the requested ~1.3GB is less than the ~2GB available free), or something along those lines.

nvidia-smi
$> nvidia-smi
Mon Nov 11 09:55:21 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro T2000        Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   43C    P8     3W /  N/A |   2297MiB /  3911MiB |     19%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     21824      G   /usr/lib/xorg/Xorg                           141MiB |
|    0     25467      G   /usr/lib/xorg/Xorg                          1670MiB |
|    0     25596      G   /usr/bin/gnome-shell                         180MiB |
|    0     27333      G   ...uest-channel-token=14439694130078186709   232MiB |
|    0     28802      G   /usr/lib/xorg/Xorg                             6MiB |
|    0     28899      G   /usr/bin/gnome-shell                           5MiB |
+-----------------------------------------------------------------------------+
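The figures in the nvidia-smi table above can be tallied to see how much memory was free at that moment. A rough sketch in Python, with the process rows and the 2297MiB / 3911MiB totals copied from the output above; the regex parsing is an assumption about this particular table layout, not a general nvidia-smi parser:

```python
import re

# Process rows copied verbatim from the nvidia-smi output above.
NVIDIA_SMI_ROWS = """\
|    0     21824      G   /usr/lib/xorg/Xorg                           141MiB |
|    0     25467      G   /usr/lib/xorg/Xorg                          1670MiB |
|    0     25596      G   /usr/bin/gnome-shell                         180MiB |
|    0     27333      G   ...uest-channel-token=14439694130078186709   232MiB |
|    0     28802      G   /usr/lib/xorg/Xorg                             6MiB |
|    0     28899      G   /usr/bin/gnome-shell                           5MiB |
"""


def per_process_mib(table):
    """Return the per-process GPU memory figures (in MiB) found in the rows."""
    return [int(value) for value in re.findall(r"(\d+)MiB \|", table)]


# Memory attributed to the listed graphics processes.
used_by_processes = sum(per_process_mib(NVIDIA_SMI_ROWS))

# Totals from the "Memory-Usage" column: 2297MiB / 3911MiB.
total_mib, used_mib = 3911, 2297
free_mib = total_mib - used_mib

print(used_by_processes, free_mib)  # 2234 1614
```

By this count roughly 1.6GB was free when nvidia-smi ran, with a single Xorg process holding 1670MiB.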
the actual run via singularity
$> singularity run -e -B /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.430.50 -B /usr/lib/x86_64-linux-gnu/libcuda.so.1 neuronets-kwyk--version-0.4-gpu.sing raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz out
Bayesian dropout functions have been loaded.
Your version: v0.4 Latest version: 0.4
++ Conforming volume to 1mm^3 voxels and size 256x256x256.
/opt/kwyk/freesurfer/bin/mri_convert: line 2: /opt/kwyk/freesurfer/sources.sh: No such file or directory
mri_convert.bin --conform raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz /tmp/tmpwtickiw9.nii.gz 
$Id: mri_convert.c,v 1.226 2016/02/26 16:15:24 mreuter Exp $
reading from raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz...
TR=10.00, TE=0.00, TI=0.00, flip angle=0.00
i_ras = (0, -1, 0)
j_ras = (0, 0, 1)
k_ras = (1, 0, 0)
changing data type from float to uchar (noscale = 0)...
MRIchangeType: Building histogram 
Reslicing using trilinear interpolation 
writing to /tmp/tmpwtickiw9.nii.gz...
++ Running forward pass of model.
2019-11-11 14:57:43.820728: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-11 14:57:43.916219: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-11 14:57:43.916394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Quadro T2000 major: 7 minor: 5 memoryClockRate(GHz): 1.5
pciBusID: 0000:01:00.0
totalMemory: 3.82GiB freeMemory: 1.41GiB
2019-11-11 14:57:43.916409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-11-11 14:57:44.267550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-11 14:57:44.267570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-11-11 14:57:44.267575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-11-11 14:57:44.267684: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1246 MB memory) -> physical GPU (device: 0, name: Quadro T2000, pci bus id: 0000:01:00.0, compute capability: 7.5)
Normalizer being used <function zscore at 0x7fe98eac4ea0>
-5.8382284e-08
1.0000015
 0/64 [..............................] - ETA: 0s2019-11-11 14:57:46.303925: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-11-11 14:57:46.314172: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node layer_1/conv3d/Conv3D}} = Conv3D[T=DT_FLOAT, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_85, layer_1/conv3d/kernel_m)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/kwyk", line 11, in <module>
    load_entry_point('kwyk', 'console_scripts', 'kwyk')()
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/opt/kwyk/kwyk/cli.py", line 92, in predict
    normalizer=zscore)
  File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 348, in predict_from_filepath
    batch_size=batch_size)
  File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 275, in predict_from_img
    batch_size=batch_size)
  File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 186, in predict_from_array
    new_prediction = predictor( {'volume': features[j:j + batch_size]})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/predictor.py", line 77, in __call__
    return self._session.run(fetches=self.fetch_tensors, feed_dict=feed_dict)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node layer_1/conv3d/Conv3D (defined at /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/saved_model_predictor.py:153)  = Conv3D[T=DT_FLOAT, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_85, layer_1/conv3d/kernel_m)]]

Caused by op 'layer_1/conv3d/Conv3D', defined at:
  File "/usr/local/bin/kwyk", line 11, in <module>
    load_entry_point('kwyk', 'console_scripts', 'kwyk')()
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/opt/kwyk/kwyk/cli.py", line 83, in predict
    predictor = _get_predictor(savedmodel_path)
  File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 406, in _get_predictor
    predictor = tf.contrib.predictor.from_saved_model(str(path))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/predictor_factories.py", line 153, in from_saved_model
    config=config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/saved_model_predictor.py", line 153, in __init__
    loader.load(self._session, tags.split(','), export_dir)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/saved_model/loader_impl.py", line 197, in load
    return loader.load(sess, tags, import_scope, **saver_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/saved_model/loader_impl.py", line 350, in load
    **saver_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/saved_model/loader_impl.py", line 278, in load_graph
    meta_graph_def, import_scope=import_scope, **saver_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1696, in _import_meta_graph_with_return_elements
    **kwargs))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/meta_graph.py", line 806, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
    _ProcessNewOps(graph)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3440, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3440, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3299, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node layer_1/conv3d/Conv3D (defined at /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/saved_model_predictor.py:153)  = Conv3D[T=DT_FLOAT, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_85, layer_1/conv3d/kernel_m)]]

yarikoptic, Nov 11 '19 14:11

instead of this:

singularity run -e -B /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.430.50 \
-B /usr/lib/x86_64-linux-gnu/libcuda.so.1 neuronets-kwyk--version-0.4-gpu.sing \
raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz out

can you try:

singularity run -e --nv neuronets-kwyk--version-0.4-gpu.sing \
raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz out

satra, Nov 11 '19 17:11

With --nv it used to halt; now (with a bit more free memory) it proceeds to the same crash.

I found http://tuxvoid.blogspot.com/2017/08/tensorflow-could-not-create-cudnn.html, referenced from https://github.com/tensorflow/tensorflow/issues/14048, suggesting that instructing TensorFlow to enable allow_growth

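import tensorflow as tf

# Allocate GPU memory on demand instead of reserving nearly all of it up front (TF 1.x API)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)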

might help, but I could not figure out where in kwyk or nobrainer to tune that.
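Judging only from the traceback above, one candidate spot is the tf.contrib.predictor.from_saved_model call inside nobrainer's _get_predictor, since the frames show a config= argument being forwarded from there. A minimal sketch, assuming TensorFlow 1.x with tf.contrib available; load_predictor_with_growth is a hypothetical helper, not part of kwyk or nobrainer, and has not been tried against this container:

```python
def load_predictor_with_growth(savedmodel_path):
    """Load a SavedModel predictor whose session grows GPU memory on demand.

    Hypothetical helper: it mirrors the from_saved_model call seen in the
    traceback, adding a session config with allow_growth enabled.
    """
    # Imported lazily so this module can be imported without TensorFlow installed.
    import tensorflow as tf  # assumes TF 1.x, where tf.contrib exists

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    return tf.contrib.predictor.from_saved_model(
        str(savedmodel_path), config=config
    )


# Example (the path is a placeholder):
# predictor = load_predictor_with_growth("/path/to/savedmodel")
# prediction = predictor({"volume": features})
```

If this is the right hook, nobrainer's _get_predictor would need to forward such a config itself; whether that resolves the cuDNN error here is not established.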

yarikoptic avatar Nov 11 '19 20:11 yarikoptic