"Unknown: Failed to get convolution algorithm." error and how to solve it
I spent yesterday afternoon installing n2v on a new machine (Ryzen 5, RTX 2060, Ubuntu 20.04, conda) and ran into tensorflow-related issues with n2v. The error occurs when running training in any of the example notebooks. The error message was cuDNN related (see full traceback below), so I suspected a library version problem. I tried various versions of tensorflow-gpu, such as 1.14 and 1.15, installed with pip or from conda using the anaconda and conda-forge channels. I also tried various versions of the CUDA toolkit and Python 3.6 and 3.7, all without success.
While there was enough GPU VRAM available, it turns out that this is related to GPU memory management in tensorflow. Setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true fixed the issue. This is not specific to n2v; in fact, I found the answer in a thread related to DeepLabCut (https://forum.image.sc/t/could-not-create-cudnn-handle/24276/17).
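For notebook users, the same variable can probably be set from Python instead of the shell. This is a minimal sketch, assuming the cell runs before tensorflow (or anything that imports it, such as n2v) is loaded in the kernel; setting it after the GPU has already been initialized is unlikely to have any effect.

```python
import os

# Ask tensorflow to grow its GPU allocation on demand instead of
# reserving (almost) all GPU memory up front.
# Assumption: this runs before tensorflow is imported in this process.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# Import tensorflow / n2v only after this point, e.g.:
# from n2v.models import N2V
```

In a notebook this would go in the very first cell; after changing it, restart the kernel so the setting is picked up.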
I am putting this here so that others who run into the issue can find it. I am not sure how common it is (I did not encounter the issue when installing n2v on Windows) and whether it warrants mentioning in the README.md file.
Preparing validation data: 0%| | 0/544 [00:00<?, ?it/s]
8 blind-spots will be generated per training patch of size (64, 64).
Preparing validation data: 100%|██████████| 544/544 [00:00<00:00, 1512.18it/s]
WARNING:tensorflow:From /home/volker/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/backend/tensorflow_backend.py:1033: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.
WARNING:tensorflow:From /home/volker/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/backend/tensorflow_backend.py:1020: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
WARNING:tensorflow:From /home/volker/miniconda3/envs/n2v/lib/python3.7/site-packages/csbdeep-0.5.2-py3.7.egg/csbdeep/utils/tf.py:245: The name tf.summary.image is deprecated. Please use tf.compat.v1.summary.image instead.
WARNING:tensorflow:From /home/volker/miniconda3/envs/n2v/lib/python3.7/site-packages/csbdeep-0.5.2-py3.7.egg/csbdeep/utils/tf.py:273: The name tf.summary.merge is deprecated. Please use tf.compat.v1.summary.merge instead.
WARNING:tensorflow:From /home/volker/miniconda3/envs/n2v/lib/python3.7/site-packages/csbdeep-0.5.2-py3.7.egg/csbdeep/utils/tf.py:280: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
Epoch 1/10
UnknownError Traceback (most recent call last)
<ipython-input-10-147763b6fb69> in <module>
1 # We are ready to start training now.
----> 2 history = model.train(X, X_val)
~/miniconda3/envs/n2v/lib/python3.7/site-packages/n2v-0.2.1-py3.7.egg/n2v/models/n2v_standard.py in train(self, X, validation_X, epochs, steps_per_epoch)
238 history = self.keras_model.fit_generator(generator=training_data, validation_data=(validation_X, validation_Y),
239 epochs=epochs, steps_per_epoch=steps_per_epoch,
--> 240 callbacks=self.callbacks, verbose=1)
242 if self.basedir is not None:
~/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
89 warnings.warn('Update your `' + object_name + '` call to the ' +
90 'Keras 2 API: ' + signature, stacklevel=2)
---> 91 return func(*args, **kwargs)
92 wrapper._original_function = func
93 return wrapper
~/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
1656 use_multiprocessing=use_multiprocessing,
1657 shuffle=shuffle,
-> 1658 initial_epoch=initial_epoch)
1660 @interfaces.legacy_generator_methods_support
~/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
213 outs = model.train_on_batch(x, y,
214 sample_weight=sample_weight,
--> 215 class_weight=class_weight)
217 outs = to_list(outs)
~/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight)
1447 ins = x + y + sample_weights
1448 self._make_train_function()
-> 1449 outputs = self.train_function(ins)
1450 return unpack_singleton(outputs)
~/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/backend/tensorflow_backend.py in __call__(self, inputs)
2977 return self._legacy_call(inputs)
-> 2979 return self._call(inputs)
2980 else:
2981 if py_any(is_tensor(x) for x in inputs):
~/miniconda3/envs/n2v/lib/python3.7/site-packages/Keras-2.2.5-py3.7.egg/keras/backend/tensorflow_backend.py in _call(self, inputs)
2935 fetched = self._callable_fn(*array_vals, run_metadata=self.run_metadata)
2936 else:
-> 2937 fetched = self._callable_fn(*array_vals)
2938 return fetched[:len(self.outputs)]
~/miniconda3/envs/n2v/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in __call__(self, *args, **kwargs)
1470 ret = tf_session.TF_SessionRunCallable(self._session._session,
1471 self._handle, args,
-> 1472 run_metadata_ptr)
1473 if run_metadata:
1474 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node channel_0down_level_0_no_0/convolution}}]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node channel_0down_level_0_no_0/convolution}}]]
0 successful operations.
0 derived errors ignored.
ooh that's good to know! @VolkerH I always get this issue when I run the script on a computer with 2GB GPU while Google Colab has no issues. The error message doesn't appear when I just run the script again without doing anything.
Would you mind explaining what TF_FORCE_GPU_ALLOW_GROWTH=true does? I also have a GPU memory issue such that I have to restart the kernel between training and prediction. I am hoping your solution fixes the above issue in my case as well.
Are you able to run N2V training and prediction in one run without having to restart the kernel? I am wondering if there is a way to clear GPU memory programmatically after training is complete.
There is some background to how the environment variable changes the behaviour here:
https://www.tensorflow.org/guide/gpu under the subheading "Limiting GPU memory growth".
I cannot both have the training and prediction notebook running at the same time due to memory limitations. However, if you still have the model in memory (after training) rather than trying to create a new additional model I believe this could fix it.
My understanding is that without this option, tensorflow grabs pretty much all available GPU memory initially. If you then need to allocate even a tiny bit more, it will fail. I don't fully understand what other parameters affect this. As I mentioned, on a different machine (Windows) with less GPU VRAM I do not need to set any option like that and I do not see the error.
Hi @VolkerH,
Thank you for reporting this issue here as well. I am wondering if there might be an issue with the compatibility between keras, tensorflow, CUDA, and nvidia-driver versions. I will put a note into the README.md with your proposed fix.
@citypalmtree interesting that your issue disappears if you rerun the notebook without changes. Do you restart the kernel between runs?
Regarding training- and prediction-notebooks: It is like @VolkerH says, as long as you stay in the same kernel/notebook you can run training and prediction sequentially. But unfortunately tensorflow allocates all available GPU memory (if not told otherwise), which means that a second kernel/notebook will find no GPU memory left and fail.
One way to limit GPU memory would be this:
from csbdeep.utils.tf import limit_gpu_memory
limit_gpu_memory(fraction=0.5)
This would only allocate 50% of the GPU memory.
Doing this would allow you to train two models in parallel on one GPU, but most likely it will take much longer. First, you can only use half of the data, and second, data for two independent models has to be transferred to the GPU, which could lead to inefficient processing.
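On the earlier question of freeing GPU memory programmatically after training: one general workaround, not something n2v or csbdeep provides, is to run each stage in its own short-lived Python process, since GPU memory held by a process is released when that process exits. The sketch below only shows the process handling; `run_in_fresh_process` is a hypothetical helper name, and the source string is a placeholder for real training code.

```python
import subprocess
import sys


def run_in_fresh_process(source: str) -> str:
    """Run Python source in a new interpreter and return its stdout.

    Any GPU memory the child process allocates is freed when it exits.
    """
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


# Placeholder source; in practice this would load data and call model.train(...)
output = run_in_fresh_process("print('training finished')")
print(output.strip())
```

The trade-off is that the trained model has to be saved to disk by the child process and loaded again for prediction, rather than being kept in memory.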