Error message: CUBLAS_STATUS_NOT_INITIALIZED
nvcc --version yields
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
I installed jax with
pip install --upgrade pip
pip install --upgrade jax[cuda111] -f https://storage.googleapis.com/jax-releases/jax_releases.html
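(As a sanity check, not part of the failing program below, something like the following can confirm whether jaxlib sees the GPU at all; just a minimal probe:)

import jax

# Expect a GPU device here, e.g. [GpuDevice(id=0)]; an empty list or a CPU-only
# device would point at an installation problem rather than a runtime one.
print(jax.devices())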
My code is
import jax.numpy as jnp
from jax import grad, jit, vmap
from jax import random
from time import time

key = random.PRNGKey(0)

def apply_matrix(v):
    global key
    mat = random.normal(key, (150, 100))
    return jnp.dot(mat, v)

def batched_apply_matrix(v_batched):
    return vmap(apply_matrix)(v_batched)

v_batched = random.normal(key, (10, 100))
batched_apply_matrix(v_batched).block_until_ready()
The program output is
2021-06-26 17:01:46.891583: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-06-26 17:01:46.891610: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:113] Check failed: stream->parent()->GetBlasGemmAlgorithms(&algorithms)
Aborted (core dumped)
@noanabeshima I am experiencing the same issue. The code runs fine, though, when I paste it into a Google Colab and connect to a local runtime. I am not sure why there is a difference.
Can you set the environment variable TF_CPP_MIN_LOG_LEVEL=0, rerun, and send us the full output?
Also can you share the output of nvidia-smi?
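If it's more convenient, the variable can also be set from inside Python before jax is imported; a minimal sketch (any small GPU computation will do, the only requirement is that the variable is set before the first jax import):

import os

# Has to be set before jax/jaxlib load the native libraries, or it has no effect.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"

import jax.numpy as jnp

# Force the GPU backend to initialize so the verbose logs are emitted.
print(jnp.dot(jnp.ones((4, 4)), jnp.ones((4, 4))).block_until_ready())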
Have you solved this problem? I am hitting the same issue.
2021-07-03 06:59:52.537404: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-03 06:59:54.543008: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-03 06:59:54.544110: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-07-03 06:59:54.576691: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-03 06:59:54.577708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:08.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2021-07-03 06:59:54.577847: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-03 06:59:54.582512: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-03 06:59:54.582599: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-07-03 06:59:54.585051: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-07-03 06:59:54.585364: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-07-03 06:59:54.585940: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-07-03 06:59:54.586916: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-07-03 06:59:54.587069: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-07-03 06:59:54.587181: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-03 06:59:54.588232: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-03 06:59:54.589174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
classifier: token
hidden_size: 1024
name: ViT-L_16
patches:
  size: !!python/tuple
  - 16
  - 16
representation_size: null
transformer:
  attention_dropout_rate: 0.0
  dropout_rate: 0.1
  mlp_dim: 4096
  num_heads: 16
  num_layers: 24
2021-07-03 06:59:58.442563: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:171] XLA service 0x6527e20 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2021-07-03 06:59:58.442602: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:179] StreamExecutor device (0): Interpreter, <undefined>
2021-07-03 06:59:58.446140: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/tfrt_cpu_pjrt_client.cc:160] TfrtCpuClient created.
INFO:absl:Starting the local TPU driver.
INFO:absl:Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
2021-07-03 06:59:58.533758: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-03 06:59:58.534812: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:171] XLA service 0x64348a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-07-03 06:59:58.534847: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:179] StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2021-07-03 06:59:58.535193: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/gpu_device.cc:298] Using BFC allocator.
2021-07-03 06:59:58.535261: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/gpu_device.cc:257] XLA backend allocating 30352696934 bytes on device 0 for BFCAllocator.
2021-07-03 06:59:58.535727: I external/org_tensorflow/tensorflow/stream_executor/tpu/tpu_platform_interface.cc:74] No TPU platform found.
INFO:absl:Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
2021-07-03 06:59:58.995298: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-07-03 06:59:58.995843: W external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:342] There was an error before creating cudnn handle: cudaErrorInitializationError : initialization error
2021-07-03 06:59:58.999101: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2021-07-03 06:59:58.999161: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:382] Possibly insufficient driver version: 418.67.0
2021-07-03 06:59:58.999651: W external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:342] There was an error before creating cudnn handle: cudaErrorInitializationError : initialization error
2021-07-03 06:59:58.999675: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2021-07-03 06:59:58.999696: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:382] Possibly insufficient driver version: 418.67.0
2021-07-03 06:59:58.999746: W external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:342] There was an error before creating cudnn handle: cudaErrorInitializationError : initialization error
2021-07-03 06:59:58.999755: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2021-07-03 06:59:58.999768: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:382] Possibly insufficient driver version: 418.67.0
2021-07-03 06:59:59.000039: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:691] Failed to determine best cudnn convolution algorithm: Internal: All algorithms tried for %custom-call = (f32[1,24,24,1024]{2,1,3,0}, u8[0]{0}) custom-call(f32[1,384,384,3]{2,1,3,0} %copy.3, f32[16,16,3,1024]{1,0,2,3} %copy.4), window={size=16x16 stride=16x16}, dim_labels=b01f_01io->b01f, custom_call_target="__cudnn$convForward", metadata={op_type="conv_general_dilated" op_name="conv_general_dilated[ batch_group_count=1\n dimension_numbers=ConvDimensionNumbers(lhs_spec=(0, 3, 1, 2), rhs_spec=(3, 2, 0, 1), out_spec=(0, 3, 1, 2))\n feature_group_count=1\n lhs_dilation=(1, 1)\n lhs_shape=(1, 384, 384, 3)\n padding=((0, 0), (0, 0))\n precision=None\n preferred_element_type=None\n rhs_dilation=(1, 1)\n rhs_shape=(16, 16, 3, 1024)\n window_strides=(16, 16) ]"}, backend_config="{\"algorithm\":\"0\",\"tensor_ops_enabled\":false,\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" failed. Falling back to default algorithm.
Convolution performance may be suboptimal.
2021-07-03 06:59:59.062658: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_asm_compiler.cc:63] cuLinkAddData fails. This is usually caused by stale driver version.
2021-07-03 06:59:59.062716: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc:895] The CUDA linking API did not work. Please use XLA_FLAGS=--xla_gpu_force_compilation_parallelism=1 to bypass it, but expect to get longer compilation time due to the lack of multi-threading.
Traceback (most recent call last):
  File "run_cpu.py", line 89, in <module>
    vit_model = ViTL16(data_root=os.path.join(data_root, "vision_transformer"))
  File "run_cpu.py", line 66, in __init__
    logits, feature = model.apply(dict(params=params), (np.array(img_cv) / 128. - 1)[None, ...], train=False)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/flax/linen/module.py", line 914, in apply
    )(variables, *args, **kwargs, rngs=rngs)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/flax/core/scope.py", line 675, in wrapper
    y = fn(root, *args, **kwargs)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/flax/linen/module.py", line 1135, in scope_fn
    return fn(module.clone(parent=scope), *args, **kwargs)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/flax/linen/module.py", line 271, in wrapped_module_method
    y = fun(self, *args, **kwargs)
  File "/home/tione/notebook/vision_transformer/vit_jax/models.py", line 264, in __call__
    x)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/flax/linen/module.py", line 271, in wrapped_module_method
    y = fun(self, *args, **kwargs)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/flax/linen/linear.py", line 278, in __call__
    precision=self.precision)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/_src/lax/lax.py", line 633, in conv_general_dilated
    preferred_element_type=preferred_element_type)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/core.py", line 264, in bind
    out = top_trace.process_primitive(self, tracers, params)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/core.py", line 603, in process_primitive
    return primitive.impl(*tracers, **params)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/interpreters/xla.py", line 248, in apply_primitive
    compiled_fun = xla_primitive_callable(prim, *unsafe_map(arg_spec, args), **params)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/_src/util.py", line 186, in wrapper
    return cached(config._trace_context(), *args, **kwargs)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/_src/util.py", line 179, in cached
    return f(*args, **kwargs)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/interpreters/xla.py", line 297, in xla_primitive_callable
    compiled = backend_compile(backend, built_c, options)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/interpreters/xla.py", line 360, in backend_compile
    return backend.compile(built_c, compile_options=options)
RuntimeError: Unknown: no kernel image is available for execution on the device
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_asm_compiler.cc(66): 'status'
@alexwongdl Can you share the output of nvidia-smi and your CUDA toolkit version (e.g., nvcc --version)? My guess is your NVIDIA driver is too old for your CUDA toolkit version.
@hawkinsp I think I have the same issue. My code:
import flax.linen as nn
import jax
import jax.numpy as jnp
import tensorflow_datasets as tfds


def main():
    ds, info = tfds.load('fashion_mnist', with_info=True)
    num_classes = info.features['label'].num_classes
    image_shape = info.features['image'].shape
    num_test = info.splits['test'].num_examples
    num_train = info.splits['train'].num_examples

    def pp(iter):
        """Preprocesses images/labels for use with JAX."""
        for batch in iter:
            yield (
                jnp.array(batch['image']) / 255.,
                jax.nn.one_hot(batch['label'], num_classes),
            )

    train_images, train_labels = next(pp(iter(ds['train'].batch(num_train))))
    test_images, test_labels = next(pp(iter(ds['test'].batch(num_test))))

    class Model(nn.Module):
        def setup(self):
            self.dense = nn.Dense(features=10)

        def __call__(self, x):
            batch_size = x.shape[0]
            x = x.reshape([batch_size, -1])
            x = self.dense(x)
            return nn.log_softmax(x)

    model = Model()
    rng = jax.random.PRNGKey(0)
    variables = model.init(rng, train_images[:1])
    params = variables['params']

    def evaluate(params):
        log_probs = model.apply({'params': params}, test_images)
        return (log_probs.argmax(axis=-1) == test_labels.argmax(axis=-1)).mean()

    print(evaluate(params))


if __name__ == '__main__':
    main()
After running it, I get:
2021-09-07 21:57:00.181437: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-09-07 21:57:00.181472: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:118] Check failed: stream->parent()->GetBlasGemmAlgorithms(&algorithms)
Aborted (core dumped)
nvcc --version output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0
nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| N/A 49C P8 10W / N/A | 383MiB / 7982MiB | 5% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1104 G /usr/lib/xorg/Xorg 175MiB |
| 0 N/A N/A 1479 G /usr/bin/gnome-shell 37MiB |
| 0 N/A N/A 2589 G ...AAAAAAAAA= --shared-files 44MiB |
| 0 N/A N/A 3897 G ...AAAAAAAAA= --shared-files 122MiB |
+-----------------------------------------------------------------------------+
I am facing the exact same problem with the latest GPU drivers. Sample code:
from jax import jit
import jax.numpy as jnp
import scipy.linalg
import math


@jit
def hermitian(a):
    return jnp.conjugate(jnp.swapaxes(a, -1, -2))


def fourier_basis(n):
    """Fourier basis
    """
    F = scipy.linalg.dft(n) / math.sqrt(n)
    # From numpy to jax
    F = jnp.array(F)
    # Perform conjugate transpose
    F = hermitian(F)
    return F


N = 16
A = fourier_basis(N)
AH = hermitian(A)
# THIS LINE CRASHES
G = AH @ A
Error message after setting TF_CPP_MIN_LOG_LEVEL=0:
(base) shailesh@wks0018:~/work/cr-sparse/junk$ export TF_CPP_MIN_LOG_LEVEL=0
(base) shailesh@wks0018:~/work/cr-sparse/junk$ python test_fourier_gram.py
2021-10-23 21:16:31.589081: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-10-23 21:16:31.589118: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:118] Check failed: stream->parent()->GetBlasGemmAlgorithms(&algorithms)
Aborted (core dumped)
nvidia-smi output:
(base) shailesh@wks0018:~/work/cr-sparse/junk$ nvidia-smi
Sat Oct 23 21:21:46 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:02:00.0 Off | N/A |
| 0% 25C P8 5W / 120W | 5480MiB / 6077MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2848 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 3418 G /usr/bin/gnome-shell 4MiB |
| 0 N/A N/A 13682 C ...lesh/anaconda3/bin/python 5461MiB |
+-----------------------------------------------------------------------------+
nvcc:
(base) shailesh@wks0018:~/work/cr-sparse/junk$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Sep_13_19:13:29_PDT_2021
Cuda compilation tools, release 11.5, V11.5.50
Build cuda_11.5.r11.5/compiler.30411180_0
I used the exact recommended command to install jaxlib:
# Installs the wheel compatible with CUDA 11 and cuDNN 8.2 or newer.
pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_releases.html # Note: wheels only available on linux.
JAX and JAXLIB versions:
(base) shailesh@wks0018:~/work/cr-sparse/junk$ python
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> jax.__version__
'0.2.24'
>>> import jaxlib
>>> jaxlib.__version__
'0.1.73'
>>>
Following #5380, when I tried export XLA_PYTHON_CLIENT_PREALLOCATE=false, it stopped crashing.
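In case it helps anyone, the same flag can also be set from the script itself rather than the shell; a minimal sketch, assuming it runs before jax touches the GPU:

import os

# Disable XLA's default preallocation of most of the GPU memory;
# this must happen before jax initializes the GPU backend.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

import jax.numpy as jnp

x = jnp.ones((512, 512))
print((x @ x).block_until_ready()[0, 0])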
FYI, in my case I had multiple CUDA versions on the PATH (10.2 / 11.1 / 11.3) together with jax[cuda111]==0.2.20. It seems JAX picks up the latest version (i.e., 11.3). When I removed 11.3 from the PATH, it worked fine.
I second the export XLA_PYTHON_CLIENT_PREALLOCATE=false solution.
In my case, a combination of using Haiku and importing TensorFlow (through a third-party library) resulted in CUDNN_STATUS_EXECUTION_FAILED and a cryptic message: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:691] Failed to determine best cudnn convolution algorithm. It's pretty weird, since there are examples in Haiku that import TensorFlow and I can run them, but in my test case removing import tensorflow does fix it, so there must be some interplay in GPU memory allocation between these libraries that I don't quite understand.
When using Haiku together with TensorFlow Datasets (tfds), I find it helpful to set TF_FORCE_GPU_ALLOW_GROWTH=true so that TF doesn't preallocate loads of GPU memory.
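For example (a rough sketch; the variable has to be set before TensorFlow is imported, whether directly or indirectly via tfds, and the dataset and batch size here are just placeholders):

import os

# Keep TensorFlow (pulled in by tfds) from grabbing the whole GPU up front,
# so JAX can still create its cuBLAS/cuDNN handles afterwards.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import tensorflow_datasets as tfds  # imports TensorFlow under the hood
import jax.numpy as jnp

ds = tfds.load("mnist", split="train")
batch = next(iter(ds.batch(32).as_numpy_iterator()))
print(jnp.asarray(batch["image"]).shape)  # e.g. (32, 28, 28, 1)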
@noanabeshima was this resolved?
This seems to be an instance of a more general problem: CUDA libraries fail to load when there is insufficient GPU memory because JAX, TF, or other libraries have pre-allocated too much for themselves. We've added an FAQ section addressing various CUDA library loading issues and their solutions/workarounds, and have (hopefully) made it easier to find by linking it from some of the error messages that often correlate with these memory starvation issues.
I'm going to go ahead and close this specific issue, since the FAQ documentation should provide proper workarounds.