Error message: CUBLAS_STATUS_NOT_INITIALIZED
nvcc --version yields
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
I installed jax with
pip install --upgrade pip
pip install --upgrade jax[cuda111] -f https://storage.googleapis.com/jax-releases/jax_releases.html
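(As a sanity check, not part of the failing program below, something like the following can confirm whether jaxlib sees the GPU at all; just a minimal probe:)

import jax

# Expect a GPU device here, e.g. [GpuDevice(id=0)]; an empty list or a CPU-only
# device would point at an installation problem rather than a runtime one.
print(jax.devices())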
My code is
import jax.numpy as jnp
from jax import grad, jit, vmap
from jax import random
from time import time

key = random.PRNGKey(0)

def apply_matrix(v):
    global key
    mat = random.normal(key, (150, 100))
    return jnp.dot(mat, v)

def batched_apply_matrix(v_batched):
    return vmap(apply_matrix)(v_batched)

v_batched = random.normal(key, (10, 100))
batched_apply_matrix(v_batched).block_until_ready()
The program output is
2021-06-26 17:01:46.891583: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-06-26 17:01:46.891610: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:113] Check failed: stream->parent()->GetBlasGemmAlgorithms(&algorithms)
Aborted (core dumped)
@noanabeshima I am experiencing the same issue. The code runs fine, though, when I paste it into a Google Colab and connect to a local runtime. I am not sure why there is a difference.
Can you set the environment variable TF_CPP_MIN_LOG_LEVEL=0, rerun, and send us the full output?
Also can you share the output of nvidia-smi?
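If it's more convenient, the variable can also be set from inside Python before jax is imported; a minimal sketch (any small GPU computation will do, the only requirement is that the variable is set before the first jax import):

import os

# Has to be set before jax/jaxlib load the native libraries, or it has no effect.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"

import jax.numpy as jnp

# Force the GPU backend to initialize so the verbose logs are emitted.
print(jnp.dot(jnp.ones((4, 4)), jnp.ones((4, 4))).block_until_ready())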
Have you solved this problem? I am hitting the same issue.
2021-07-03 06:59:52.537404: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-03 06:59:54.543008: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-03 06:59:54.544110: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-07-03 06:59:54.576691: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-03 06:59:54.577708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:08.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2021-07-03 06:59:54.577847: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-03 06:59:54.582512: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-03 06:59:54.582599: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-07-03 06:59:54.585051: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-07-03 06:59:54.585364: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-07-03 06:59:54.585940: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-07-03 06:59:54.586916: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-07-03 06:59:54.587069: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-07-03 06:59:54.587181: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-03 06:59:54.588232: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-03 06:59:54.589174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
classifier: token
hidden_size: 1024
name: ViT-L_16
patches:
  size: !!python/tuple
  - 16
  - 16
representation_size: null
transformer:
  attention_dropout_rate: 0.0
  dropout_rate: 0.1
  mlp_dim: 4096
  num_heads: 16
  num_layers: 24
2021-07-03 06:59:58.442563: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:171] XLA service 0x6527e20 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2021-07-03 06:59:58.442602: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:179] StreamExecutor device (0): Interpreter, <undefined>
2021-07-03 06:59:58.446140: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/tfrt_cpu_pjrt_client.cc:160] TfrtCpuClient created.
INFO:absl:Starting the local TPU driver.
INFO:absl:Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
2021-07-03 06:59:58.533758: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-03 06:59:58.534812: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:171] XLA service 0x64348a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-07-03 06:59:58.534847: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:179] StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2021-07-03 06:59:58.535193: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/gpu_device.cc:298] Using BFC allocator.
2021-07-03 06:59:58.535261: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/gpu_device.cc:257] XLA backend allocating 30352696934 bytes on device 0 for BFCAllocator.
2021-07-03 06:59:58.535727: I external/org_tensorflow/tensorflow/stream_executor/tpu/tpu_platform_interface.cc:74] No TPU platform found.
INFO:absl:Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
2021-07-03 06:59:58.995298: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-07-03 06:59:58.995843: W external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:342] There was an error before creating cudnn handle: cudaErrorInitializationError : initialization error
2021-07-03 06:59:58.999101: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2021-07-03 06:59:58.999161: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:382] Possibly insufficient driver version: 418.67.0
2021-07-03 06:59:58.999651: W external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:342] There was an error before creating cudnn handle: cudaErrorInitializationError : initialization error
2021-07-03 06:59:58.999675: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2021-07-03 06:59:58.999696: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:382] Possibly insufficient driver version: 418.67.0
2021-07-03 06:59:58.999746: W external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:342] There was an error before creating cudnn handle: cudaErrorInitializationError : initialization error
2021-07-03 06:59:58.999755: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2021-07-03 06:59:58.999768: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:382] Possibly insufficient driver version: 418.67.0
2021-07-03 06:59:59.000039: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:691] Failed to determine best cudnn convolution algorithm: Internal: All algorithms tried for %custom-call = (f32[1,24,24,1024]{2,1,3,0}, u8[0]{0}) custom-call(f32[1,384,384,3]{2,1,3,0} %copy.3, f32[16,16,3,1024]{1,0,2,3} %copy.4), window={size=16x16 stride=16x16}, dim_labels=b01f_01io->b01f, custom_call_target="__cudnn$convForward", metadata={op_type="conv_general_dilated" op_name="conv_general_dilated[ batch_group_count=1\n dimension_numbers=ConvDimensionNumbers(lhs_spec=(0, 3, 1, 2), rhs_spec=(3, 2, 0, 1), out_spec=(0, 3, 1, 2))\n feature_group_count=1\n lhs_dilation=(1, 1)\n lhs_shape=(1, 384, 384, 3)\n padding=((0, 0), (0, 0))\n precision=None\n preferred_element_type=None\n rhs_dilation=(1, 1)\n rhs_shape=(16, 16, 3, 1024)\n window_strides=(16, 16) ]"}, backend_config="{\"algorithm\":\"0\",\"tensor_ops_enabled\":false,\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" failed. Falling back to default algorithm.
Convolution performance may be suboptimal.
2021-07-03 06:59:59.062658: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_asm_compiler.cc:63] cuLinkAddData fails. This is usually caused by stale driver version.
2021-07-03 06:59:59.062716: E external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc:895] The CUDA linking API did not work. Please use XLA_FLAGS=--xla_gpu_force_compilation_parallelism=1 to bypass it, but expect to get longer compilation time due to the lack of multi-threading.
Traceback (most recent call last):
  File "run_cpu.py", line 89, in <module>
    vit_model = ViTL16(data_root=os.path.join(data_root, "vision_transformer"))
  File "run_cpu.py", line 66, in __init__
    logits, feature = model.apply(dict(params=params), (np.array(img_cv) / 128. - 1)[None, ...], train=False)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/flax/linen/module.py", line 914, in apply
    )(variables, *args, **kwargs, rngs=rngs)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/flax/core/scope.py", line 675, in wrapper
    y = fn(root, *args, **kwargs)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/flax/linen/module.py", line 1135, in scope_fn
    return fn(module.clone(parent=scope), *args, **kwargs)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/flax/linen/module.py", line 271, in wrapped_module_method
    y = fun(self, *args, **kwargs)
  File "/home/tione/notebook/vision_transformer/vit_jax/models.py", line 264, in __call__
    x)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/flax/linen/module.py", line 271, in wrapped_module_method
    y = fun(self, *args, **kwargs)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/flax/linen/linear.py", line 278, in __call__
    precision=self.precision)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/_src/lax/lax.py", line 633, in conv_general_dilated
    preferred_element_type=preferred_element_type)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/core.py", line 264, in bind
    out = top_trace.process_primitive(self, tracers, params)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/core.py", line 603, in process_primitive
    return primitive.impl(*tracers, **params)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/interpreters/xla.py", line 248, in apply_primitive
    compiled_fun = xla_primitive_callable(prim, *unsafe_map(arg_spec, args), **params)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/_src/util.py", line 186, in wrapper
    return cached(config._trace_context(), *args, **kwargs)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/_src/util.py", line 179, in cached
    return f(*args, **kwargs)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/interpreters/xla.py", line 297, in xla_primitive_callable
    compiled = backend_compile(backend, built_c, options)
  File "/opt/conda/envs/tensorflow2.4_py3/lib/python3.6/site-packages/jax/interpreters/xla.py", line 360, in backend_compile
    return backend.compile(built_c, compile_options=options)
RuntimeError: Unknown: no kernel image is available for execution on the device
in external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_asm_compiler.cc(66): 'status'
@alexwongdl Can you share the output of nvidia-smi and your CUDA toolkit version (e.g., nvcc --version)? My guess is your NVIDIA driver is too old for your CUDA toolkit version.
@hawkinsp I think I have the same issue. My code:
import flax.linen as nn
import jax
import jax.numpy as jnp
import tensorflow_datasets as tfds


def main():
    ds, info = tfds.load('fashion_mnist', with_info=True)
    num_classes = info.features['label'].num_classes
    image_shape = info.features['image'].shape
    num_test = info.splits['test'].num_examples
    num_train = info.splits['train'].num_examples

    def pp(iter):
        """Preprocesses images/labels for use with JAX."""
        for batch in iter:
            yield (
                jnp.array(batch['image']) / 255.,
                jax.nn.one_hot(batch['label'], num_classes),
            )

    train_images, train_labels = next(pp(iter(ds['train'].batch(num_train))))
    test_images, test_labels = next(pp(iter(ds['test'].batch(num_test))))

    class Model(nn.Module):
        def setup(self):
            self.dense = nn.Dense(features=10)

        def __call__(self, x):
            batch_size = x.shape[0]
            x = x.reshape([batch_size, -1])
            x = self.dense(x)
            return nn.log_softmax(x)

    model = Model()
    rng = jax.random.PRNGKey(0)
    variables = model.init(rng, train_images[:1])
    params = variables['params']

    def evaluate(params):
        log_probs = model.apply({'params': params}, test_images)
        return (log_probs.argmax(axis=-1) == test_labels.argmax(axis=-1)).mean()

    print(evaluate(params))


if __name__ == '__main__':
    main()
After running it, I get:
2021-09-07 21:57:00.181437: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-09-07 21:57:00.181472: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:118] Check failed: stream->parent()->GetBlasGemmAlgorithms(&algorithms)
Aborted (core dumped)
nvcc --version output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0
nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| N/A 49C P8 10W / N/A | 383MiB / 7982MiB | 5% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1104 G /usr/lib/xorg/Xorg 175MiB |
| 0 N/A N/A 1479 G /usr/bin/gnome-shell 37MiB |
| 0 N/A N/A 2589 G ...AAAAAAAAA= --shared-files 44MiB |
| 0 N/A N/A 3897 G ...AAAAAAAAA= --shared-files 122MiB |
+-----------------------------------------------------------------------------+
I am facing the exact same problem with the latest GPU drivers. Sample code:
from jax import jit
import jax.numpy as jnp
import scipy.linalg
import math


@jit
def hermitian(a):
    return jnp.conjugate(jnp.swapaxes(a, -1, -2))


def fourier_basis(n):
    """Fourier basis
    """
    F = scipy.linalg.dft(n) / math.sqrt(n)
    # From numpy to jax
    F = jnp.array(F)
    # Perform conjugate transpose
    F = hermitian(F)
    return F


N = 16
A = fourier_basis(N)
AH = hermitian(A)
# THIS LINE CRASHES
G = AH @ A
Error message after setting TF_CPP_MIN_LOG_LEVEL=0:
(base) shailesh@wks0018:~/work/cr-sparse/junk$ export TF_CPP_MIN_LOG_LEVEL=0
(base) shailesh@wks0018:~/work/cr-sparse/junk$ python test_fourier_gram.py
2021-10-23 21:16:31.589081: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-10-23 21:16:31.589118: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:118] Check failed: stream->parent()->GetBlasGemmAlgorithms(&algorithms)
Aborted (core dumped)
nvidia-smi output:
(base) shailesh@wks0018:~/work/cr-sparse/junk$ nvidia-smi
Sat Oct 23 21:21:46 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:02:00.0 Off | N/A |
| 0% 25C P8 5W / 120W | 5480MiB / 6077MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2848 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 3418 G /usr/bin/gnome-shell 4MiB |
| 0 N/A N/A 13682 C ...lesh/anaconda3/bin/python 5461MiB |
+-----------------------------------------------------------------------------+
nvcc:
(base) shailesh@wks0018:~/work/cr-sparse/junk$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Sep_13_19:13:29_PDT_2021
Cuda compilation tools, release 11.5, V11.5.50
Build cuda_11.5.r11.5/compiler.30411180_0
I used the exact recommended command to install jaxlib:
# Installs the wheel compatible with CUDA 11 and cuDNN 8.2 or newer.
pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_releases.html # Note: wheels only available on linux.
JAX and JAXLIB versions:
(base) shailesh@wks0018:~/work/cr-sparse/junk$ python
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> jax.__version__
'0.2.24'
>>> import jaxlib
>>> jaxlib.__version__
'0.1.73'
>>>
Following #5380, when I tried export XLA_PYTHON_CLIENT_PREALLOCATE=false, it stopped crashing.
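In case it helps anyone, the same flag can also be set from the script itself rather than the shell; a minimal sketch, assuming it runs before jax touches the GPU:

import os

# Disable XLA's default preallocation of most of the GPU memory;
# this must happen before jax initializes the GPU backend.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

import jax.numpy as jnp

x = jnp.ones((512, 512))
print((x @ x).block_until_ready()[0, 0])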
FYI, in my case I had multiple CUDA versions on the PATH (10.2 / 11.1 / 11.3) together with jax[cuda111]==0.2.20. It seems JAX picks up the latest version (i.e., 11.3). When I removed 11.3 from the PATH, it worked fine.
I second the export XLA_PYTHON_CLIENT_PREALLOCATE=false solution.
In my case, a combination of using Haiku and importing TensorFlow (through a third-party library) resulted in CUDNN_STATUS_EXECUTION_FAILED and a cryptic message: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:691] Failed to determine best cudnn convolution algorithm. It's pretty weird, since there are examples in Haiku that import TensorFlow and I can run them, but in my test case removing import tensorflow does fix it, so there must be some interplay in GPU memory allocation between these libraries that I don't quite understand.
When using Haiku together with TensorFlow Datasets (tfds), I find it helpful to set TF_FORCE_GPU_ALLOW_GROWTH=true so that TF doesn't preallocate loads of GPU memory.
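For example (a rough sketch; the variable has to be set before TensorFlow is imported, whether directly or indirectly via tfds, and the dataset and batch size here are just placeholders):

import os

# Keep TensorFlow (pulled in by tfds) from grabbing the whole GPU up front,
# so JAX can still create its cuBLAS/cuDNN handles afterwards.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import tensorflow_datasets as tfds  # imports TensorFlow under the hood
import jax.numpy as jnp

ds = tfds.load("mnist", split="train")
batch = next(iter(ds.batch(32).as_numpy_iterator()))
print(jnp.asarray(batch["image"]).shape)  # e.g. (32, 28, 28, 1)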
@noanabeshima was this resolved?
This seems to be an instance of a more general problem: CUDA libraries fail to load when there is insufficient GPU memory because JAX, TF, or other libraries have pre-allocated too much for themselves. We've added an FAQ section addressing various CUDA library loading issues and their solutions/workarounds, and have (hopefully) made it easier to find by linking it from some of the error messages that often correlate with these memory starvation issues.
I'm going to go ahead and close this specific issue, since the FAQ documentation should provide proper workarounds.