DJL 0.23 + TensorFlow cu113 cannot find CUDA capabilities

Open codeMan2018 opened this issue 2 years ago • 2 comments

https://github.com/deepjavalibrary/djl/issues/2573

I have a similar problem. My environment is Linux with a T4 GPU, and the debug output is as follows:

CUDA: 113
ARCH: 75
DJL version: 0.23.0
ai.djl.util.Platform - Found placeholder platform from: cu113-linux-x86_64:2.10.1
Default Engine: TensorFlow:2.10.1, capabilities: [MKL,]
TensorFlow Library: /usr/local/app/.djl.ai/tensorflow/2.10.1-cu113-linux-x86_64/libjnitensorflow.so

engine.hasCapability(StandardCapabilities.CUDA) is always false, although CudaUtils.getGpuCount() is 1.

I checked the code that composes the download URL, and FLAVOR is already cu113:
Downloading https://publish.djl.ai/tensorflow-2.10.1/linux/cu113/THIRD_PARTY_TF_JNI_LICENSES.gz
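For readers following the URL check, here is a minimal sketch of how the platform string and the download URL in this report relate. It is not DJL's actual implementation: the class and method names (`NativeUrlSketch`, `parsePlatform`, `fileUrl`) are hypothetical, and the URL layout is only inferred from the single URL quoted above.

```java
import java.util.Arrays;

// Hypothetical helper, not part of DJL: illustrates the URL layout seen in this issue.
final class NativeUrlSketch {
    // Base URL taken from the download line quoted in the issue.
    static final String BASE = "https://publish.djl.ai";

    /** Splits e.g. "cu113-linux-x86_64:2.10.1" into {flavor, os, arch, version}. */
    static String[] parsePlatform(String platform) {
        int colon = platform.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Missing ':' in " + platform);
        }
        String version = platform.substring(colon + 1);
        // Limit 3 keeps the architecture in one piece.
        String[] parts = platform.substring(0, colon).split("-", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected flavor-os-arch in " + platform);
        }
        return new String[] {parts[0], parts[1], parts[2], version};
    }

    /** Builds BASE/engine-version/os/flavor/file (layout assumed from the issue). */
    static String fileUrl(String engine, String version, String os, String flavor, String file) {
        return String.format("%s/%s-%s/%s/%s/%s", BASE, engine, version, os, flavor, file);
    }

    public static void main(String[] args) {
        String[] p = parsePlatform("cu113-linux-x86_64:2.10.1");
        System.out.println(Arrays.toString(p));
        // Arguments: engine, version (p[3]), os (p[1]), flavor (p[0]), file name.
        System.out.println(fileUrl("tensorflow", p[3], p[1], p[0], "THIRD_PARTY_TF_JNI_LICENSES.gz"));
    }
}
```

If the flavor component of the composed URL reads `cu113`, the GPU build of the native library was selected, which suggests the missing CUDA capability has a different cause than flavor detection.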

codeMan2018, Sep 27 '23 02:09

Please help me look into this issue @frankfliu

codeMan2018, Oct 12 '23 08:10

@codeMan2018

You can test it in a Docker image:

git clone https://github.com/deepjavalibrary/djl
docker run -it --rm --network=host -v $PWD:/workspace --runtime=nvidia --shm-size=2gb nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04 bash

In the docker container:

apt-get update
apt-get install openjdk-11-jdk-headless
cd /workspace/djl
./gradlew debugE -Dai.djl.default_engine=TensorFlow

Please post the output if you are not able to see the CUDA capability.

You should see something like:

DJL version: 0.24.0-SNAPSHOT
[DEBUG] - Using cache dir: /root/.djl.ai/tensorflow
[INFO ] - Downloading https://publish.djl.ai/tensorflow-2.10.1/linux/cu113/THIRD_PARTY_TF_JNI_LICENSES.gz ...
[INFO ] - Downloading https://publish.djl.ai/tensorflow-2.10.1/linux/cu113/LICENSE.gz ...
[INFO ] - Downloading https://publish.djl.ai/tensorflow-2.10.1/linux/cu113/libjnitensorflow.so.gz ...
[INFO ] - Downloading https://publish.djl.ai/tensorflow-2.10.1/linux/cu113/libtensorflow_framework.so.2.gz ...
[INFO ] - Downloading https://publish.djl.ai/tensorflow-2.10.1/linux/cu113/libtensorflow_cc.so.2.gz ...
[DEBUG] - Loading TensorFlow library from: /root/.djl.ai/tensorflow/2.10.1-cu113-linux-x86_64/libjnitensorflow.so
2023-10-12 00:49:22.073291: I external/org_tensorflow/tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-10-12 00:49:22.128940: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-12 00:49:22.165453: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-12 00:49:22.166548: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-12 00:49:22.271788: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-12 00:49:22.273339: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-12 00:49:22.858677: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-12 00:49:22.860347: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-12 00:49:22.861829: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-12 00:49:22.863294: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13584 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5
[DEBUG] - Using cache dir: /root/.djl.ai/tensorflow
Default Engine: TensorFlow:2.10.1, capabilities: [
        MKL,
        CUDA,
]
TensorFlow Library: /root/.djl.ai/tensorflow/2.10.1-cu113-linux-x86_64/libjnitensorflow.so
Default Device: gpu(0)
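To compare a saved copy of this output against a failing run, a small stdlib-only check can extract the capability list. This is a sketch under assumptions: `CapabilityCheck` and its methods are hypothetical names, and the parser only targets the `capabilities: [...]` text as it appears in this thread, in either the single-line or the multi-line form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of DJL: scans debug output text for capabilities.
final class CapabilityCheck {
    // Captures everything between "capabilities: [" and the next "]",
    // including line breaks.
    private static final Pattern CAPS = Pattern.compile("capabilities:\\s*\\[([^\\]]*)\\]");

    /** Returns the capability names found in the output, or an empty list. */
    static List<String> parseCapabilities(String output) {
        List<String> caps = new ArrayList<>();
        Matcher m = CAPS.matcher(output);
        if (m.find()) {
            for (String token : m.group(1).split(",")) {
                String name = token.trim();
                if (!name.isEmpty()) {
                    caps.add(name);
                }
            }
        }
        return caps;
    }

    /** True if the reported capabilities include CUDA. */
    static boolean hasCuda(String output) {
        return parseCapabilities(output).contains("CUDA");
    }

    public static void main(String[] args) {
        String failing = "Default Engine: TensorFlow:2.10.1, capabilities: [MKL,]";
        String working = "Default Engine: TensorFlow:2.10.1, capabilities: [\n"
                + "        MKL,\n        CUDA,\n]";
        System.out.println(parseCapabilities(failing) + " cuda=" + hasCuda(failing));
        System.out.println(parseCapabilities(working) + " cuda=" + hasCuda(working));
    }
}
```

The first sample string is the capability line from the original report and the second is the one from the Docker run above, so the two results differ only in whether CUDA is listed.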

frankfliu, Oct 12 '23 08:10