TensorFlow 2.7 does not detect CUDA installed through conda

Open drasmuss opened this issue 4 years ago • 32 comments

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.7.0
  • Python version: 3.8
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: 11.2/8.1
  • GPU model and memory: GTX 2080Ti

Describe the current behavior

After installing cuda/cudnn through conda (conda install cudatoolkit=11.2 cudnn=8.1), TensorFlow 2.7 reports that it cannot find the cuda libraries.

2021-11-08 14:49:16.412959: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-11-08 14:49:16.413006: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-11-08 14:49:22.640508: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.640617: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.640698: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.640776: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.640853: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.640941: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.641022: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.641099: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.641120: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
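
A small diagnostic sketch for the log above: it pulls the library names out of the "Could not load dynamic library" warnings and reports which of them exist in a given directory such as $CONDA_PREFIX/lib. The helper names (`missing_libraries`, `locate_in`) are made up for illustration, and the check is only a file-existence test; it says nothing about how TensorFlow itself searches for libraries.

```python
import os
import re

# Matches the warning text emitted by dso_loader.cc in the log above.
_PATTERN = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names reported as unloadable, in order, without duplicates."""
    seen = []
    for name in _PATTERN.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


def locate_in(lib_dir, names):
    """Map each library name to True if a file of that name exists in lib_dir."""
    return {name: os.path.exists(os.path.join(lib_dir, name)) for name in names}


# Example (hypothetical file name): check the names from a saved log against the active env.
# names = missing_libraries(open("tf.log").read())
# print(locate_in(os.path.join(os.environ["CONDA_PREFIX"], "lib"), names))
```

If the libraries turn out to be present in $CONDA_PREFIX/lib while TensorFlow still fails to load them, that is consistent with a search-path problem rather than a missing install.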

Installing TensorFlow 2.6 (or earlier) in the same environment, with the same cuda/cudnn installation, shows no problem: it detects the libraries and GPU support works as expected.

The problem can be worked around by manually adding the conda lib directory to LD_LIBRARY_PATH (export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib). However, obviously this is not ideal, as it needs to be repeated/adjusted for every new conda environment. It would be better if TensorFlow just detected the conda installed libraries, as it did in TensorFlow < 2.7.
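
The workaround above can also be applied to a single child process instead of the whole shell. The following is a minimal sketch, assuming the conda environment is active so that CONDA_PREFIX is set; `env_with_conda_lib` and `list_gpus_in_child` are hypothetical helper names. On Linux the dynamic loader reads LD_LIBRARY_PATH when a process starts, so the sketch launches a fresh interpreter rather than modifying the current one.

```python
import os
import subprocess
import sys


def env_with_conda_lib(environ):
    """Return a copy of environ with $CONDA_PREFIX/lib appended to LD_LIBRARY_PATH."""
    env = dict(environ)
    prefix = env.get("CONDA_PREFIX")
    if not prefix:
        return env  # no active conda environment; leave everything untouched
    conda_lib = os.path.join(prefix, "lib")
    parts = [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]
    if conda_lib not in parts:
        parts.append(conda_lib)
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env


def list_gpus_in_child():
    """Run the GPU check from the reproduction steps in a fresh interpreter."""
    code = "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env_with_conda_lib(os.environ),
        capture_output=True,
        text=True,
    )
```

This keeps the change scoped to one process, but it still has to be repeated wherever TensorFlow is launched.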

Describe the expected behavior

TensorFlow should detect cuda/cudnn libraries installed through conda, as it did in TensorFlow<2.7.

Contributing

  • Do you want to contribute a PR? (yes/no): no
  • Briefly describe your candidate solution(if contributing):

Standalone code to reproduce the issue

conda create -n tmp python=3.8
conda activate tmp
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1
pip install "tensorflow==2.7.0"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  # displays []
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  # displays [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
pip install "tensorflow<2.7.0"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  # displays [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

drasmuss avatar Nov 08 '21 14:11 drasmuss

@drasmuss, we can see that you have installed TensorFlow in a conda environment. Installation issues within the Anaconda environment are tracked in the Anaconda repo. Please try installing in a new virtual environment from this link and let us know if it is still an issue. Thanks!

tilakrayal avatar Nov 09 '21 06:11 tilakrayal

I'm not installing tensorflow from conda, just cuda/cudnn. Tensorflow is being installed from pip like normal. And you can see in the reproduction steps I posted above that we're starting from a new virtual environment (repeated below for convenience).

conda create -n tmp python=3.8
conda activate tmp
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1
pip install "tensorflow==2.7.0"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  # displays []
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  # displays [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
pip install "tensorflow<2.7.0"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  # displays [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Also note that nothing has changed on the conda side of things; we're still using the exact same environment with the same cuda/cudnn libraries, but it works in TF 2.6 and fails in TF 2.7. So I don't think the issue is on the conda side; something has changed in TensorFlow that has made this stop working.

drasmuss avatar Nov 09 '21 14:11 drasmuss

Open the terminal and type

nano ~/.bashrc

At the end of the file, add the following two lines:

export PATH=$PATH:/usr/local/cuda-11.2/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2/lib64

Ensure there are no spaces on either side of the '=' sign.

If it still does not work, try adding the paths for version 11.0:

export PATH=$PATH:/usr/local/cuda-11.0/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.0/lib64

pradyyadav avatar Nov 12 '21 08:11 pradyyadav

As mentioned, CUDA is being installed through conda, so /usr/local/cuda- is not the correct path (the correct path is given in the original post: $CONDA_PREFIX/lib). However, hard coding that into .bashrc isn't a solution, because $CONDA_PREFIX changes depending on which conda environment you have active.

drasmuss avatar Nov 12 '21 13:11 drasmuss

Conda installs are not officially supported by Google

mihaimaruseac avatar Nov 16 '21 21:11 mihaimaruseac

I installed TensorFlow 2.7 on Windows with CUDA 11.2 and cuDNN 8.1 (no conda involved). I received the same "Could not load dynamic library" errors. I switched CUDA to 11.0 and it worked. I am guessing that the pip packages for TensorFlow 2.7 were accidentally built against CUDA 11.0 instead of 11.2.

ddaspit avatar Nov 29 '21 04:11 ddaspit

Open the terminal and type

nano ~/.bashrc

At the end of the file, add the following two lines:

export PATH=$PATH:/usr/local/cuda-11.2/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2/lib64

Ensure there are no spaces on either side of the '=' sign.

If it still does not work, try adding the paths for version 11.0:

export PATH=$PATH:/usr/local/cuda-11.0/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.0/lib64

Thank you, this also works with cuda-11.4. But how would you fix this issue in a Jupyter notebook? This is for the fairly niche case where you need tf=2.7.0 features.

When I start a Jupyter server within an env that has these paths exported, it only shows the CPU. Exporting the paths in the notebook doesn't work either.
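
A possible explanation, offered as an assumption rather than something confirmed in this thread: on Linux the dynamic loader reads LD_LIBRARY_PATH when a process starts, so changing it from inside an already running notebook kernel comes too late. One thing that may be worth trying is loading the CUDA libraries by absolute path with `ctypes` before importing TensorFlow. The sketch below has not been tested against TensorFlow 2.7; `preload_cuda_libs` is a hypothetical helper, and the library names are copied from the log in the original post.

```python
import ctypes
import os

# Library names taken from the "Could not load dynamic library" log lines.
# The order (libraries that others probably depend on first) is an assumption.
CUDA_LIBS = [
    "libcudart.so.11.0",
    "libcublasLt.so.11",
    "libcublas.so.11",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.11",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def preload_cuda_libs(lib_dir, names=CUDA_LIBS):
    """Try to load each library from lib_dir; return (loaded, failed) name lists."""
    loaded, failed = [], []
    for name in names:
        path = os.path.join(lib_dir, name)
        try:
            # RTLD_GLOBAL makes the library's symbols visible to later loads.
            ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
            loaded.append(name)
        except OSError:
            failed.append(name)
    return loaded, failed


# In a notebook cell, before `import tensorflow`:
# loaded, failed = preload_cuda_libs(os.path.join(os.environ["CONDA_PREFIX"], "lib"))
```

The `failed` list gives a quick indication of which libraries are absent from the directory or could not be loaded for another reason.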

janniksinz avatar Nov 29 '21 21:11 janniksinz

This seems to solve the issue:

conda activate ENVNAME

cd $CONDA_PREFIX
mkdir -p ./etc/conda/activate.d
mkdir -p ./etc/conda/deactivate.d
touch ./etc/conda/activate.d/env_vars.sh
touch ./etc/conda/deactivate.d/env_vars.sh

Edit ./etc/conda/activate.d/env_vars.sh as follows:

#!/bin/sh

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib

Edit ./etc/conda/deactivate.d/env_vars.sh as follows:

#!/bin/sh

unset LD_LIBRARY_PATH

Source

https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#macos-and-linux

jesusdpa1 avatar Dec 01 '21 20:12 jesusdpa1

I don't want to be dismissive here, but there is a lack of understanding of the problem specifically introduced by TF 2.7:

  • A conda environment does install native libraries and does ensure they will be found by the OS dynamic loader mechanism for the programs that want to find these libraries.
  • Until TF 2.7 this was the way it worked, like the gazillion other native apps (including CUDA ones).
  • TF 2.7, not conda, specifically broke that by ignoring the OS loading mechanism for an unknown/undocumented reason.

This problem is not just a techie point; it has deep implications for businesses that build real products. This method of working is the only reliable one for teams that work on more than one TF project and require multiple TF/CUDA/Python combinations on the same workstation (without root access). By the way, the CUDA stack from the official nvidia channel, including nvcc/ptxas, works perfectly in conda and is recommended by Nvidia itself.

For my suffering peers, if you don't have access to root, you can use this small poorly-documented feature in your environment.yml:

name: base-tf-cuda-env
channel:
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.8
# Install cuda libs + ptxas compiler from nvidia channel
# This will accelerate the compilation of kernels for your specific card
  - cudatoolkit=11
  - cudnn=8
  - cupti=11
  - cuda-nvcc
...
  - pip
  - pip:
     - tensorflow==2.7.*
variables:
  # In case you want to see your own logs and tame the TF loggorrhea
  TF_CPP_MIN_LOG_LEVEL: 3
  # Adjust to point to your local env path:
  LD_LIBRARY_PATH: /home/me/.conda/envs/thisenvname/lib

Upon conda activate, the env variables will be set for you, and unset on deactivation. Better than nothing, but might interfere with some other configuration...

holongate avatar Dec 12 '21 11:12 holongate

Upon conda activate, the env variables will be set for you, and unset on deactivation. Better than nothing, but might interfere with some other configuration...

Really appreciate the file you provided! There is a typo in the channelS part (the key should be channels), but that's awesome, thanks :+1:

chainyo avatar Jan 19 '22 08:01 chainyo

Upon conda activate, the env variables will be set for you, and unset on deactivation. Better than nothing, but might interfere with some other configuration...

@holongate 's env is a good workaround and solves the problem for me.

I'm quite astonished by how little thought was given to the issue - which is clearly a problem with TF 2.7 itself, and not with conda - and by how much time you waste on commenting that conda installs are not supported by Google.

filippocastelli avatar Jan 26 '22 13:01 filippocastelli

For anyone looking for a one-liner solution, you can do

conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib

(with the environment you want to modify activated). This has a similar effect to @jesusdpa1's solution here https://github.com/tensorflow/tensorflow/issues/52988#issuecomment-984025384: it'll set LD_LIBRARY_PATH when the environment is activated and unset it when it's deactivated.

You still need to repeat that for every new conda environment though. It would be better if TensorFlow just detected the conda installed libraries, as it did in TensorFlow<=2.6.

drasmuss avatar Jan 28 '22 20:01 drasmuss

fwiw when modifying LD_LIBRARY_PATH to $CONDA_PREFIX/lib, you would risk conflicts in OpenSSL (thereby making it impossible to use git or ssh); this generally impacts fedora-like systems (CentOS and the like). Of course, if you have the cuda libraries elsewhere (e.g. /usr/local/cuda) that would not be an issue; and generally pointing LD_LIBRARY_PATH to these local libraries (non-conda) will work
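
One rough way to gauge this risk before changing LD_LIBRARY_PATH is to list the libraries in $CONDA_PREFIX/lib whose names start with prefixes that system tools link against. The sketch below is illustrative only: `shadowing_candidates` is a hypothetical helper, and the default prefixes cover just the OpenSSL libraries mentioned above, so other prefixes would need to be passed in explicitly.

```python
import os


def shadowing_candidates(lib_dir, prefixes=("libssl", "libcrypto")):
    """Return sorted file names in lib_dir that start with one of the given prefixes."""
    try:
        entries = os.listdir(lib_dir)
    except FileNotFoundError:
        return []  # directory does not exist, so nothing can shadow system libraries
    return sorted(e for e in entries if e.startswith(tuple(prefixes)))


# Example: inspect the active environment.
# print(shadowing_candidates(os.path.join(os.environ["CONDA_PREFIX"], "lib")))
```

A non-empty result does not prove a conflict; it only flags libraries that would take precedence over the system copies once the directory is on LD_LIBRARY_PATH.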

ngam avatar Feb 17 '22 19:02 ngam

Not defending this change, since it is obviously inconvenient and can cause serious issues for people on HPC (e.g. OpenSSL conflicts), but this LD_LIBRARY_PATH stuff seems to have been documented here: https://www.tensorflow.org/install/gpu#linux_setup (not sure when it was added)

ngam avatar Feb 17 '22 19:02 ngam

Upon conda activate, the env variables will be set for you, and unset on deactivation. Better than nothing, but might interfere with some other configuration...

@holongate 's env is a good workaround and solves the problem for me.

I'm quite astonished by how little thought was given to the issue - which is clearly a problem with TF 2.7 itself, and not with conda - and by how much time you waste on commenting that conda installs are not supported by Google.

Right. Quite sad to see there is an army of TF guardians of the orthodoxy to censor my remarks about lack of interest for this kind of issue but none to engage in a conversation. And writing anything about the P-devil competition is almost instantly ~~torched~~ deleted

holongate avatar Feb 26 '22 16:02 holongate

I'm continuing to have issues related to this change. Using conda's lib folder as the LD_LIBRARY_PATH affects too much of the system to be a good recommended solution. For me, it renders my terminal useless because I can't use less.

The comment that "Conda installs are not officially supported by Google" might have been true at one point, but the official Tensorflow installation instructions now tell you to blindly export LD_LIBRARY_PATH. The TensorFlow team should revert the change that was made in TensorFlow 2.7 and use the ld.so system correctly.


However, since the chances of that seem slim, here's a workaround that is, in my opinion, better than the LD_LIBRARY_PATH approach because it only loads the cuda/cudnn libraries and not everything in the $CONDA_PREFIX/lib directory. There may be some downsides to this method, but so far it is working for me.

Building on https://github.com/tensorflow/tensorflow/issues/52988#issuecomment-1024604306:

conda env config vars set LD_PRELOAD=$CONDA_PREFIX/lib/libcudart.so:$CONDA_PREFIX/lib/libcublas.so:$CONDA_PREFIX/lib/libcublasLt.so:$CONDA_PREFIX/lib/libcufft.so:$CONDA_PREFIX/lib/libcurand.so:$CONDA_PREFIX/lib/libcusolver.so:$CONDA_PREFIX/lib/libcusparse.so:$CONDA_PREFIX/lib/libcudnn.so

This will need to be updated when the list of libraries used by TensorFlow changes.
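
To reduce the manual upkeep mentioned above, the LD_PRELOAD value could be generated rather than typed out. This is a sketch only: `build_ld_preload` is a hypothetical helper, and the default list simply mirrors the libraries in the command above.

```python
import os

# Unversioned names, as used in the LD_PRELOAD command above.
DEFAULT_LIBS = (
    "libcudart.so",
    "libcublas.so",
    "libcublasLt.so",
    "libcufft.so",
    "libcurand.so",
    "libcusolver.so",
    "libcusparse.so",
    "libcudnn.so",
)


def build_ld_preload(conda_prefix, libs=DEFAULT_LIBS):
    """Return a colon-separated LD_PRELOAD value for the listed libraries that exist."""
    lib_dir = os.path.join(conda_prefix, "lib")
    paths = [os.path.join(lib_dir, name) for name in libs]
    return ":".join(p for p in paths if os.path.exists(p))


# Example: print the value for the active environment.
# print(build_ld_preload(os.environ["CONDA_PREFIX"]))
```

The printed value could then be passed to conda env config vars set LD_PRELOAD=... in the same way as the command above. Libraries that are not installed are skipped instead of producing a broken entry.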

tbekolay avatar Jun 20 '22 21:06 tbekolay

I also had to stop using the LD_LIBRARY_PATH approach because it had too many side-effects on other system packages. @tbekolay's fix is an improvement, but is obviously quite inconvenient (I have to google this issue every time I want to set up a new conda environment), and likely to break in future TensorFlow/CUDA releases.

Since TensorFlow's installation instructions now explicitly suggest using conda:

conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
python3 -m pip install tensorflow

it really would be nice to have conda properly supported, as it was in TensorFlow < 2.7.

drasmuss avatar Jun 20 '22 21:06 drasmuss

@drasmuss @tbekolay

Btw...... as someone who's a bit involved in the conda-forge side, I can confidently say that the tensorflow version we ship (currently only up to 2.8.1) is a lot more performant than the one you get from PyPI and even more performant than the one you'd get from specialized containers (e.g. nvidia ngc). Give it a go.

If you need cuda, use this:

CONDA_OVERRIDE_CUDA="11.2" conda create -n cftf tensorflow==2.8.1=*cuda112* -c conda-forge

And will get you everything you need.

ngam avatar Jun 21 '22 14:06 ngam

We will soon have 2.9.1 too, if you'd like to help or contribute here's the relevant PR: https://github.com/conda-forge/tensorflow-feedstock/pull/240

ngam avatar Jun 21 '22 14:06 ngam

Installing PyTorch is so easy: conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch. It installs CUDA automatically.

I don't know why TensorFlow is so difficult.

GF-Huang avatar Jul 06 '22 14:07 GF-Huang

I haven't tested performance, but for anyone experiencing the issue in this thread, I can confirm that installing tensorflow from conda-forge, rather than pip, works without having to set LD_LIBRARY_PATH or LD_PRELOAD, so I would say it's the preferred method of installation for people experiencing this issue (which I believe is anyone using TF 2.7+ from pip).

tbekolay avatar Jul 10 '22 14:07 tbekolay

I haven't tested performance, but for anyone experiencing the issue in this thread, I can confirm that installing tensorflow from conda-forge, rather than pip, works without having to set LD_LIBRARY_PATH or LD_PRELOAD, so I would say it's the preferred method of installation for people experiencing this issue (which I believe is anyone using TF 2.7+ from pip).

I can confirm this too.

madarax64 avatar Jul 11 '22 11:07 madarax64

Hi @drasmuss! We are checking to see whether you still need help with this issue. The official documentation has now been updated to facilitate CUDA configuration through conda.

Could you test with 2.8/2.10 and let us know from your side. Thank you!

mohantym avatar Nov 22 '22 14:11 mohantym

The official documentation suggests manually doing export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/ every time you want to use TensorFlow, which obviously isn't really a feasible solution.

This is discussed above, but I'll reiterate the main points here for anyone coming across this thread:

  1. Currently the best solution is to use the community-maintained TensorFlow installation from conda-forge (e.g. conda install -c conda-forge tensorflow). Generally speaking that should just work.
  2. If 1. isn't possible/working for some reason (e.g. because you need to use a very recent release of TensorFlow that isn't yet available on conda-forge), the easiest solution is here https://github.com/tensorflow/tensorflow/issues/52988#issuecomment-1024604306.
  3. However, sometimes 2. can cause problems with other system packages, since you're modifying the global LD_LIBRARY_PATH (note that this is also a problem with the approach recommended in the official documentation). If you run into issues like that, you can try this approach https://github.com/tensorflow/tensorflow/issues/52988#issuecomment-1160849524, with the caveats mentioned there that this might break in future updates.

I'll reiterate again, that all of these solutions are a downgrade in the user experience from TensorFlow < 2.7, when TensorFlow just correctly detected the conda-installed CUDA libraries without any fiddling required from the user.

drasmuss avatar Nov 22 '22 15:11 drasmuss

A semi-automated snippet that I am using to solve the cudatoolkit path problem in a conda environment:

conda activate tf_env
conda install -c conda-forge cudatoolkit cudnn

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
mkdir -p $CONDA_PREFIX/etc/conda/deactivate.d

printf '#!/bin/sh\nexport OLD_LD_LIBRARY_PATH=$LD_LIBRARY_PATH\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/\n' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh 
printf '#!/bin/sh\nexport LD_LIBRARY_PATH=$OLD_LD_LIBRARY_PATH\nunset OLD_LD_LIBRARY_PATH\n' > $CONDA_PREFIX/etc/conda/deactivate.d/env_vars.sh 

This snippet automatically sets and unsets the necessary environment variables when you activate or deactivate the conda environment. It could be useful not only for TF users, but also for other libraries that need CUDA dependencies to be built manually from source.
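
The save-and-restore logic in these two scripts can be modelled in a few lines, which also shows how it differs from simply unsetting the variable on deactivation. This is an illustrative model that operates on a plain dictionary, not on the real shell environment; the function names are hypothetical.

```python
def activate(env, conda_prefix):
    """Mimic activate.d/env_vars.sh: remember the old value, then append the conda lib dir."""
    env = dict(env)
    old = env.get("LD_LIBRARY_PATH", "")
    env["OLD_LD_LIBRARY_PATH"] = old
    env["LD_LIBRARY_PATH"] = f"{old}:{conda_prefix}/lib/"
    return env


def deactivate(env):
    """Mimic deactivate.d/env_vars.sh: restore the remembered value and drop the backup."""
    env = dict(env)
    env["LD_LIBRARY_PATH"] = env.pop("OLD_LD_LIBRARY_PATH", "")
    return env


before = {"LD_LIBRARY_PATH": "/usr/local/lib"}
after = deactivate(activate(before, "/opt/conda/envs/tf_env"))
print(after == before)  # the pre-activation value survives the round trip
```

One subtlety the model makes visible: if LD_LIBRARY_PATH was not set at all before activation, it ends up set to an empty string afterwards rather than being unset.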

TuanBC avatar Nov 23 '22 04:11 TuanBC

Hi @drasmuss, could you please refer to this documentation source.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/

For your convenience it is recommended that you automate it with the following commands. The system paths will be automatically configured when you activate this conda environment.

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

With the above two lines of code it is not required to use the command export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/ every time you want to use TensorFlow. It's a one-time setup, and after that you can use the environment any number of times.

I hope this addresses the issue. Please confirm if anything is still missing here. Thanks!

SuryanarayanaY avatar Mar 01 '23 13:03 SuryanarayanaY

Hi @SuryanarayanaY,

See https://github.com/tensorflow/tensorflow/issues/52988#issuecomment-1323825349 for a summary of the discussion in this thread. The short answer is that no, that solution doesn't address the issue.

Longer answer: The solution you describe from the docs is basically a worse version of idea 2 from that summary above. Worse in that it's more complicated, and it won't unset LD_LIBRARY_PATH when the environment is deactivated. But as mentioned above, idea 2 is not really a viable solution because LD_LIBRARY_PATH is a global environment variable, and modifying it has negative side effects on lots of other system packages besides TensorFlow.

And, to reiterate again, all of these "solutions" are downgrades from the behaviour prior to TensorFlow 2.7, where TensorFlow just correctly detected the CUDA libraries without requiring any manual intervention from users.

drasmuss avatar Mar 01 '23 13:03 drasmuss

@drasmuss, I'm just curious whether you have observed the same behavior in version 2.11. Also, since the 2.12 release is around the corner, you could wait a few days and check it, since we are bumping the supported CUDA version to 11.8. Thanks!

sachinprasadhs avatar Mar 13 '23 17:03 sachinprasadhs

Yes, the behaviour is the same in 2.11 and 2.12.0rc1 (I wouldn't expect it to change between rc1 and the full 2.12 release).

Note that in 2.12 the error message has changed, so it displays

2023-03-13 14:41:41.580759: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-03-13 14:41:41.602435: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.

instead of the old "Could not load dynamic library..." errors, but it's the same issue.

drasmuss avatar Mar 13 '23 17:03 drasmuss

did you guys solve this problem?

githubskiy avatar May 14 '23 20:05 githubskiy