docs Specify additional steps to utilize GPU for Linux users

Apr 08 '24 11:04 sgkouzias

@MarkDaoust @markmcd

Apr 09 '24 15:04 8bitmp3

@haifeng-jin , @MarkDaoust, @8bitmp3 I await any suggestions or revisions if needed. Do we have any updates?

May 10 '24 11:05 sgkouzias

As far as I remember, the currently recommended way to install TF is with pip. I do not have further information on this; @MarkDaoust may be able to comment.

May 20 '24 18:05 haifeng-jin

As far as I remember, the currently recommended way to install TF is with pip. I do not have further information on this; @MarkDaoust may be able to comment.

@haifeng-jin it seems practically impossible for someone with a CUDA-enabled GPU to run deep learning experiments locally on that GPU with TensorFlow 2.16.1 without manually performing some extra steps. As of today, those steps are not included in the official TensorFlow installation instructions for Linux users with GPUs, not even as a temporary fix!

It turns out that when you pip install tensorflow[and-cuda] all required NVIDIA libraries are installed as well. You just need to configure manually the environment variables as appropriate in order to utilize them and run TensorFlow with GPU.

May 20 '24 18:05 sgkouzias

It turns out that when you pip install tensorflow[and-cuda] all required NVIDIA libraries are installed as well. You just need to configure manually the environment variables as appropriate in order to utilize them and run TensorFlow with GPU.

Can we instead add these to the install guide?

May 23 '24 21:05 mihaimaruseac

configure manually the environment variables as appropriate

@mihaimaruseac shouldn't we explain/specify how to configure manually the environment variables as appropriate?
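One possible way to work out the value, as a minimal sketch rather than the documented procedure: collect every nvidia/<component>/lib directory that pip placed in site-packages and prepend them to LD_LIBRARY_PATH. The helper names below are hypothetical, and the layout (a top-level nvidia directory inside site-packages) is an assumption based on how the pip-installed NVIDIA packages appear in this thread.

```python
import os
import sysconfig
from pathlib import Path


def nvidia_lib_dirs(site_packages):
    """Return every nvidia/<component>/lib directory under site_packages, sorted."""
    root = Path(site_packages) / "nvidia"
    return sorted(p for p in root.glob("*/lib") if p.is_dir())


def ld_library_path(site_packages, existing=""):
    """Prepend the NVIDIA lib directories to an existing LD_LIBRARY_PATH value."""
    parts = [str(p) for p in nvidia_lib_dirs(site_packages)]
    if existing:
        parts.append(existing)
    return os.pathsep.join(parts)


if __name__ == "__main__":
    # Assumption: "purelib" is where pip installed the nvidia packages
    # for the currently active interpreter (true for a standard venv).
    site_packages = sysconfig.get_paths()["purelib"]
    value = ld_library_path(site_packages, os.environ.get("LD_LIBRARY_PATH", ""))
    # Print a line that can be pasted into the shell before starting Python.
    print(f'export LD_LIBRARY_PATH="{value}"')
```

The script only prints an export line; it does not change the environment of the shell that runs it, so the printed line still has to be evaluated before launching TensorFlow.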

May 24 '24 13:05 sgkouzias

Why is conda mentioned in this patch? It makes the install guide more convoluted and seems unnecessary to me.

Jun 12 '24 17:06 Tachi107

Why is conda mentioned in this patch? It makes the install guide more convoluted and seems unnecessary to me.

@Tachi107 I agree. Should I remove everything related to conda (referred to as option 1) and keep just one suggested option (creating a venv virtual environment)? Perhaps that would be better and more straightforward.

Jun 12 '24 17:06 sgkouzias

Note that I'm not a tensorflow maintainer, just a casual user who happened to stumble upon this patch. But yeah, if I were you I would just show how to set up the venv. Conda users should already know how to do that with their non-default setup :)

Jun 12 '24 18:06 Tachi107

Note that I'm not a tensorflow maintainer, just a casual user who happened to stumble upon this patch. But yeah, if I were you I would just show how to set up the venv. Conda users should already know how to do that with their non-default setup :)

@Tachi107 thank you. It seems very reasonable to simplify the guide like that. However, for now I will keep it as is and await the maintainers' comments as well.

Jun 12 '24 18:06 sgkouzias

@haifeng-jin , @MarkDaoust, @8bitmp3 I await any suggestions or revisions if needed. Do we have any updates?

Jun 17 '24 09:06 sgkouzias

There is no need to use conda; a standard venv works fine. In 2.15, TensorFlow knew to look for the NVIDIA binaries installed with pip. With TF 2.16, you can help it by placing the binaries on LD_LIBRARY_PATH, as suggested in this PR, or by creating symlinks from the TF package to the pip-installed nvidia packages. E.g.,

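# Create and activate a fresh virtual environment
python -m venv my-venv
source my-venv/bin/activate
# Quote the extra so the shell does not treat [and-cuda] as a glob pattern
python -m pip install "tensorflow[and-cuda]"
# Symlink the pip-installed NVIDIA libraries into the tensorflow package directory
pushd "$(dirname "$(python -c 'print(__import__("tensorflow").__file__)')")"
ln -svf ../nvidia/*/lib/*.so* .
popd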

This produces output like:

'./libcublasLt.so.12' -> '../nvidia/cublas/lib/libcublasLt.so.12'
'./libcublas.so.12' -> '../nvidia/cublas/lib/libcublas.so.12'
'./libnvblas.so.12' -> '../nvidia/cublas/lib/libnvblas.so.12'
'./libcheckpoint.so' -> '../nvidia/cuda_cupti/lib/libcheckpoint.so'
'./libcupti.so.12' -> '../nvidia/cuda_cupti/lib/libcupti.so.12'
'./libnvperf_host.so' -> '../nvidia/cuda_cupti/lib/libnvperf_host.so'
'./libnvperf_target.so' -> '../nvidia/cuda_cupti/lib/libnvperf_target.so'
'./libpcsamplingutil.so' -> '../nvidia/cuda_cupti/lib/libpcsamplingutil.so'
'./libnvrtc-builtins.so.12.3' -> '../nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.12.3'
'./libnvrtc.so.12' -> '../nvidia/cuda_nvrtc/lib/libnvrtc.so.12'
'./libcudart.so.12' -> '../nvidia/cuda_runtime/lib/libcudart.so.12'
'./libcudnn_adv_infer.so.8' -> '../nvidia/cudnn/lib/libcudnn_adv_infer.so.8'
'./libcudnn_adv_train.so.8' -> '../nvidia/cudnn/lib/libcudnn_adv_train.so.8'
'./libcudnn_cnn_infer.so.8' -> '../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8'
'./libcudnn_cnn_train.so.8' -> '../nvidia/cudnn/lib/libcudnn_cnn_train.so.8'
'./libcudnn_ops_infer.so.8' -> '../nvidia/cudnn/lib/libcudnn_ops_infer.so.8'
'./libcudnn_ops_train.so.8' -> '../nvidia/cudnn/lib/libcudnn_ops_train.so.8'
'./libcudnn.so.8' -> '../nvidia/cudnn/lib/libcudnn.so.8'
'./libcufft.so.11' -> '../nvidia/cufft/lib/libcufft.so.11'
'./libcufftw.so.11' -> '../nvidia/cufft/lib/libcufftw.so.11'
'./libcurand.so.10' -> '../nvidia/curand/lib/libcurand.so.10'
'./libcusolverMg.so.11' -> '../nvidia/cusolver/lib/libcusolverMg.so.11'
'./libcusolver.so.11' -> '../nvidia/cusolver/lib/libcusolver.so.11'
'./libcusparse.so.12' -> '../nvidia/cusparse/lib/libcusparse.so.12'
'./libnccl.so.2' -> '../nvidia/nccl/lib/libnccl.so.2'
'./libnvJitLink.so.12' -> '../nvidia/nvjitlink/lib/libnvJitLink.so.12'

This is essentially what we do from the R interface in tensorflow::install_tensorflow() and keras3::install_keras()
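For readers who prefer to script this step, here is a rough Python equivalent of the `ln -svf ../nvidia/*/lib/*.so* .` line. It is only a sketch: `link_nvidia_libs` is a hypothetical helper, not part of TensorFlow or of the R packages, and it assumes the nvidia directory sits next to the tensorflow package directory, as in the output above. Unlike `ln -f`, it replaces stale symlinks only and leaves regular files alone.

```python
import os
from pathlib import Path


def link_nvidia_libs(tensorflow_dir):
    """Symlink ../nvidia/*/lib/*.so* into tensorflow_dir and return the links."""
    tf_dir = Path(tensorflow_dir)
    nvidia_root = tf_dir.parent / "nvidia"
    created = []
    for lib in sorted(nvidia_root.glob("*/lib/*.so*")):
        link = tf_dir / lib.name
        if link.is_symlink():
            link.unlink()  # replace a stale link from an earlier run
        elif link.exists():
            continue  # never overwrite a real file shipped with the package
        # Use a relative target, matching the shell version's ../nvidia/... links.
        link.symlink_to(os.path.relpath(lib, tf_dir))
        created.append(link)
    return created
```

Pass it the directory that contains tensorflow's `__init__.py`, i.e. the same directory the `pushd` line above changes into.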

Jun 17 '24 16:06 t-kalinowski

@t-kalinowski thank you very much for your valuable advice. I revised the PR accordingly.

Jun 17 '24 19:06 sgkouzias

@sgkouzias if you also create a symlink at my-venv/bin/ptxas -> my-venv/lib/python.../site-packages/.../bin/ptxas, then you could probably get away without requiring users to modify the default activate and deactivate scripts.

Jun 17 '24 20:06 t-kalinowski

@sgkouzias if you also create a symlink at my-venv/bin/ptxas -> my-venv/lib/python.../site-packages/.../bin/ptxas, then you could probably get away without requiring users to modify the default activate and deactivate scripts.

@t-kalinowski thank you so much for your advice. The instructions have been completely revised as per your comments. Users are no longer required to modify the default activate and deactivate scripts. The instructions should now more or less resemble what you do in the R interface.

Jun 18 '24 16:06 sgkouzias

@8bitmp3 , @haifeng-jin , @MarkDaoust even TensorFlow version 2.17.0.rc0 still requires additional steps before Linux users can utilize the GPU. The instructions suggested in this pull request offer a tested solution. I await your comments.

Jun 19 '24 16:06 sgkouzias

@learning-to-play, @SeeForTwo, @8bitmp3, @haifeng-jin, @MarkDaoust, @markmcd

Unfortunately, the latest release, TensorFlow 2.16.2, does not fix the ptxas bug. When running a training script I get the error:

ptxas returned an error during compilation of ptx to sass: 'INTERNAL: ptxas 12.3.103 has a bug that we think can affect XLA. Please use a different version.' If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided. Aborted (core dumped)

So it seems that TensorFlow 2.16.2 fails to work with GPUs as well!

Notes:

Successful installation was verified by running: python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
The solution included in the submitted pull request (pending review) got rid of the ptxas bug and ultimately made TensorFlow 2.16.2 work with my GPU:

ln -sf $(find $(dirname $(dirname $(python -c "import nvidia.cuda_nvcc; print(nvidia.cuda_nvcc.__file__)"))/*/bin/) -name ptxas -print -quit) $VIRTUAL_ENV/bin/ptxas
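The nested one-liner above is hard to read, so here is an unpacked sketch of the same idea in Python: look for a ptxas binary under nvidia/*/bin in site-packages and link it into the venv's bin directory. `link_ptxas` is a hypothetical helper, and the directory layout is assumed from the command above. It replaces an existing symlink but raises FileExistsError if a regular file named ptxas is already there, which is stricter than `ln -sf`.

```python
from pathlib import Path


def link_ptxas(site_packages, venv_dir):
    """Link the first nvidia/*/bin/ptxas under site_packages into venv_dir/bin.

    Returns the created link, or None when no ptxas binary is found.
    """
    nvidia_root = Path(site_packages) / "nvidia"
    candidates = sorted(nvidia_root.glob("*/bin/ptxas"))
    if not candidates:
        return None
    link = Path(venv_dir) / "bin" / "ptxas"
    if link.is_symlink():
        link.unlink()  # replace a link left over from an earlier run
    # Absolute target, like the path that `find` prints in the shell command.
    link.symlink_to(candidates[0].resolve())
    return link
```

Inside an activated venv, `venv_dir` would be the value of the VIRTUAL_ENV environment variable, which is what the shell command uses as well.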

Jul 01 '24 11:07 sgkouzias

Thank you for the contribution, @sgkouzias :) Given that the [and-cuda] installation now does detect pip-installed CUDA components again, please add a disclaimer specifying that the symbolic links are only necessary when the intended way doesn't work, i.e. when the components aren't being detected and/or conflict with an existing system CUDA installation (like ptxas for you).
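As a rough illustration of the "only if the intended way doesn't work" condition, the sketch below checks whether TensorFlow sees any GPU before suggesting the workaround. The function names are hypothetical; `tf.config.list_physical_devices("GPU")` is the same call used for verification elsewhere in this thread. Note that it covers only the "components aren't being detected" case: a ptxas conflict shows up later, at compile time, even when the GPU is listed.

```python
import importlib.util


def visible_gpus():
    """Return the GPUs TensorFlow can see, or None if TensorFlow is not installed."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf  # imported lazily; the import can take several seconds

    return tf.config.list_physical_devices("GPU")


def needs_workaround(gpus):
    """True only when TensorFlow is installed but reports no GPU at all."""
    return gpus is not None and len(gpus) == 0
```

Calling `needs_workaround(visible_gpus())` right after installation gives a quick yes/no on whether the symlink steps are worth trying.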

Jul 02 '24 13:07 belitskiy

Thank you for the contribution, @sgkouzias :) Given that the [and-cuda] installation now does detect pip-installed CUDA components again, please add a disclaimer specifying that the symbolic links are only necessary when the intended way doesn't work, i.e. when the components aren't being detected and/or conflict with an existing system CUDA installation (like ptxas for you).

@belitskiy, @learning-to-play I revised the instructions as advised and await your feedback. It is an honor to contribute to the TensorFlow community.

Jul 02 '24 16:07 sgkouzias

Thanks for all your work everyone (especially @sgkouzias)!

I just tweaked the order so that this new GPU debugging step is after the step where you test the GPU.

I think this is still right so I'm merging it. But LMK if I misunderstood anything.

Sep 04 '24 21:09 MarkDaoust

Thanks for all your work everyone (especially @sgkouzias)!

I just tweaked the order so that this new GPU debugging step is after the step where you test the GPU.

I think this is still right so I'm merging it. But LMK if I misunderstood anything.

Thank you @MarkDaoust 🙏 it is my honour. I noticed you mentioned merging, but it seems the pull request still needs a formal review due to branch protection rules. Could you please take a quick look and approve it when you have a chance? Many thanks again!

Sep 05 '24 16:09 sgkouzias

Really, it has everything it needs; we're just waiting for the internal merge. It should be through soon.

Sep 05 '24 17:09 MarkDaoust
