[BUG] How to specify TENSORFLOW_ROOT during install?

Open romankempt opened this issue 2 years ago • 15 comments

Bug summary

I'm trying to install DeePMD-kit from source on an HPC cluster, with GPU support and a custom prefix. TensorFlow is already installed, but during the installation pip fails to find the TensorFlow headers (which are there).

Install command: pip install --prefix=$PREFIX -vvv deepmd-kit/ --no-build-isolation --no-dependencies

Importing tensorflow in python works fine and executing python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))" also works. Just skbuild can't find the headers, which are here for example: /p/software/juwelsbooster/stages/2022/software/TensorFlow/2.6.0-gcccoremkl-11.2.0-2021.4.0-CUDA-11.5/lib/python3.9/site-packages/tensorflow/include/tensorflow/core/public/session.h

How do I tell the installation where to find those headers?

DeePMD-kit Version

2.1.4

TensorFlow Version

2.6.0

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

Relevant output:

  -- Configuring incomplete, errors occurred!
  See also "/tmp/pip-req-build-u5kc1d1a/_cmake_test_compile/build/CMakeFiles/CMakeOutput.log".
  Not searching for unused variables given on the command line.
  -- The C compiler identification is GNU 11.2.0
  -- Detecting C compiler ABI info
  -- Detecting C compiler ABI info - done
  -- Check for working C compiler: /p/software/juwelsbooster/stages/2022/software/GCCcore/11.2.0/bin/cc - skipped
  -- Detecting C compile features
  -- Detecting C compile features - done
  -- The CXX compiler identification is GNU 11.2.0
  -- Detecting CXX compiler ABI info
  -- Detecting CXX compiler ABI info - done
  -- Check for working CXX compiler: /p/software/juwelsbooster/stages/2022/software/GCCcore/11.2.0/bin/c++ - skipped
  -- Detecting CXX compile features
  -- Detecting CXX compile features - done
  -- Configuring done
  -- Generating done
  -- Build files have been written to: /tmp/pip-req-build-u5kc1d1a/_cmake_test_compile/build
  -- The C compiler identification is GNU 11.2.0
  -- The CXX compiler identification is GNU 11.2.0
  -- Detecting C compiler ABI info
  -- Detecting C compiler ABI info - done
  -- Check for working C compiler: /p/software/juwelsbooster/stages/2022/software/GCCcore/11.2.0/bin/cc - skipped
  -- Detecting C compile features
  -- Detecting C compile features - done
  -- Detecting CXX compiler ABI info
  -- Detecting CXX compiler ABI info - done
  -- Check for working CXX compiler: /p/software/juwelsbooster/stages/2022/software/GCCcore/11.2.0/bin/c++ - skipped
  -- Detecting CXX compile features
  -- Detecting CXX compile features - done
  -- Found Git: /usr/bin/git (found version "2.31.1")
  -- Supported model version: 1.1
  -- Found CUDA: /p/software/juwelsbooster/stages/2022/software/CUDA/11.5 (found version "11.5")
  -- Found CUDA in /p/software/juwelsbooster/stages/2022/software/CUDA/11.5, build nv GPU support
  -- Will not build AMD GPU support
  CMake Error at cmake/Findtensorflow.cmake:84 (message):
    Not found 'include/tensorflow/core/public/session.h' directory or other
    header files in path
    '/p/software/juwelsbooster/stages/2022/software/TensorFlow/2.6.0-gcccoremkl-11.2.0-2021.4.0-CUDA-11.5/lib/python3.9/site-packages/tensorflow;/p/software/juwelsbooster/stages/2022/software/TensorFlow/2.6.0-gcccoremkl-11.2.0-2021.4.0-CUDA-11.5/lib/python3.9/site-packages/tensorflow/../tensorflow_core;/p/software/juwelsbooster/stages/2022/software/TensorFlow/2.6.0-gcccoremkl-11.2.0-2021.4.0-CUDA-11.5/lib/python3.9/site-packages/tensorflow/../../../..'
    You can manually set the tensorflow install path by -DTENSORFLOW_ROOT
  Call Stack (most recent call first):
    CMakeLists.txt:88 (find_package)


  -- Configuring incomplete, errors occurred!
  See also "/tmp/pip-req-build-u5kc1d1a/_skbuild/linux-x86_64-3.9/cmake-build/CMakeFiles/CMakeOutput.log".
    File "/p/project/eat2d/libs/install_deepmd_gpu/lib/python3.9/site-packages/skbuild/setuptools_wrap.py", line 637, in setup
      env = cmkr.configure(
    File "/p/project/eat2d/libs/install_deepmd_gpu/lib/python3.9/site-packages/skbuild/cmaker.py", line 328, in configure
      raise SKBuildError(

Steps to Reproduce

Log in to the HPC cluster, load the preinstalled Python and TensorFlow modules, then try to install DeePMD-kit with pip using a custom prefix.

Further Information, Files, and Links

No response

romankempt avatar Sep 28 '22 10:09 romankempt

These files should exist:
https://github.com/deepmodeling/deepmd-kit/blob/6e3d4a626af965e951298f1bce9a9d0a2bbda317/source/cmake/Findtensorflow.cmake#L65-L69
https://github.com/deepmodeling/deepmd-kit/blob/6e3d4a626af965e951298f1bce9a9d0a2bbda317/source/cmake/Findtensorflow.cmake#L76

njzjz avatar Sep 28 '22 19:09 njzjz

All these files exist in a directory path like this:

... lib/python3.9/site-packages/tensorflow/include/tensorflow/core/framework/
... lib/python3.9/site-packages/tensorflow/include/tensorflow/core/public/
... lib/python3.9/site-packages/tensorflow/include/tensorflow/core/platform/

But CMake does not find them. This might be related to the discussion at https://github.com/deepmodeling/deepmd-kit/discussions/272, but I can't tell whether a solution was found there.

I tried modifying TENSORFLOW_ROOT in the setup.py, so far with no success.

romankempt avatar Sep 29 '22 10:09 romankempt

In #272, @hanyecn just didn't have a protobuf header. Do you have lib/python3.9/site-packages/tensorflow/include/google/protobuf/type.pb.h? It should be included in any version of TensorFlow package.

By the way, how did you install TensorFlow?

njzjz avatar Sep 29 '22 19:09 njzjz
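
For anyone checking a similar setup, here is a minimal sketch that tests a candidate TensorFlow directory for the two headers named in this thread. The function name `missing_headers` and the `REQUIRED_HEADERS` tuple are illustrative only; the actual Findtensorflow.cmake may check more files than these.

```python
from pathlib import Path

# Only the headers explicitly mentioned in this thread; the real CMake
# module may require additional files (assumption: this list is incomplete).
REQUIRED_HEADERS = (
    "include/tensorflow/core/public/session.h",
    "include/google/protobuf/type.pb.h",
)


def missing_headers(tensorflow_root, required=REQUIRED_HEADERS):
    """Return the relative header paths that do not exist under tensorflow_root."""
    root = Path(tensorflow_root)
    return [rel for rel in required if not (root / rel).is_file()]
```

Calling `missing_headers(...)` on the `site-packages/tensorflow` directory should return an empty list for a PyPI-style install; a non-empty result names the headers CMake will not be able to find there.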

Sorry for the delay!

TensorFlow was preinstalled by the HPC support team (I've struggled to install TensorFlow myself with NCCL etc.). The google/protobuf directory is missing; the only protobuf headers I could find are tensorflow/include/tensorflow/core/framework/full_type.pb.h

romankempt avatar Oct 05 '22 11:10 romankempt

If the TensorFlow library is linked against an external protobuf, you can find it by running ldd on libtensorflow_framework.so.

njzjz avatar Oct 05 '22 19:10 njzjz

The output of this is:

ldd /p/software/juwelsbooster/stages/2022/software/TensorFlow/2.6.0-gcccoremkl-11.2.0-2021.4.0-CUDA-11.5/lib/python3.9/site-packages/tensorflow/libtensorflow_framework.so.2
        linux-vdso.so.1 (0x00007fff56f58000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x000014828c7d0000)
        libm.so.6 => /usr/lib64/libm.so.6 (0x000014828c44e000)
        libprotobuf.so.3.17.3.0 => /p/software/juwelsbooster/stages/2022/software/protobuf/3.17.3-GCCcore-11.2.0/lib/libprotobuf.so.3.17.3.0 (0x000014828c14c000)
        libsnappy.so.1 => /p/software/juwelsbooster/stages/2022/software/snappy/1.1.9-GCCcore-11.2.0/lib/libsnappy.so.1 (0x000014828deef000)
        libdouble-conversion.so.3 => /p/software/juwelsbooster/stages/2022/software/double-conversion/3.1.6-GCCcore-11.2.0/lib/libdouble-conversion.so.3 (0x000014828dedd000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x000014828bf2c000)
        libgif.so.7 => /p/software/juwelsbooster/stages/2022/software/giflib/5.2.1-GCCcore-11.2.0/lib/libgif.so.7 (0x000014828ded3000)
        libjpeg.so.8 => /p/software/juwelsbooster/stages/2022/software/libjpeg-turbo/2.1.1-GCCcore-11.2.0/lib/libjpeg.so.8 (0x000014828de27000)
        libz.so.1 => /p/software/juwelsbooster/stages/2022/software/zlib/1.2.11-GCCcore-11.2.0/lib/libz.so.1 (0x000014828de0c000)
        libstdc++.so.6 => /p/software/juwelsbooster/stages/2022/software/GCCcore/11.2.0/lib64/libstdc++.so.6 (0x000014828bd00000)
        libgcc_s.so.1 => /p/software/juwelsbooster/stages/2022/software/GCCcore/11.2.0/lib64/libgcc_s.so.1 (0x000014828ddf2000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x000014828b93b000)
        /lib64/ld-linux-x86-64.so.2 (0x000014828dce2000)

What is the correct procedure then? Do I need to point TENSORFLOW_ROOT at the protobuf directory?

romankempt avatar Oct 06 '22 09:10 romankempt
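
The lookup in the ldd output above can also be scripted. This is a rough sketch, not part of DeePMD-kit: `find_linked_library` and `guess_include_dir` are hypothetical helper names, and the second one assumes the conventional layout where headers live in `<prefix>/include` next to `<prefix>/lib`.

```python
from pathlib import PurePosixPath


def find_linked_library(ldd_output, name_prefix="libprotobuf.so"):
    """Return the resolved path of the first library whose soname starts
    with name_prefix, or None if no such library was resolved."""
    for line in ldd_output.splitlines():
        soname, arrow, rest = line.strip().partition(" => ")
        if arrow and soname.startswith(name_prefix):
            parts = rest.split()
            # Unresolved entries read "not found" instead of an absolute path.
            if parts and parts[0].startswith("/"):
                return parts[0]
    return None


def guess_include_dir(library_path):
    """Assumption: the library sits in <prefix>/lib (or lib64) and the
    matching headers are in <prefix>/include."""
    lib_dir = PurePosixPath(library_path).parent
    return str(lib_dir.parent / "include")
```

The text passed to `find_linked_library` is the captured output of `ldd` on `libtensorflow_framework.so.2`; the guessed include directory still needs to be checked by hand for `google/protobuf/`.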

Yes.

njzjz avatar Oct 06 '22 18:10 njzjz

The cmake module needs to find it automatically - let me consider how to implement it.

njzjz avatar Oct 06 '22 18:10 njzjz

Please check whether #1975 works for you.

njzjz avatar Oct 06 '22 21:10 njzjz

Thank you for responding so quickly! The current devel branch raises the following error:

  -- Found CUDA: /p/software/juwelsbooster/stages/2022/software/CUDA/11.5 (found version "11.5")
  -- Found CUDA in /p/software/juwelsbooster/stages/2022/software/CUDA/11.5, build nv GPU support
  -- Will not build AMD GPU support
  -- Disabled cpp interface build, looking for tensorflow_framework
  -- protoTensorFlow_INCLUDE_DIRS_GOOGLE-NOTFOUND
  -- Protobuf headers are not found in the directory of TensorFlow, assuming external protobuf was used to build TensorFlow
  CMake Warning (dev) at cmake/Findtensorflow.cmake:157 (file):
    You have used file(GET_RUNTIME_DEPENDENCIES) in project mode.  This is
    probably not what you intended to do.  Instead, please consider using it in
    an install(CODE) or install(SCRIPT) command.  For example:

      install(CODE [[
        file(GET_RUNTIME_DEPENDENCIES
          # ...
          )
        ]])
  Call Stack (most recent call first):
    CMakeLists.txt:95 (find_package)
  This warning is for project developers.  Use -Wno-dev to suppress it.

  CMake Error at cmake/Findtensorflow.cmake:157 (file):
    file Could not resolve file libdouble-conversion.so.3
  Call Stack (most recent call first):
    CMakeLists.txt:95 (find_package)

romankempt avatar Oct 07 '22 08:10 romankempt

I think the reason is that "file(GET_RUNTIME_DEPENDENCIES) does not support LD_LIBRARY_PATH".

Could you print the output of LD_DEBUG=libs ldd /p/software/juwelsbooster/stages/2022/software/TensorFlow/2.6.0-gcccoremkl-11.2.0-2021.4.0-CUDA-11.5/lib/python3.9/site-packages/tensorflow/libtensorflow_framework.so.2? Also, I'd like to see your LD_LIBRARY_PATH.

njzjz avatar Oct 07 '22 19:10 njzjz

I expect #1976 to resolve the error in your environment.

njzjz avatar Oct 07 '22 20:10 njzjz

Thank you for all your work!

I've attached the LD_DEBUG log (LD_DEBUG.txt); the LD_LIBRARY_PATH is at the end of the file.

With #1976, I get the following error:

  -- Found CUDA: /p/software/juwelsbooster/stages/2022/software/CUDA/11.5 (found version "11.5")
  -- Found CUDA in /p/software/juwelsbooster/stages/2022/software/CUDA/11.5, build nv GPU support
  -- Will not build AMD GPU support
  -- Disabled cpp interface build, looking for tensorflow_framework
  -- Protobuf headers are not found in the directory of TensorFlow, assuming external protobuf was used to build TensorFlow
  CMake Warning (dev) at cmake/Findtensorflow.cmake:156 (file):
    You have used file(GET_RUNTIME_DEPENDENCIES) in project mode.  This is
    probably not what you intended to do.  Instead, please consider using it in
    an install(CODE) or install(SCRIPT) command.  For example:

      install(CODE [[
        file(GET_RUNTIME_DEPENDENCIES
          # ...
          )
        ]])
  Call Stack (most recent call first):
    CMakeLists.txt:95 (find_package)
  This warning is for project developers.  Use -Wno-dev to suppress it.

  CMake Error at cmake/Findtensorflow.cmake:172 (message):
    TensorFlow is not linked to protobuf
  Call Stack (most recent call first):
    CMakeLists.txt:95 (find_package)

romankempt avatar Oct 10 '22 08:10 romankempt

I see. The previous direction is correct, but I have to convert LD_LIBRARY_PATH to a cmake list.

njzjz avatar Oct 10 '22 19:10 njzjz

Please check #1982.

njzjz avatar Oct 10 '22 21:10 njzjz

The error persists, unfortunately. I modified Findtensorflow.cmake to print the libraries that CMake detects as linked to TensorFlow:

  -- UNRESOLVED_DEPENDENCIES_VAR
  -- libdouble-conversion.so.3
  -- libjpeg.so.8
  -- libprotobuf.so.3.17.3.0
  -- RESOLVED_DEPENDENCIES_VAR
  -- /usr/lib64/ld-linux-x86-64.so.2
  -- /usr/lib64/libc.so.6
  -- /usr/lib64/libdl.so.2
  -- /usr/lib64/libgcc_s.so.1
  -- /usr/lib64/libgif.so.7
  -- /usr/lib64/libm.so.6
  -- /usr/lib64/libpthread.so.0
  -- /usr/lib64/libsnappy.so.1
  -- /usr/lib64/libstdc++.so.6
  -- /usr/lib64/libz.so.1
  -- Protobuf headers are not found in the directory of TensorFlow, assuming external protobuf was used to build TensorFlow

The protobuf library ends up in UNRESOLVED_DEPENDENCIES_VAR.

romankempt avatar Oct 14 '22 08:10 romankempt

Sadly, I don't know why it doesn't work as I don't have such an environment to reproduce...

njzjz avatar Oct 14 '22 17:10 njzjz

Although the issue is still open, you can bypass it by manually symlinking the protobuf header directory into the TensorFlow header directory, so that the layout matches the package on PyPI.

njzjz avatar Oct 14 '22 20:10 njzjz
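
A minimal sketch of that symlink workaround, assuming write access to the TensorFlow installation and a protobuf install that keeps its headers under `<protobuf include>/google/protobuf`. The helper name `link_protobuf_headers` and both arguments are hypothetical, not part of DeePMD-kit or TensorFlow.

```python
import os
from pathlib import Path


def link_protobuf_headers(tensorflow_include, protobuf_include):
    """Create <tensorflow_include>/google as a symlink to
    <protobuf_include>/google, unless something already exists there."""
    source = Path(protobuf_include) / "google"
    target = Path(tensorflow_include) / "google"
    if not (source / "protobuf").is_dir():
        raise FileNotFoundError(f"no protobuf headers under {source}")
    if target.exists() or target.is_symlink():
        # Leave an existing directory or link untouched.
        return target
    os.symlink(source, target, target_is_directory=True)
    return target
```

Here `tensorflow_include` would be the `site-packages/tensorflow/include` directory and `protobuf_include` the include directory of the protobuf build that TensorFlow links against.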

Unfortunately, I can't modify the TensorFlow module, since it's provided by the HPC. The protobuf issue might also come down to read/access permissions, since only the libraries in /usr/ can be resolved...

romankempt avatar Oct 15 '22 09:10 romankempt