[BUG] Compiling Error: ModuleNotFoundError: No module named 'cmake' and failed to set dynamic section sizes: bad value

Open SingL3 opened this issue 2 years ago • 1 comments

Describe the bug I am trying to install deepspeed with:

DS_BUILD_OPS=1 pip install deepspeed --global-option="build_ext" --global-option="-j8"

and the build fails for both ninja and deepspeed.

Log output For ninja:

  Building wheel for ninja (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for ninja (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [13 lines of output]
      Traceback (most recent call last):
        File "/mnt/data/conda/envs/deepspeed/bin/cmake", line 5, in <module>
          from cmake import cmake
      ModuleNotFoundError: No module named 'cmake'
      Traceback (most recent call last):
        File "/tmp/pip-build-env-a3h_bgq2/overlay/lib/python3.8/site-packages/skbuild/setuptools_wrap.py", line 645, in setup
          cmkr = cmaker.CMaker(cmake_executable)
        File "/tmp/pip-build-env-a3h_bgq2/overlay/lib/python3.8/site-packages/skbuild/cmaker.py", line 148, in __init__
          self.cmake_version = get_cmake_version(self.cmake_executable)
        File "/tmp/pip-build-env-a3h_bgq2/overlay/lib/python3.8/site-packages/skbuild/cmaker.py", line 105, in get_cmake_version
          raise SKBuildError(msg) from err

      Problem with the CMake installation, aborting build. CMake executable is /mnt/data/conda/envs/deepspeed/bin/cmake
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for ninja

For reference, importing cmake in an interactive Python session in the same environment works:

(deepspeed) root@a08720c6-3543-4062-b274:/mnt/home/deepspeed# python
Python 3.8.16 (default, Mar  2 2023, 03:21:46)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from cmake import cmake
>>>
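
The traceback above shows that /mnt/data/conda/envs/deepspeed/bin/cmake is itself a Python script (its line 5 runs `from cmake import cmake`), so it only works when the interpreter that launches it can import the `cmake` package. A minimal sketch for checking which kind of cmake is on PATH follows. The helper name `describe_cmake` is hypothetical, and treating a `#!` shebang as the sign of a script wrapper is a simplifying assumption. Whether pip's isolated build environment is what hides the package here is not confirmed in this thread.

```python
import shutil


def describe_cmake(path=None):
    """Classify a cmake executable as a script wrapper or a native binary.

    Hypothetical helper: a file that starts with "#!" is treated as a script
    (for example the entry point created by `pip install cmake`).
    """
    path = path or shutil.which("cmake")
    if path is None:
        return "not found"
    with open(path, "rb") as handle:
        head = handle.read(2)
    return "script wrapper" if head == b"#!" else "native binary"


if __name__ == "__main__":
    print(shutil.which("cmake"), "->", describe_cmake())
```

If this reports a script wrapper, the cmake found on PATH depends on a Python package, which is consistent with it working in the interactive session while failing inside the ninja wheel build.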

For deepspeed: the full log is too long to paste; the relevant errors look like this:

/usr/local/cuda/bin/nvcc -Icsrc/includes -I/mnt/data/conda/envs/deepspeed/lib/python3.8/site-packages/torch/include -I/mnt/data/conda/envs/deepspeed/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/mnt/data/conda/envs/deepspeed/lib/python3.8/site-packages/torch/include/TH -I/mnt/data/conda/envs/deepspeed/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/mnt/data/conda/envs/deepspeed/include/python3.8 -c csrc/random_ltd/pt_binding.cpp -o build/temp.linux-x86_64-cpython-38/csrc/random_ltd/pt_binding.o -O3 -std=c++14 -g -Wno-reorder -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=random_ltd_op -D_GLIBCXX_USE_CXX11_ABI=0
      nvcc fatal   : Unknown option '-Wno-reorder'
      creating build/lib.linux-x86_64-cpython-38
      creating build/lib.linux-x86_64-cpython-38/deepspeed
      creating build/lib.linux-x86_64-cpython-38/deepspeed/ops
      g++ -pthread -B /mnt/data/conda/envs/deepspeed/compiler_compat -Wl,--sysroot=/ -pthread -shared -B /mnt/data/conda/envs/deepspeed/compiler_compat -L/mnt/data/conda/envs/deepspeed/lib -Wl,-rpath=/mnt/data/conda/envs/deepspeed/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-cpython-38/csrc/utils/flatten_unflatten.o -L/mnt/data/conda/envs/deepspeed/lib/python3.8/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-38/deepspeed/ops/utils_op.cpython-38-x86_64-linux-gnu.so
      /mnt/data/conda/envs/deepspeed/compiler_compat/ld: build/temp.linux-x86_64-cpython-38/csrc/utils/flatten_unflatten.o: relocation R_X86_64_TPOFF32 against hidden symbol `_ZZN8pybind116handle15inc_ref_counterEmE7counter' can not be used when making a shared object
      /mnt/data/conda/envs/deepspeed/compiler_compat/ld: failed to set dynamic section sizes: bad value
      collect2: error: ld returned 1 exit status

and

      g++ -pthread -B /mnt/data/conda/envs/deepspeed/compiler_compat -Wl,--sysroot=/ -pthread -shared -B /mnt/data/conda/envs/deepspeed/compiler_compat -L/mnt/data/conda/envs/deepspeed/lib -Wl,-rpath=/mnt/data/conda/envs/deepspeed/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-cpython-38/csrc/transformer/cublas_wrappers.o build/temp.linux-x86_64-cpython-38/csrc/transformer/dropout_kernels.o build/temp.linux-x86_64-cpython-38/csrc/transformer/ds_transformer_cuda.o build/temp.linux-x86_64-cpython-38/csrc/transformer/gelu_kernels.o build/temp.linux-x86_64-cpython-38/csrc/transformer/general_kernels.o build/temp.linux-x86_64-cpython-38/csrc/transformer/normalize_kernels.o build/temp.linux-x86_64-cpython-38/csrc/transformer/softmax_kernels.o build/temp.linux-x86_64-cpython-38/csrc/transformer/transform_kernels.o -L/mnt/data/conda/envs/deepspeed/lib/python3.8/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda_cu -ltorch_cuda_cpp -o build/lib.linux-x86_64-cpython-38/deepspeed/ops/transformer/stochastic_transformer_op.cpython-38-x86_64-linux-gnu.so
      error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1
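
Because the full build log is too long to post, it can help to filter it down to the failing lines. The sketch below is one possible way to do that. The function name `extract_errors` and the pattern list are assumptions, built only from the error strings quoted in this issue, and they would need extending for other failures.

```python
import re

# Patterns derived from the error lines quoted in this issue (an assumption,
# not an exhaustive list of what pip, nvcc, or ld can emit).
ERROR_PATTERNS = [
    re.compile(r"nvcc fatal"),
    re.compile(r"\bld: "),
    re.compile(r"collect2: error"),
    re.compile(r"error: command "),
    re.compile(r"ModuleNotFoundError"),
]


def extract_errors(log_text):
    """Return (line_number, stripped_line) for every line matching a pattern."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in ERROR_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

Saving the pip output to a file (for example by adding `2>&1 | tee build.log` to the install command) and passing its contents to `extract_errors` gives a short list that is easier to paste into a report.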

To Reproduce

  • Create a conda env with python=3.8 and activate it
  • Install pytorch 1.13.1
  • The system cmake is version 3.10, so I ran pip install cmake==3.26.3 and symlinked the system cmake to the pip-installed one
  • Install libaio
  • apt install ninja-build
  • Install triton with pip install triton==1.0.0
  • Run DS_BUILD_OPS=1 pip install deepspeed --global-option="build_ext" --global-option="-j8"
  • Got the error: fatal error: cuda_profiler_api.h: No such file or directory
  • Found issue #2682 and followed the reply there: export PATH=/usr/local/cuda/bin:$PATH
  • Run DS_BUILD_OPS=1 pip install deepspeed --global-option="build_ext" --global-option="-j8" again
  • Got the errors shown above

Expected behavior Successfully installed.

ds_report output N/A


System info (please complete the following information):

  • OS: Ubuntu 18.04
  • GPU count and types: one machine with one A100
  • Python version: 3.8.16
  • Torch 1.13.1
  • cuda 11.2
  • cmake 3.26.3
  • triton 1.0.0
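
Since ds_report output was not available, the versions listed above could also be gathered with a short script. This is only a sketch: the helper names are hypothetical, and it assumes the packages were installed with pip or conda so that `importlib.metadata` can see them.

```python
import platform
import shutil
import subprocess
from importlib import metadata


def package_version(name):
    """Return the installed version of a Python package, if any."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def tool_version(executable, flag="--version"):
    """Return the raw version output of a command-line tool found on PATH."""
    path = shutil.which(executable)
    if path is None:
        return "not found"
    try:
        result = subprocess.run(
            [path, flag], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError) as exc:
        return f"failed to run: {exc}"
    output = (result.stdout or result.stderr).strip()
    return output or f"no output (exit code {result.returncode})"


def collect_system_info():
    """Collect the fields requested by the issue template."""
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "torch": package_version("torch"),
        "triton": package_version("triton"),
        "cmake": tool_version("cmake"),
        "nvcc": tool_version("nvcc"),
    }


if __name__ == "__main__":
    for key, value in collect_system_info().items():
        print(f"{key}: {value}")
```

A cmake wrapper that cannot import its Python package would show up here as a traceback in the `cmake` field rather than a version string.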

Docker context Not using Docker.

SingL3 avatar Apr 27 '23 08:04 SingL3

I solved this by simply removing the additional options, i.e. DS_BUILD_OPS=1 pip install deepspeed

gfzum avatar May 06 '23 02:05 gfzum

I solved this by simply removing the additional options, i.e. DS_BUILD_OPS=1 pip install deepspeed

This is what the DeepSpeed README suggests.

duli2012 avatar May 12 '23 17:05 duli2012
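
For anyone scripting the workaround from the comments above, here is a minimal Python sketch of the same command: DS_BUILD_OPS=1 is set in the environment and no --global-option arguments are passed. The function names are hypothetical, and the sketch assumes pip is available for the active interpreter.

```python
import os
import subprocess
import sys


def build_install_call(base_env=None):
    """Return (argv, env) for `DS_BUILD_OPS=1 pip install deepspeed`."""
    env = dict(os.environ if base_env is None else base_env)
    env["DS_BUILD_OPS"] = "1"
    # Use the current interpreter's pip so the active conda env is targeted.
    argv = [sys.executable, "-m", "pip", "install", "deepspeed"]
    return argv, env


def run_install():
    """Run the install; raises CalledProcessError if pip exits non-zero."""
    argv, env = build_install_call()
    subprocess.run(argv, env=env, check=True)
```

Calling `run_install()` from the activated environment is equivalent to typing the command by hand. Keeping the command construction in a separate function makes it easy to inspect before anything is installed.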