
[Bug] Build fails on Ubuntu

Open JasperJin01 opened this issue 4 months ago • 2 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
  • [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.

Describe the bug

I have been trying to build this project on Ubuntu, but it keeps failing and has been troubling me for a long time. I followed the [installation instructions for version 0.2.4](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/balance-serve.md) as well as some third-party installation tutorials. I am installing version 0.3.2, but I could not find installation instructions specific to that version.

My environment:

  • Ubuntu 24
  • Intel(R) Xeon(R) Gold 6454S NUMA x2
  • gcc 11.4, g++ 11.4, cmake 3.28
  • cuda 12.8

I have tried the install command many times and always hit errors. Along the way there were missing-package errors caused by CMake picking up the conda environment, and Float-related compile errors caused by the gcc/g++ version. Even after I worked through that series of problems, the build still fails. I then tried building the project together with an AI assistant:
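(A minimal sketch of pinning the system toolchain so that CMake does not pick up the conda copies; the specific exports below are an assumption about what may help, not taken from the official install docs:)

export CC=/usr/bin/gcc            # force the system compilers instead of conda's
export CXX=/usr/bin/g++
export CUDAHOSTCXX=/usr/bin/g++   # host compiler used by nvcc (assumption; adjust if needed)
which cmake gcc g++ nvcc          # none of these should resolve inside the conda env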

The install command I ran:

export USE_NUMA=1
export PATH=/usr/bin:/usr/local/cuda-12.8/bin:$PATH && export PYTHONPATH=/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages:$PYTHONPATH && env USE_BALANCE_SERVE=1 bash ./install.sh 

The error output:

  [3/3] /usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output /data1/jinjm_data/dev/k32/ktransformers/build/temp.linux-x86_64-cpython-311/csrc/custom_marlin/gptq_marlin/gptq_marlin.o.d -I/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/include -I/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda-12.8/include -I/data1/jinjm_data/miniconda3/envs/kt/include/python3.11 -c -c /data1/jinjm_data/dev/k32/ktransformers/csrc/custom_marlin/gptq_marlin/gptq_marlin.cu -o /data1/jinjm_data/dev/k32/ktransformers/build/temp.linux-x86_64-cpython-311/csrc/custom_marlin/gptq_marlin/gptq_marlin.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -Xcompiler -fPIC -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=vLLMMarlin -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 -std=c++17
  g++ -pthread -B /data1/jinjm_data/miniconda3/envs/kt/compiler_compat -shared -Wl,-rpath,/data1/jinjm_data/miniconda3/envs/kt/lib -Wl,-rpath-link,/data1/jinjm_data/miniconda3/envs/kt/lib -L/data1/jinjm_data/miniconda3/envs/kt/lib -Wl,-rpath,/data1/jinjm_data/miniconda3/envs/kt/lib -Wl,-rpath-link,/data1/jinjm_data/miniconda3/envs/kt/lib -L/data1/jinjm_data/miniconda3/envs/kt/lib /data1/jinjm_data/dev/k32/ktransformers/build/temp.linux-x86_64-cpython-311/csrc/custom_marlin/binding.o /data1/jinjm_data/dev/k32/ktransformers/build/temp.linux-x86_64-cpython-311/csrc/custom_marlin/gptq_marlin/gptq_marlin.o /data1/jinjm_data/dev/k32/ktransformers/build/temp.linux-x86_64-cpython-311/csrc/custom_marlin/gptq_marlin/gptq_marlin_repack.o -L/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/lib -L/usr/local/cuda-12.8/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-311/vLLMMarlin.cpython-311-x86_64-linux-gnu.so
  -- Using compiler: /usr/bin/g++-13
  -- _GLIBCXX_USE_CXX11_ABI=1
  -- Could NOT find benchmark (missing: benchmark_DIR)
  -- The following features have been enabled:

   * Pull, support for pulling metrics
   * pkg-config, generate pkg-config files

  -- The following OPTIONAL packages have been found:

   * CURL

  -- The following REQUIRED packages have been found:

   * googlemock-3rdparty
   * civetweb-3rdparty

  -- The following features have been disabled:

   * Push, support for pushing metrics to a push-gateway
   * Compression, support for zlib compression of metrics
   * IYWU, include-what-you-use

  -- The following OPTIONAL packages have not been found:

   * benchmark

  -- xxHash build type: Release
  -- Architecture: x86_64
  -- pybind11 v2.14.0 dev1
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/__init__.py", line 1002, in <module>
      raise ImportError(
  ImportError: Failed to load PyTorch C extensions:
      It appears that PyTorch has loaded the `torch/_C` folder
      of the PyTorch repository rather than the C extensions which
      are expected in the `torch._C` namespace. This can occur when
      using the `install` workflow. e.g.
          $ python setup.py install && python -c "import torch"

      This error can generally be solved using the `develop` workflow
          $ python setup.py develop && python -c "import torch"  # This should succeed
      or by running Python from a different directory.
  -- Found PyTorch at:
  -- PyTorch: CUDA detected: 12.8
  -- PyTorch: CUDA nvcc is: /usr/local/cuda-12.8/bin/nvcc
  -- PyTorch: CUDA toolkit directory: /usr/local/cuda-12.8
  -- PyTorch: Header version is: 12.8
  CMake Warning at /data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:140 (message):
    Failed to compute shorthash for libnvrtc.so
  Call Stack (most recent call first):
    /data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:86 (include)
    /data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
    CMakeLists.txt:81 (find_package)


  -- USE_CUDNN is set to 0. Compiling without cuDNN support
  -- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support
  -- USE_CUDSS is set to 0. Compiling without cuDSS support
  -- USE_CUFILE is set to 0. Compiling without cuFile support
  -- Autodetected CUDA architecture(s):  9.0 9.0 9.0
  -- Added CUDA NVCC flags for: -gencode;arch=compute_90,code=sm_90
  CMake Warning at /data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
    static library kineto_LIBRARY-NOTFOUND not found.
  Call Stack (most recent call first):
    /data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:125 (append_torchlib_if_found)
    CMakeLists.txt:81 (find_package)


  -- _GLIBCXX_USE_CXX11_ABI=1
  -- Using aio
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/__init__.py", line 1002, in <module>
      raise ImportError(
  ImportError: Failed to load PyTorch C extensions:
      It appears that PyTorch has loaded the `torch/_C` folder
      of the PyTorch repository rather than the C extensions which
      are expected in the `torch._C` namespace. This can occur when
      using the `install` workflow. e.g.
          $ python setup.py install && python -c "import torch"

      This error can generally be solved using the `develop` workflow
          $ python setup.py develop && python -c "import torch"  # This should succeed
      or by running Python from a different directory.
  -- Found PyTorch at:
  CMake Warning at /data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
    static library kineto_LIBRARY-NOTFOUND not found.
  Call Stack (most recent call first):
    /data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:125 (append_torchlib_if_found)
    kvc2/CMakeLists.txt:70 (find_package)


  CMake Warning (dev) at kvc2/CMakeLists.txt:77 (find_package):
    Policy CMP0146 is not set: The FindCUDA module is removed.  Run "cmake
    --help-policy CMP0146" for policy details.  Use the cmake_policy command to
    set the policy and suppress this warning.

  This warning is for project developers.  Use -Wno-dev to suppress it.

  -- prometheus Found!
  -- CUDA Found!
  -- CUDA Version: 12.8
  -- CUDA Toolkit Root: /usr/local/cuda-12.8
  -- CMAKE_SOURCE_DIR: /data1/jinjm_data/dev/k32/ktransformers/csrc/balance_serve
  -- BUILD_PYTHON_EXT: OFF
  -- CMAKE_CXX_FLAGS of PhotonLibOS: -O3 -march=native
  -- Checking dependency zlib
  -- Will find zlib
  -- Checking dependency openssl
  -- Will find openssl
  -- Checking dependency aio
  -- Will find aio
  -- Checking dependency curl
  -- Will find curl
  -- Configuring done (2.3s)
  -- Generating done (0.1s)
  -- Build files have been written to: /data1/jinjm_data/dev/k32/ktransformers/csrc/balance_serve
  Error: could not load cache
  CMake args: ['-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/data1/jinjm_data/dev/k32/ktransformers/build/lib.linux-x86_64-cpython-311/', '-DPYTHON_EXECUTABLE=/data1/jinjm_data/miniconda3/envs/kt/bin/python3.11', '-DCMAKE_BUILD_TYPE=Release', '-DKTRANSFORMERS_USE_CUDA=ON', '-D_GLIBCXX_USE_CXX11_ABI=1']
  CMake args: ['-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/data1/jinjm_data/dev/k32/ktransformers/build/lib.linux-x86_64-cpython-311/', '-DPYTHON_EXECUTABLE=/data1/jinjm_data/miniconda3/envs/kt/bin/python3.11', '-DCMAKE_BUILD_TYPE=Release', '-DKTRANSFORMERS_USE_CUDA=ON', '-D_GLIBCXX_USE_CXX11_ABI=1', '-DLLAMA_NATIVE=ON', '-DEXAMPLE_VERSION_INFO=0.3.1+cu128torch28fancy']
  build_temp: /data1/jinjm_data/dev/k32/ktransformers/csrc/balance_serve/build
  Traceback (most recent call last):
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 389, in <module>
      main()
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 373, in main
      json_out["return_val"] = hook(**hook_input["kwargs"])
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 280, in build_wheel
      return _build_backend().build_wheel(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/build_meta.py", line 415, in build_wheel
      return self._build_with_temp_dir(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/build_meta.py", line 397, in _build_with_temp_dir
      self.run_setup()
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/build_meta.py", line 313, in run_setup
      exec(code, locals())
    File "<string>", line 668, in <module>
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/__init__.py", line 103, in setup
      return distutils.core.setup(**attrs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 184, in setup
      return run_commands(dist)
             ^^^^^^^^^^^^^^^^^^
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 200, in run_commands
      dist.run_commands()
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 970, in run_commands
      self.run_command(cmd)
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/dist.py", line 974, in run_command
      super().run_command(command)
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 989, in run_command
      cmd_obj.run()
    File "<string>", line 263, in run
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/command/bdist_wheel.py", line 373, in run
      self.run_command("build")
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 316, in run_command
      self.distribution.run_command(command)
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/dist.py", line 974, in run_command
      super().run_command(command)
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 989, in run_command
      cmd_obj.run()
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/_distutils/command/build.py", line 135, in run
      self.run_command(cmd_name)
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 316, in run_command
      self.distribution.run_command(command)
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/dist.py", line 974, in run_command
      super().run_command(command)
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 989, in run_command
      cmd_obj.run()
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 93, in run
      _build_ext.run(self)
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 359, in run
      self.build_extensions()
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1072, in build_extensions
      build_ext.build_extensions(self)
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 479, in build_extensions
      self._build_extensions_serial()
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 505, in _build_extensions_serial
      self.build_extension(ext)
    File "<string>", line 594, in build_extension
    File "<string>", line 370, in run_command_with_live_tail
    File "/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/subprocess.py", line 571, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['cmake', '--build', PosixPath('/data1/jinjm_data/dev/k32/ktransformers/csrc/balance_serve/build'), '--verbose', '--parallel=128']' returned non-zero exit status 1.
  error: subprocess-exited-with-error
  
  × Building wheel for ktransformers (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /data1/jinjm_data/miniconda3/envs/kt/bin/python3.11 /data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py build_wheel /tmp/tmp8ujfs3rl
  cwd: /data1/jinjm_data/dev/k32/ktransformers
  Building wheel for ktransformers (pyproject.toml) ... error
  ERROR: Failed building wheel for ktransformers
Failed to build ktransformers
ERROR: Failed to build installable wheels for some pyproject.toml based projects (ktransformers)

The build fails: the cmake build step returns a non-zero exit status. The balance_serve build directory exists but is empty, and there is also an error about a missing Makefile.

I am not sure how to resolve this.
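(One clue in the log above is the repeated `torch._C` ImportError; a minimal check, assuming the `kt` conda environment from the log, is whether torch imports cleanly when Python is started from a directory outside the source tree:)

cd /tmp
/data1/jinjm_data/miniconda3/envs/kt/bin/python3.11 -c "import torch; print(torch.__version__, torch.version.cuda)"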

Reproduction

export USE_NUMA=1
export PATH=/usr/bin:/usr/local/cuda-12.8/bin:$PATH && export PYTHONPATH=/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages:$PYTHONPATH && env USE_BALANCE_SERVE=1 bash ./install.sh

Environment

  • Ubuntu 24
  • Intel(R) Xeon(R) Gold 6454S NUMA x2
  • gcc 11.4, g++ 11.4, cmake 3.28
  • cuda 12.8

JasperJin01 commented Aug 29 '25 14:08

The problem always appears when the third step, [3/3], is executed:

[3/3] /usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output /data1/jinjm_data/dev/k32/ktransformers/build/temp.linux-x86_64-cpython-311/csrc/custom_marlin/gptq_marlin/gptq_marlin.o.d -I/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/include -I/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda-12.8/include -I/data1/jinjm_data/miniconda3/envs/kt/include/python3.11 -c -c /data1/jinjm_data/dev/k32/ktransformers/csrc/custom_marlin/gptq_marlin/gptq_marlin.cu -o /data1/jinjm_data/dev/k32/ktransformers/build/temp.linux-x86_64-cpython-311/csrc/custom_marlin/gptq_marlin/gptq_marlin.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -Xcompiler -fPIC -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=vLLMMarlin -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 -std=c++17
  g++ -pthread -B /data1/jinjm_data/miniconda3/envs/kt/compiler_compat -shared -Wl,-rpath,/data1/jinjm_data/miniconda3/envs/kt/lib -Wl,-rpath-link,/data1/jinjm_data/miniconda3/envs/kt/lib -L/data1/jinjm_data/miniconda3/envs/kt/lib -Wl,-rpath,/data1/jinjm_data/miniconda3/envs/kt/lib -Wl,-rpath-link,/data1/jinjm_data/miniconda3/envs/kt/lib -L/data1/jinjm_data/miniconda3/envs/kt/lib -L/usr/lib/x86_64-linux-gnu /data1/jinjm_data/dev/k32/ktransformers/build/temp.linux-x86_64-cpython-311/csrc/custom_marlin/binding.o /data1/jinjm_data/dev/k32/ktransformers/build/temp.linux-x86_64-cpython-311/csrc/custom_marlin/gptq_marlin/gptq_marlin.o /data1/jinjm_data/dev/k32/ktransformers/build/temp.linux-x86_64-cpython-311/csrc/custom_marlin/gptq_marlin/gptq_marlin_repack.o -L/data1/jinjm_data/miniconda3/envs/kt/lib/python3.11/site-packages/torch/lib -L/usr/local/cuda-12.8/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-311/vLLMMarlin.cpython-311-x86_64-linux-gnu.so

It looks like it is the balance_serve module that fails to build. When the AI tried building that module on its own it seemed to succeed, but the build fails as soon as install.sh is run.
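(I am not sure exactly what the AI ran; roughly, a standalone configure/build of csrc/balance_serve might look like the following, reusing the CMake arguments from the log above. This is only a sketch; the lower parallelism is just to make the first real error easier to spot:)

cd /data1/jinjm_data/dev/k32/ktransformers/csrc/balance_serve
rm -rf build   # start from a clean build directory
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DKTRANSFORMERS_USE_CUDA=ON -D_GLIBCXX_USE_CXX11_ABI=1 -DPYTHON_EXECUTABLE=/data1/jinjm_data/miniconda3/envs/kt/bin/python3.11
cmake --build build --verbose --parallel 8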

JasperJin01 commented Aug 29 '25 14:08

First run `XXX bash install.sh 2>&1 | tee build.log` to write all output from the build to build.log. When the build finishes, use `cat build.log | grep error` to see where it failed; if that does not reveal anything, try replacing `error` with other keywords, such as `not found`. After fixing the error, make sure to clear all of the balance_serve build cache, i.e. manually delete the csrc/balance_serve/build folder; you can also add the line `rm -rf csrc/balance_serve/build` to install.sh. The official install.sh does not clean the balance_serve build cache. I don't know why the maintainers left that out; I ended up doing all sorts of debugging for them and wasted a lot of time.
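(Putting the above into concrete commands; the `XXX` prefix stands for whatever environment variables you build with, e.g. the USE_NUMA/USE_BALANCE_SERVE settings from the original report:)

USE_NUMA=1 USE_BALANCE_SERVE=1 bash install.sh 2>&1 | tee build.log
grep -i error build.log             # if nothing useful shows up, try other keywords, e.g. grep -i "not found" build.log
rm -rf csrc/balance_serve/build     # after fixing the cause, clear the stale balance_serve build cache before rebuilding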

wqshmzh commented Sep 19 '25 10:09