
build error: The batch manager library is truncated or incomplete

Open · nullxjx opened this issue on Jan 26 '24 · 2 comments

Description
Building Triton server with the TensorRT-LLM backend fails during CMake configuration with the error "The batch manager library is truncated or incomplete."

build command:

BASE_CONTAINER_IMAGE_NAME=nvcr.io/nvidia/tritonserver:23.12-py3-min
TENSORRTLLM_BACKEND_REPO_TAG=main
PYTHON_BACKEND_REPO_TAG=main

# Run the build script. The flags for some features or endpoints can be removed if not needed.
sudo ./build.py -v --no-container-interactive --enable-logging --enable-stats --enable-tracing \
              --enable-metrics --enable-gpu-metrics --enable-cpu-metrics \
              --filesystem=gcs --filesystem=s3 --filesystem=azure_storage \
              --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai \
              --backend=ensemble --enable-gpu \
              --image=base,${BASE_CONTAINER_IMAGE_NAME} \
              --backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG} \
              --backend=python:${PYTHON_BACKEND_REPO_TAG}

build error info:

-- The CXX compiler identification is GNU 11.4.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- NVTX is disabled
-- Importing batch manager
-- Building PyTorch
-- Building Google tests
-- Building benchmarks
-- Looking for a CUDA compiler
-- Looking for a CUDA compiler - /usr/local/cuda/bin/nvcc
-- CUDA compiler: /usr/local/cuda/bin/nvcc
-- GPU architectures: 70-real;80-real;86-real;89-real;90-real
-- The C compiler identification is GNU 11.4.0
-- The CUDA compiler identification is NVIDIA 12.3.107
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.3.107") 
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- CUDA library status:
--     version: 12.3.107
--     libraries: /usr/local/cuda/lib64
--     include path: /usr/local/cuda/targets/x86_64-linux/include
-- ========================= Importing and creating target nvinfer ==========================
-- Looking for library nvinfer
-- Library that was found /usr/lib/x86_64-linux-gnu/libnvinfer.so
-- ==========================================================================================
-- CUDAToolkit_VERSION 12.3 is greater or equal than 11.0, enable -DENABLE_BF16 flag
-- CUDAToolkit_VERSION 12.3 is greater or equal than 11.8, enable -DENABLE_FP8 flag
-- Found MPI_C: /opt/hpcx/ompi/lib/libmpi.so (found version "3.1") 
-- Found MPI_CXX: /opt/hpcx/ompi/lib/libmpi.so (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- COMMON_HEADER_DIRS: /tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp;/usr/local/cuda/include
-- Found Python3: /usr/bin/python3.10 (found version "3.10.12") found components: Interpreter Development Development.Module Development.Embed 
-- USE_CXX11_ABI is set by python Torch to 0
-- TORCH_CUDA_ARCH_LIST: 7.0;8.0;8.6;8.9;9.0
-- Found Python executable at /usr/bin/python3.10
-- Found Python libraries at /usr/lib/x86_64-linux-gnu
-- Found CUDA: /usr/local/cuda (found version "12.3") 
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.3.107") 
-- Caffe2: CUDA detected: 12.3
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 12.3
-- /usr/local/cuda-12.3/targets/x86_64-linux/lib/libnvrtc.so shorthash is e150bf88
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support
-- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_90,code=sm_90
CMake Warning at /usr/local/lib/python3.10/dist-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
  static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
  /usr/local/lib/python3.10/dist-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
  CMakeLists.txt:337 (find_package)


-- Found Torch: /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so  
-- TORCH_CXX_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=0
CMake Error at CMakeLists.txt:362 (file):
  file STRINGS file "/usr/local/tensorrt/include/NvInferVersion.h" cannot be
  read.


CMake Error at CMakeLists.txt:365 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


CMake Error at CMakeLists.txt:367 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


[the pair of errors at CMakeLists.txt:365 and CMakeLists.txt:367 repeats three more times]


-- Building for TensorRT version: .., library version: 
-- Using MPI_C_INCLUDE_DIRS: /opt/hpcx/ompi/include;/opt/hpcx/ompi/include/openmpi;/opt/hpcx/ompi/include/openmpi/opal/mca/hwloc/hwloc201/hwloc/include;/opt/hpcx/ompi/include/openmpi/opal/mca/event/libevent2022/libevent;/opt/hpcx/ompi/include/openmpi/opal/mca/event/libevent2022/libevent/include
-- Using MPI_C_LIBRARIES: /opt/hpcx/ompi/lib/libmpi.so
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- Operating System: ubuntu, 22.04
CMake Error at tensorrt_llm/CMakeLists.txt:105 (message):
  The batch manager library is truncated or incomplete.  This is usually
  caused by using Git LFS (Large File Storage) incorrectly.  Please try
  running command `git lfs install && git lfs pull`.


-- Configuring incomplete, errors occurred!
Traceback (most recent call last):
  File "/tmp/tritonbuild/tensorrtllm/tensorrt_llm/scripts/build_wheel.py", line 306, in <module>
    main(**vars(args))
  File "/tmp/tritonbuild/tensorrtllm/tensorrt_llm/scripts/build_wheel.py", line 160, in main
    build_run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'cmake -DCMAKE_BUILD_TYPE="Release" -DBUILD_PYT="ON" -DBUILD_PYBIND="ON"  -DTRT_LIB_DIR=/usr/local/tensorrt/targets/x86_64-linux-gnu/lib -DTRT_INCLUDE_DIR=/usr/local/tensorrt/include  -S "/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp"' returned non-zero exit status 1.
CMake Error at CMakeLists.txt:314 (execute_process):
  execute_process failed command indexes:

    1: "Child return code: 1"



-- Configuring incomplete, errors occurred!
error: build failed
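
For reference, the error at tensorrt_llm/CMakeLists.txt:105 indicates that the batch manager static library was checked out as a Git LFS pointer stub rather than the real binary. A minimal sketch of how to verify and fix this inside the TensorRT-LLM checkout (the check is generic; no repo-specific paths are assumed):

# Fetch the real LFS objects; a pointer stub is a ~130-byte text file,
# while the actual library is far larger.
git lfs install
git lfs pull
# In the output, "*" marks objects whose content is present locally;
# "-" marks files that are still pointer stubs.
git lfs ls-files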

Triton Information
What version of Triton are you using? The build is based on nvcr.io/nvidia/tritonserver:23.12-py3-min.

Are you using the Triton container or did you build it yourself? Built from source with build.py, as shown above.

To Reproduce
Run the build command above.

Describe the models (framework, inputs, outputs): not applicable; the failure occurs at build time, before any models are involved.

Expected behavior
The build completes successfully.

nullxjx · Jan 26 '24

Hello, it is recommended to use the compatible tags for the backends for each release rather than main. As you can see from the Deep Learning Frameworks Support Matrix, the compatible version of the trtllm backend for the 23.12 release is v0.7.0:

BASE_CONTAINER_IMAGE_NAME=nvcr.io/nvidia/tritonserver:23.12-py3-min
TENSORRTLLM_BACKEND_REPO_TAG=v0.7.0
PYTHON_BACKEND_REPO_TAG=r23.12
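
With these variables set, the original build.py invocation can be re-run unchanged; a trimmed sketch of the command from the report:

# Same build script as above, now pinned to release-aligned tags.
sudo ./build.py -v --no-container-interactive --enable-logging --enable-stats \
              --enable-gpu --endpoint=http --endpoint=grpc \
              --image=base,${BASE_CONTAINER_IMAGE_NAME} \
              --backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG} \
              --backend=python:${PYTHON_BACKEND_REPO_TAG}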

The top-of-tree main is the bleeding edge of Triton and its backends. At times there may be build issues between the main repos that we are working to fix, so it is recommended to use more stable versions. That said, your command looks like it should work. cc @krishung5: have you seen this issue before?

jbkyang-nvi · Jan 27 '24

I remember seeing something similar when TensorRT was not correctly installed or its path was not set. The main branch also includes some new build-related changes that are not compatible with r23.12. As Katherine mentioned, the tags should be aligned based on the support matrix. Please update the tags and let us know if you are still seeing this issue.
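
A quick way to check for that TensorRT install/path problem, using the paths from the CMake invocation in the log (a diagnostic sketch, not an official build step):

# NvInferVersion.h is the file CMake failed to read at CMakeLists.txt:362;
# if it is missing, TensorRT is not installed at the expected prefix.
ls -l /usr/local/tensorrt/include/NvInferVersion.h
# The library directory passed via -DTRT_LIB_DIR should also exist:
ls /usr/local/tensorrt/targets/x86_64-linux-gnu/lib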

krishung5 · Jan 29 '24

Closing due to inactivity. Please re-open if you would like to follow up on this issue.

krishung5 · May 30 '24