fastertransformer_backend
Triton server crashes after reloading the same model
Description
Host: Linux amd64
GPU: RTX 3060
Container version: 22.12
GPT model converted from Megatron (model files and configs are from the GPT guide)
Dockerfile:
----
ARG TRITON_SERVER_VERSION
FROM nvcr.io/nvidia/tritonserver:${TRITON_SERVER_VERSION}-py3
ENV TRITON_SERVER_USER=triton-server
RUN if ! id -u $TRITON_SERVER_USER > /dev/null 2>&1 ; then useradd $TRITON_SERVER_USER; fi && [ `id -u $TRITON_SERVER_USER` -eq 1000 ] && [ `id -g $TRITON_SERVER_USER` -eq 1000 ]
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends autoconf rapidjson-dev libz-dev libgomp1 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
RUN pip3 install cmake==3.24.3
# backend build
RUN mkdir -p /workspace/build
WORKDIR /workspace/build
RUN git clone https://github.com/triton-inference-server/fastertransformer_backend
RUN mkdir -p /workspace/build/fastertransformer_backend/build
WORKDIR /workspace/build/fastertransformer_backend/build
ARG FORCE_BACKEND_REBUILD=0
RUN cmake \
      -D CMAKE_EXPORT_COMPILE_COMMANDS=1 \
      -D CMAKE_BUILD_TYPE=Release \
      -D CMAKE_INSTALL_PREFIX=/opt/tritonserver \
      -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      ..
RUN make -j"$(grep -c ^processor /proc/cpuinfo)" install
WORKDIR /opt/tritonserver
RUN chown $TRITON_SERVER_USER:$TRITON_SERVER_USER /opt/tritonserver/backends/fastertransformer
RUN rm -rf /workspace/*
ENV NCCL_LAUNCH_MODE=GROUP
USER $TRITON_SERVER_USER
----
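For reference, the image used in step 1 below (tritonserver_with_ft:22.12) can be built from this Dockerfile with something like:
----
docker build --build-arg TRITON_SERVER_VERSION=22.12 -t tritonserver_with_ft:22.12 .
----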
Reproduction Steps
1. docker run --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 -it --rm --shm-size=4g --ulimit memlock=-1 -v /model_repository:/model-repository tritonserver_with_ft:22.12 /bin/bash
2. tritonserver --model-store=/model-repository/gpt --model-control-mode=explicit --allow-http=true --strict-model-config=false
3. curl -X POST localhost:8000/v2/repository/models/gpt/load
At this point, inference requests run fine.
4. curl -X POST localhost:8000/v2/repository/models/gpt/unload
5. curl -X POST localhost:8000/v2/repository/models/fastertransformer/unload
6. curl -X POST localhost:8000/v2/repository/models/preprocessing/unload
7. curl -X POST localhost:8000/v2/repository/models/postprocessing/unload
All models unloaded fine:
I0223 17:19:18.565427 377 model_lifecycle.cc:579] successfully unloaded 'gpt' version 1
I0223 17:19:28.384488 377 libfastertransformer.cc:1965] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0223 17:19:28.384682 377 libfastertransformer.cc:1899] TRITONBACKEND_ModelFinalize: delete model state
I0223 17:19:28.384747 377 libfastertransformer.cc:1904] TRITONBACKEND_ModelFinalize: MPI Finalize
I0223 17:19:28.437534 377 model_lifecycle.cc:579] successfully unloaded 'fastertransformer' version 1
Cleaning up...
I0223 17:20:07.072204 377 model_lifecycle.cc:579] successfully unloaded 'preprocessing' version 1
Cleaning up...
I0223 17:20:18.908013 377 model_lifecycle.cc:579] successfully unloaded 'postprocessing' version 1
8. curl -X POST localhost:8000/v2/repository/models/gpt/load
and Triton crashed with the following output:
[a5fd4e0ba10d:652 :0:659] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid: 659) ====
0 0x0000000000014420 __funlockfile() ???:0
1 0x000000000007035b ompi_comm_size() /build-result/src/hpcx-v2.13-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-5abd86cc8c5d75c5fe7894b379515d97839c1416/ompi/mpi/c/profile/../../../../ompi/communicator/communicator.h:360
2 0x000000000007035b PMPI_Comm_size() /build-result/src/hpcx-v2.13-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-5abd86cc8c5d75c5fe7894b379515d97839c1416/ompi/mpi/c/profile/pcomm_size.c:63
3 0x0000000000c0a1a1 fastertransformer::mpi::getCommWorldSize() ???:0
4 0x000000000001f8f4 triton::backend::fastertransformer_backend::ModelState::ModelState() :0
5 0x000000000002d784 triton::backend::fastertransformer_backend::ModelState::Create() :0
6 0x000000000002ddfd TRITONBACKEND_ModelInitialize() ???:0
7 0x000000000010689b triton::core::TritonModel::Create() :0
8 0x00000000001c4f5d triton::core::ModelLifeCycle::CreateModel() :0
9 0x00000000001caccd std::_Function_handler<void (), triton::core::ModelLifeCycle::AsyncLoad(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, inference::ModelConfig const&, bool, std::shared_ptr<triton::core::TritonRepoAgentModelList> const&, std::function<void (triton::core::Status)>&&)::{lambda()#1}>::_M_invoke() model_lifecycle.cc:0
10 0x00000000003083a0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run() thread_pool.cc:0
11 0x00000000000d6de4 std::error_code::default_error_condition() ???:0
12 0x0000000000008609 start_thread() ???:0
13 0x000000000011f133 clone() ???:0
=================================
Signal (11) received.
0# 0x0000559A80FEC459 in tritonserver
1# 0x00007FE8D0163420 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
2# MPI_Comm_size in /opt/hpcx/ompi/lib/libmpi.so.40
3# fastertransformer::mpi::getCommWorldSize() in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
4# 0x00007FE8C40318F4 in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
5# 0x00007FE8C403F784 in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
6# TRITONBACKEND_ModelInitialize in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
7# 0x00007FE8CF29689B in /opt/tritonserver/bin/../lib/libtritonserver.so
8# 0x00007FE8CF354F5D in /opt/tritonserver/bin/../lib/libtritonserver.so
9# 0x00007FE8CF35ACCD in /opt/tritonserver/bin/../lib/libtritonserver.so
10# 0x00007FE8CF4983A0 in /opt/tritonserver/bin/../lib/libtritonserver.so
11# 0x00007FE8CEDDDDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
12# 0x00007FE8D0157609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
13# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
Segmentation fault (core dumped)
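For convenience, steps 3-8 can be replayed with a small script. This is a minimal sketch, assuming the server from step 2 is still running and reachable on localhost:8000:
----
#!/bin/bash
# Replays the load/unload/reload cycle from steps 3-8.
# Assumes tritonserver is running with --model-control-mode=explicit (step 2).
set -x
curl -X POST localhost:8000/v2/repository/models/gpt/load
# (inference requests work at this point)
curl -X POST localhost:8000/v2/repository/models/gpt/unload
curl -X POST localhost:8000/v2/repository/models/fastertransformer/unload
curl -X POST localhost:8000/v2/repository/models/preprocessing/unload
curl -X POST localhost:8000/v2/repository/models/postprocessing/unload
# This second load crashes the server with the segfault shown above.
curl -X POST localhost:8000/v2/repository/models/gpt/load
----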
Because FT uses MPI to support multi-node inference, FT does not currently support unloading and then reloading a model. We are working on it. Thank you for the feedback.
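The log and backtrace above are consistent with this: the unload log shows TRITONBACKEND_ModelFinalize calling MPI Finalize, and the reload then goes through fastertransformer::mpi::getCommWorldSize(), which invokes MPI_Comm_size after MPI has already been finalized. Below is a minimal standalone sketch of that failure mode in plain MPI C (an illustration, not the backend's actual code):
----
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int size = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* fine while MPI is initialized */
    printf("world size: %d\n", size);

    MPI_Finalize();  /* analogous to TRITONBACKEND_ModelFinalize on unload */

    /* Calling MPI_Comm_size after MPI_Finalize is erroneous per the MPI
       standard: depending on the MPI build, it either aborts with an error
       message or, as in the backtrace above, dereferences freed
       communicator state and segfaults. */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    return 0;
}
----
Built with mpicc and run with mpirun -np 1, the second MPI_Comm_size call fails the same way the reloaded model does, which is why unload followed by load cannot work until the MPI lifetime is decoupled from the model lifetime.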
Hi, is there any expected timeline for this to be resolved? Thank you all for the hard work.