fastertransformer_backend
Not getting response with warning "response is nullptr"
Description
The problem: with dynamic batching enabled, Triton Inference Server sometimes fails to respond properly, logs "response is nullptr" several times, and sometimes crashes.
The model is a fairly standard BERT model, downloaded from here and then converted with the huggingface_bert_convert.py script. It is deployed on Triton Inference Server with the FasterTransformer backend enabled.
In config.pbtxt, the input sequence length is fixed at 384 and the hidden-state dimension at 768 (matching the BERT model config).
perf_analyzer is used for benchmarking with custom-generated data. The issue occurs once dynamic_batching {} is added to config.pbtxt and concurrency >= 3 (with some randomness).
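For reference, dynamic batching is enabled by the stanza below in config.pbtxt. The empty block used in this report accepts Triton's defaults; standard Triton fields such as preferred_batch_size and max_queue_delay_microseconds (not used here) could tune it:

dynamic_batching {
  # empty block: enable dynamic batching with default settings, as in this report
  # optional tuning, e.g.:
  #   preferred_batch_size: [ 4, 8 ]
  #   max_queue_delay_microseconds: 100
}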
Another observation: the server may crash if is_remove_padding equals 1, with the following output:
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: an illegal memory access was encountered /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/triton_backend/bert/BertTritonModelInstance.cc:106
Signal (6) received.
0# 0x000055D1DC9BF459 in /opt/tritonserver/bin/tritonserver
1# 0x00007F816A83A090 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
4# 0x00007F816ABF3911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
5# 0x00007F816ABFF38C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
6# 0x00007F816ABFF3F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
7# 0x00007F816ABFF6A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
8# void fastertransformer::check<cudaError>(cudaError, char const*, char const*, int) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
9# BertTritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
10# 0x00007F8160043BA2 in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
11# 0x00007F816AC2BDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
12# 0x00007F816BFA3609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
13# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
If is_remove_padding equals 0, the server does not crash but prints multiple lines of "response is nullptr", and perf_analyzer logs the following output before it quits:
Request concurrency: 5
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [1] had error: pinned buffer: failed to perform CUDA copy: invalid argument
Thread [2] had error: pinned buffer: failed to perform CUDA copy: invalid argument
Thread [4] had error: pinned buffer: failed to perform CUDA copy: invalid argument
System info
- GPU: GeForce RTX 2080Ti
- GPU Driver: 525.60.11
- Docker: 19.03.13
- nvidia-container-runtime:
  - runc version: 1.0.0-rc6+dev
  - commit: 96ec2177ae841256168fcf76954f7177af9446eb
  - spec: 1.0.1-dev
- nvidia-container-cli:
  - version: 1.3.0
  - build date: 2020-09-16T12:35+0000
  - build revision: 16315ebdf4b9728e899f615e208b50c41d7a5d15
- branch: main
- commit: 69a397e2b0d9c74d841bd00d6ccdb6410da6316f
- docker base image: nvcr.io/nvidia/tritonserver:22.12-py3
Reproduced Steps
To prepare the model, download the BERT model and convert it with the following commands:
# assume the model was downloaded to /workspace/models/
# rename bert_config.json to config.json
mv /workspace/models/{bert_,}config.json
# inject a "bert_type":"bert" entry before the closing brace of config.json
sed -i 's/}/ "bert_type":"bert"\n}/' /workspace/models/config.json
python3 /workspace/FasterTransformer/examples/pytorch/bert/utils/huggingface_bert_convert.py \
    -infer_tensor_para_size 1 \
    -in_file /workspace/models/ \
    -saved_dir /workspace/models_out
mkdir -p triton-model-store/bert-fp32/
mv /workspace/models_out triton-model-store/bert-fp32/1
cat <<EOF > triton-model-store/bert-fp32/config.pbtxt
name: "bert-fp32"
backend: "fastertransformer"
default_model_filename: "bert"
max_batch_size: 1024
input [
  {
    name: "input_hidden_state"
    data_type: TYPE_FP32
    dims: [ 384, 768 ]
  },
  {
    name: "sequence_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  }
]
output [
  {
    name: "output_hidden_state"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "data_type"
  value: {
    string_value: "fp32"
  }
}
parameters {
  key: "enable_custom_all_reduce"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "model_type"
  value: {
    string_value: "bert"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "triton-model-store/bert-fp32/1/1-gpu/"
  }
}
parameters {
  key: "int8_mode"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "is_sparse"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "is_remove_padding"
  value: {
    string_value: "0"
  }
}
dynamic_batching {
}
EOF
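At this point the model repository should look roughly like this (sketch inferred from the commands above; the 1-gpu subdirectory is produced by the conversion script):

triton-model-store/
└── bert-fp32/
    ├── config.pbtxt
    └── 1/
        └── 1-gpu/   # converted FasterTransformer weights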
Build the Docker image:
git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.12
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
docker build --rm \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION} \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .
Run the server:
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus=all --shm-size=4G -v $(pwd):/ft_workspace --network host --name nv_ft triton_with_ft:22.12 bash
cd /ft_workspace
CUDA_VISIBLE_DEVICES=0 /opt/tritonserver/bin/tritonserver --model-repository=/ft_workspace/triton-model-store
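Optionally, confirm the server came up before benchmarking (assumes Triton's default HTTP port 8000):

curl -v localhost:8000/v2/health/ready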
This Python script was used to generate random test data:
import numpy as np

# raw binary tensor for the "input_hidden_state" input (use np.float16 for an fp16 model)
data = np.random.random([384, 768]).astype(np.float32)
data.tofile('input_hidden_state')

# raw binary tensor for the "sequence_lengths" input
L = np.asarray([384]).astype(np.int32)
L.tofile('sequence_lengths')
The generated data was put in the mockdata folder; when --input-data points at a directory, perf_analyzer expects one raw binary file per input, named after the input. Then run perf_analyzer:
/path/to/perf_analyzer -m bert-fp32 --input-data ./mockdata/ --concurrency-range 1:10
Can you reproduce this issue with a Python client example? perf_analyzer may not really be using the data you generate.
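For reference, a minimal sketch of such a Python client, assuming tritonclient is installed (pip install tritonclient[http]) and the server listens on the default HTTP port 8000; it issues several concurrent requests to roughly mimic perf_analyzer at concurrency 5:

import numpy as np
import tritonclient.http as httpclient
from concurrent.futures import ThreadPoolExecutor

def infer_once(_):
    # one client per request for simplicity; names and shapes match config.pbtxt above
    client = httpclient.InferenceServerClient(url="localhost:8000")
    hidden = np.random.random([1, 384, 768]).astype(np.float32)  # [batch, seq, hidden]
    seq_len = np.asarray([[384]], dtype=np.int32)                # [batch, 1]

    inputs = [
        httpclient.InferInput("input_hidden_state", list(hidden.shape), "FP32"),
        httpclient.InferInput("sequence_lengths", list(seq_len.shape), "INT32"),
    ]
    inputs[0].set_data_from_numpy(hidden)
    inputs[1].set_data_from_numpy(seq_len)

    result = client.infer("bert-fp32", inputs)
    return result.as_numpy("output_hidden_state").shape

# 5 concurrent workers, 20 requests total
with ThreadPoolExecutor(max_workers=5) as pool:
    for shape in pool.map(infer_once, range(20)):
        print(shape)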