Server crashes when traffic is slightly high
Description
main branch, V100
Deployed Docker pods crash and restart every few minutes. They seem stable when QPS is low.
Below is the error log from just before a pod crashes, which I get with: kubectl logs <pod-name> --all-containers -p
0# 0x000055F47681BC19 in tritonserver
1# 0x00007F17168FE090 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
3# 0x00007F1716CB7911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
4# 0x00007F1716CC338C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
5# 0x00007F1716CC33F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
6# 0x00007F1716CC36A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
7# void fastertransformer::check<cudaError>(cudaError, char const*, char const*, int) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
8# BertTritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
9# 0x00007F170C34EE0A in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
10# 0x00007F1716CEFDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
11# 0x00007F1717F04609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
12# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
I also sometimes observe the error response below from the Triton client:
[StatusCode.INTERNAL] pinned buffer: failed to perform CUDA copy: invalid argument
The client error may be related to the server pod crash.
Thanks so much if this could get fixed.
Reproduced Steps
git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.07
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
docker build --rm \
  --build-arg TRITON_VERSION=${CONTAINER_VERSION} \
  -t ${TRITON_DOCKER_IMAGE} \
  -f docker/Dockerfile \
  .
docker tag ${TRITON_DOCKER_IMAGE} <github_or_gitlab/repo_name/image_name>:${CONTAINER_VERSION}
docker push <github_or_gitlab/repo_name/image_name>:${CONTAINER_VERSION}
Upload the FT model folders and deploy the Triton services to Docker, then test using the Triton client API.
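For reference, one quick way to confirm the deployment is actually serving before driving traffic at it is to hit Triton's standard HTTP health and metrics endpoints. The ports below are the Triton defaults and <model_name> is a placeholder; adjust them if the k8s service maps them differently:
# Server readiness (HTTP endpoint, default port 8000)
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
# Readiness of the specific model
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/<model_name>/ready
# Inference counters on the metrics endpoint (default port 8002)
curl -s localhost:8002/metrics | grep nv_inference_request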
BTW, my model is BERT, any hints?
@rahuan can you try running it with the latest Triton server (rebuild the image if you are not using the latest one) and enable verbose logging with --log-verbose 1?
Also, try it without k8s first to see if it is reproducible.
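As a rough sketch of what that could look like (the model repository path is a placeholder, and the image tag is assumed to be the one built in the steps above):
# Run the image directly on the V100 box, outside k8s, with verbose logging enabled.
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  triton_with_ft:22.07 \
  tritonserver --model-repository=/models --log-verbose=1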
I built the image using the latest code of fastertransformer_backend as of Nov. 24th, and I ran with --log-verbose=2. Let me sync the latest code now and rebuild.
I just synced the latest code of fastertransformer_backend; now it fails even faster, at a very low QPS. Below are the errors:
I1212 06:38:03.990948 1 libfastertransformer.cc:1022] get total batch_size = 1
I1212 06:38:03.990967 1 libfastertransformer.cc:1433] get input count = 2
I1212 06:38:03.991087 1 libfastertransformer.cc:1672] collect name: input_hidden_state size: 353280 bytes
I1212 06:38:03.991104 1 libfastertransformer.cc:1672] collect name: sequence_lengths size: 40 bytes
I1212 06:38:03.991113 1 libfastertransformer.cc:1683] the data is in CPU
I1212 06:38:03.991121 1 libfastertransformer.cc:1690] the data is in CPU
I1212 06:38:03.991145 1 libfastertransformer.cc:1380] before ThreadForward 0
I1212 06:38:03.991202 1 libfastertransformer.cc:1388] after ThreadForward 0
I1212 06:38:03.991222 1 libfastertransformer.cc:1226] Start to forward
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] Assertion fail: /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/layers/attention_layers/FusedAttentionLayer.cu:178
Signal (6) received.
0# 0x000055C30FAAFC19 in tritonserver
1# 0x00007FB022A0E090 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
4# 0x00007FB022DC7911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
5# 0x00007FB022DD338C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
6# 0x00007FB022DD33F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
7# 0x00007FB022DD36A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
8# fastertransformer::myAssert(bool, char const*, int, std::__cxx11::basic_string<char, std::char_traits
What seq length are you using?
Batch size is 10 or 20; the seq length differs for each sentence in a batch. The average is about 50~60, but the max seq length is limited to 128.
Thanks. Can you also share the head size you are using? That will be helpful for us to reproduce.
The model settings are the same as bert-base-chinese: layer num is 12, head num is 12, and hidden size is 768 = 64*12, so the head size is 64. Thanks! BTW, the data_type is fp16 and is_remove_padding is set to 1.
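If it helps with reproduction, a load test along the following lines should approximate that traffic pattern. This is only a sketch: the model name, server address, and per-item shapes are assumptions pieced together from the settings above and the earlier verbose log (input_hidden_state as [seq_len, hidden_size], sequence_lengths as [1]), so adjust them to the actual config.pbtxt.
# Hypothetical load test, run from the Triton SDK container (which ships perf_analyzer).
# Batch 10, max seq len 128, hidden size 768, gRPC endpoint on the default port 8001.
perf_analyzer -m <model_name> -u <server_host>:8001 -i grpc \
  -b 10 \
  --shape input_hidden_state:128,768 \
  --shape sequence_lengths:1 \
  --concurrency-range 1:8
Note that perf_analyzer fills inputs with random values by default, so realistic sequence_lengths should be supplied via --input-data with a JSON file; the point of the sketch is just to sustain concurrent traffic against the pod.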
@PerkzZheng, may I ask whether there are any findings on this issue?
@rahuan sorry for the late response.
You can print out the values checked by the assertion just before that line in FusedAttentionLayer.cu. You might have changed how s (the sequence length) is given, and we cannot find the corresponding fused attention kernel.