Server crashes when traffic is slightly high
Description
main branch, V100
Deployed Docker pods crash and restart every few minutes. They seem stable when QPS is low.
Below is the error log from just before a pod crashes, which I get with: kubectl logs <pod-name> --all-containers -p
0# 0x000055F47681BC19 in tritonserver
1# 0x00007F17168FE090 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
3# 0x00007F1716CB7911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
4# 0x00007F1716CC338C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
5# 0x00007F1716CC33F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
6# 0x00007F1716CC36A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
7# void fastertransformer::check<cudaError>(cudaError, char const*, char const*, int) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
8# BertTritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
9# 0x00007F170C34EE0A in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
10# 0x00007F1716CEFDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
11# 0x00007F1717F04609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
12# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
I also sometimes observe the error response below from the Triton client:
[StatusCode.INTERNAL] pinned buffer: failed to perform CUDA copy: invalid argument
The client error may be related to the server pod crash.
Thanks so much if this could get fixed.
Reproduced Steps
git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.07
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
docker build --rm \
  --build-arg TRITON_VERSION=${CONTAINER_VERSION} \
  -t ${TRITON_DOCKER_IMAGE} \
  -f docker/Dockerfile \
  .
docker tag ${TRITON_DOCKER_IMAGE} <github_or_gitlab/repo_name/image_name>:${CONTAINER_VERSION}
docker push <github_or_gitlab/repo_name/image_name>:${CONTAINER_VERSION}
Upload the FT model folders and deploy the Triton services to Docker, then test using the Triton client API.
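For reference, one quick way to confirm the deployment is actually serving before driving traffic at it is to hit Triton's standard HTTP health and metrics endpoints. The ports below are the Triton defaults and <model_name> is a placeholder; adjust them if the k8s service maps them differently:
# Server readiness (HTTP endpoint, default port 8000)
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
# Readiness of the specific model
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/<model_name>/ready
# Inference counters on the metrics endpoint (default port 8002)
curl -s localhost:8002/metrics | grep nv_inference_request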
BTW, my model is BERT, any hints?
@rahuan can you try running it with the latest Triton server (rebuild the image if you are not using the latest one) and enable verbose logging with --log-verbose 1?
Also, try it without k8s first to see if it is reproducible.
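As a rough sketch of what that could look like (the model repository path is a placeholder, and the image tag is assumed to be the one built in the steps above):
# Run the image directly on the V100 box, outside k8s, with verbose logging enabled.
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  triton_with_ft:22.07 \
  tritonserver --model-repository=/models --log-verbose=1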
I built the image using the latest code of fastertransformer_backend as of Nov. 24th, and I ran with --log-verbose=2. Let me sync the latest code now and rebuild.
I just synced the latest code of fastertransformer_backend; now it fails even faster, at a very low QPS. Below are the errors:
I1212 06:38:03.990948 1 libfastertransformer.cc:1022] get total batch_size = 1
I1212 06:38:03.990967 1 libfastertransformer.cc:1433] get input count = 2
I1212 06:38:03.991087 1 libfastertransformer.cc:1672] collect name: input_hidden_state size: 353280 bytes
I1212 06:38:03.991104 1 libfastertransformer.cc:1672] collect name: sequence_lengths size: 40 bytes
I1212 06:38:03.991113 1 libfastertransformer.cc:1683] the data is in CPU
I1212 06:38:03.991121 1 libfastertransformer.cc:1690] the data is in CPU
I1212 06:38:03.991145 1 libfastertransformer.cc:1380] before ThreadForward 0
I1212 06:38:03.991202 1 libfastertransformer.cc:1388] after ThreadForward 0
I1212 06:38:03.991222 1 libfastertransformer.cc:1226] Start to forward
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] Assertion fail: /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/layers/attention_layers/FusedAttentionLayer.cu:178
Signal (6) received.
0# 0x000055C30FAAFC19 in tritonserver
1# 0x00007FB022A0E090 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
4# 0x00007FB022DC7911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
5# 0x00007FB022DD338C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
6# 0x00007FB022DD33F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
7# 0x00007FB022DD36A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
8# fastertransformer::myAssert(bool, char const*, int, std::__cxx11::basic_string<char, std::char_traits
What seq length are you using?
Batch size is 10 or 20; the seq length differs for each sentence in a batch. The average is about 50~60, but the max seq length is limited to 128.
Thanks. Can you also share the head size you are using? That will be helpful for us to reproduce.
The model settings are the same as bert-base-chinese: layer num is 12, head num is 12, and hidden size is 768 = 64*12, so the head size is 64. Thanks! BTW, the data_type is fp16 and is_remove_padding is set to 1.
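If it helps with reproduction, a load test along the following lines should approximate that traffic pattern. This is only a sketch: the model name, server address, and per-item shapes are assumptions pieced together from the settings above and the earlier verbose log (input_hidden_state as [seq_len, hidden_size], sequence_lengths as [1]), so adjust them to the actual config.pbtxt.
# Hypothetical load test, run from the Triton SDK container (which ships perf_analyzer).
# Batch 10, max seq len 128, hidden size 768, gRPC endpoint on the default port 8001.
perf_analyzer -m <model_name> -u <server_host>:8001 -i grpc \
  -b 10 \
  --shape input_hidden_state:128,768 \
  --shape sequence_lengths:1 \
  --concurrency-range 1:8
Note that perf_analyzer fills inputs with random values by default, so realistic sequence_lengths should be supplied via --input-data with a JSON file; the point of the sketch is just to sustain concurrent traffic against the pod.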
@PerkzZheng, may I ask whether there are any findings on this issue?
@rahuan sorry for the late response.
You can print out the values checked by the assertion just before that line in FusedAttentionLayer.cu. You might have changed how s (the sequence length) is given, and we cannot find the corresponding fused attention kernel.