No response is received during inference in decoupled mode.
Description
Branch: main
GPU: V100
Model type: GPTNeoX
Reproduced Steps
https://github.com/triton-inference-server/fastertransformer_backend/blob/main/docs/gptneox_guide.md#decoupled-mode
I ran tritonserver with my model after following this guide, changed request_output_len to 512, and sent a request (a rough sketch of the request is shown below).
After 130-140 tokens were generated, no further responses arrived, but the GPU was still holding memory.
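For reference, this is roughly what I send. It is only a minimal sketch using tritonclient's gRPC streaming API, assuming the usual fastertransformer_backend tensor names (input_ids, input_lengths, request_output_len) and a short example prompt; the actual request is produced by tools/issue_request.py, just with request_output_len raised to 512.

```python
# Rough sketch of the streaming request (tensor names/dtypes assumed from the
# fastertransformer_backend examples; the real request comes from issue_request.py).
import time
import numpy as np
import tritonclient.grpc as grpcclient

def callback(result, error):
    # In decoupled mode every generated chunk arrives as its own response.
    if error is not None:
        print("error:", error)
    else:
        print("partial output_ids:", result.as_numpy("output_ids"))

client = grpcclient.InferenceServerClient(url="localhost:8001")

input_ids = np.array([[5, 16162, 1079, 28]], dtype=np.uint32)    # example prompt tokens
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
request_output_len = np.array([[512]], dtype=np.uint32)          # raised to 512

inputs = []
for name, data in [("input_ids", input_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", request_output_len)]:
    tensor = grpcclient.InferInput(name, list(data.shape), "UINT32")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

client.start_stream(callback=callback)
client.async_stream_infer(model_name="fastertransformer", inputs=inputs, request_id="1")
time.sleep(60)        # crude wait for the streamed responses in this sketch
client.stop_stream()
```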
I set FT_LOG_LEVEL to DEBUG and ran the test again.
No error logs appeared while tritonserver was starting.
The following logs were produced while the response was being generated (note the "response is nullptr" warning at the end):
[FT][DEBUG] bool fastertransformer::TensorMap::isExist(const string&) const for key: ia3_tasks
[FT][DEBUG] T* fastertransformer::Tensor::getPtr() const [with T = const int] start
[FT][DEBUG] void fastertransformer::cublasMMWrapper::Gemm(cublasOperation_t, cublasOperation_t, int, int, int, const void*, int, const void*, int, void*, int)
[FT][DEBUG] void fastertransformer::cublasMMWrapper::Gemm(cublasOperation_t, cublasOperation_t, int, int, int, const void*, int, const void*, int, void*, int, float, float)
[FT][DEBUG] T* fastertransformer::Tensor::getPtr() const [with T = __half] start
[FT][DEBUG] void fastertransformer::cublasMMWrapper::Gemm(cublasOperation_t, cublasOperation_t, int, int, int, const void*, const void*, cudaDataType_t, int, const void*, cudaDataType_t, int, const void*, void*, cudaDataType_t, int, cudaDataType_t, cublasGemmAlgo_t)
[FT][DEBUG] static std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char>, triton::Tensor> > GptNeoXTritonModelInstance<T>::convert_outputs(const std::unordered_map<std::__cxx11::basic_string<char>, fastertransformer::Tensor>&) [with T = __half]
W0926 04:57:16.322694 353779 libfastertransformer.cc:1182] response is nullptr
This is the result of running issue_request.py; note that the tail of the output_ids is padded with zeros:
[After 8.39s] Partial result (probability 1.00e+00):
[ 5 16162 1079 28 201 89910 41589 3222 33368 2884
4599 0 201 5 10933 16350 28 201 89910 769
10 28080 18414 253 72990 14 223 379 409 813
978 1000 223 392 662 1752 979 100805 93716 475
1779 11 529 6767 862 4640 67458 2814 79524 3671
6767 635 1397 7175 22735 16192 2923 36295 1276 16
565 720 1582 3613 28922 16192 2923 12317 223 379
455 5052 22109 4683 21622 487 45312 505 65355 29250
2047 7175 885 48088 17994 2450 713 637 26029 26914
476 31803 25949 16 33012 6902 16867 10587 2217 1301
26977 654 32299 873 16 3561 6093 529 1997 14498
94855 6814 1071 26776 4173 3191 16 223 386 12586
1031 36099 3009 639 554 885 1183 12495 1694 11948
14 223 391 815 931 588 223 379 456 6750
223 379 461 813 792 813 990 813 600 366
26 5052 1445 936 17791 2490 15822 7055 1789 87914
1545 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0]
Process Process-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/workspace/fastertransformer_backend/tools/issue_request.py", line 112, in stream_consumer
result = queue.get()
File "/opt/conda/lib/python3.10/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
TypeError: InferenceServerException.__init__() missing 1 required positional argument: 'msg'
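The traceback itself looks like a secondary problem: the InferenceServerException read back out of the multiprocessing queue apparently cannot be unpickled in the consumer process. A hypothetical workaround sketch (the function names here do not match tools/issue_request.py) is to put only plain, picklable values on the queue:

```python
# Hypothetical sketch: enqueue plain, picklable values instead of the raw
# InferenceServerException so the consumer process can always unpickle them.
def producer_callback(queue, result, error):
    if error is not None:
        queue.put(("error", str(error)))      # a plain string always survives pickling
    else:
        queue.put(("ok", result.as_numpy("output_ids")))

def consumer(queue):
    while True:
        item = queue.get()
        if item is None:                      # sentinel sent when the stream ends
            break
        kind, payload = item
        if kind == "error":
            print("stream error:", payload)
        else:
            print("partial output_ids:", payload)
```

Even with the exception made readable, the underlying problem remains that the server stops sending responses after about 130-140 tokens.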
Please review this issue, @byshiue.