tensorrtllm_backend
Mllama example does not run properly for v0.15 when using the `tensorrt_llm_bls` endpoint
When following the steps highlighted in the examples for mllama, we run into two issues:
- The `cross_kv_cache_fraction` parameter is expected to be set in `tensorrt_llm/config.pbtxt`, whereas it is not set at all in the examples, so following the examples as written fails. You can set it manually to something like `0.5` to get past this issue (see the config snippet after this list).
- When actually sending the example curl request, but replacing `ensemble` with `tensorrt_llm_bls`, we end up with the following error:

```
Traceback (most recent call last):
  File "/models/tensorrt_llm_bls/1/model.py", line 108, in execute
    for res in res_gen:
  File "/models/tensorrt_llm_bls/1/lib/decode.py", line 223, in decode
    gen_response = self._generate_non_streaming(
  File "/models/tensorrt_llm_bls/1/lib/triton_decoder.py", line 350, in _generate_non_streaming
    r = self._exec_triton_request_single(triton_req)
  File "/models/tensorrt_llm_bls/1/lib/triton_decoder.py", line 149, in _exec_triton_request_single
    raise pb_utils.TritonModelException(responses.error().message())
c_python_backend_utils.TritonModelException: Executor failed process requestId 5 due to the following error: Encountered an error in forwardAsync function: GenericLlmRequest::getEncoderInputLen - Do not have encoder length! (/workspace/tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:580)
1  0x78deb6f675e6 tensorrt_llm::batch_manager::GenericLlmRequest<std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::shared_ptr<tensorrt_llm::runtime::CudaStream> >::getEncoderInputLen() const + 246
2  0x78deb6f87d98 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::getRemainingBlocksToCompletion(tensorrt_llm::batch_manager::LlmRequest const&) const + 312
3  0x78deb6f51172 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x2b50172) [0x78deb6f51172]
4  0x78deb6f5152f tensorrt_llm::batch_manager::GuaranteedNoEvictScheduler::operator()(tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::BasePeftCacheManager const>, std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) const + 47
5  0x78deb6f5259f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x2b5159f) [0x78deb6f5259f]
6  0x78deb6f4dfa1 tensorrt_llm::batch_manager::CapacityScheduler::operator()(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::BasePeftCacheManager const>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const>) const + 97
7  0x78deb6fe32f9 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 649
8  0x78deb7021297 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 455
9  0x78deb7027755 tensorrt_llm::executor::Executor::Impl::executionLoop() + 1365
10 0x78dfa8308253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x78dfa8308253]
11 0x78dfa7e6bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x78dfa7e6bac3]
12 0x78dfa7efca04 clone + 68
```
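For reference, this is roughly the block we added to `tensorrt_llm/config.pbtxt` to get past the first issue. It uses the standard Triton `parameters` text-proto format; the `0.5` value is just the fraction we picked to unblock ourselves, not a recommended setting:

```
# As we understand it, this sets the fraction of the KV-cache memory pool
# reserved for the cross-attention KV cache of the encoder-decoder model.
# The value 0.5 is only the placeholder we chose to get past the error.
parameters: {
  key: "cross_kv_cache_fraction"
  value: {
    string_value: "0.5"
  }
}
```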
What is this encoder error: `Encountered an error in forwardAsync function: GenericLlmRequest::getEncoderInputLen - Do not have encoder length!`? We set the encoder input length when building the TRT-LLM engine with the following flag: `--max_encoder_input_len 8200` ... Is there another parameter we have to populate when sending requests with the `tensorrt_llm_bls` endpoint?
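In case it helps with debugging: one way to check which inputs the BLS model actually declares (and whether some encoder-length field is among them) is Triton's standard model-config route. This is just a generic KServe v2 query, assuming the default HTTP port 8000 and `jq` for readability, not something from the mllama example itself:

```
# List the input tensors declared by the tensorrt_llm_bls model, so they can
# be compared against the fields sent in the generate request.
curl -s localhost:8000/v2/models/tensorrt_llm_bls/config | jq -r '.input[].name'
```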