tensorrtllm_backend
Mllama example does not run properly for v0.15 when using the `tensorrt_llm_bls` endpoint
When following the steps highlighted in the examples for mllama, we run into two issues:
- The `cross_kv_cache_fraction` parameter is expected to be set in `tensorrt_llm/config.pbtxt`, whereas it is not set at all in the examples, so following the examples as written fails. You can set it manually to something like `0.5` to get past this issue (see the config snippet after this list).
- When actually sending the example curl request, but replacing `ensemble` with `tensorrt_llm_bls`, we end up with the following error:

```
Traceback (most recent call last):
  File "/models/tensorrt_llm_bls/1/model.py", line 108, in execute
    for res in res_gen:
  File "/models/tensorrt_llm_bls/1/lib/decode.py", line 223, in decode
    gen_response = self._generate_non_streaming(
  File "/models/tensorrt_llm_bls/1/lib/triton_decoder.py", line 350, in _generate_non_streaming
    r = self._exec_triton_request_single(triton_req)
  File "/models/tensorrt_llm_bls/1/lib/triton_decoder.py", line 149, in _exec_triton_request_single
    raise pb_utils.TritonModelException(responses.error().message())
c_python_backend_utils.TritonModelException: Executor failed process requestId 5 due to the following error: Encountered an error in forwardAsync function: GenericLlmRequest::getEncoderInputLen - Do not have encoder length! (/workspace/tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:580)
1  0x78deb6f675e6 tensorrt_llm::batch_manager::GenericLlmRequest<std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::shared_ptr<tensorrt_llm::runtime::CudaStream> >::getEncoderInputLen() const + 246
2  0x78deb6f87d98 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::getRemainingBlocksToCompletion(tensorrt_llm::batch_manager::LlmRequest const&) const + 312
3  0x78deb6f51172 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x2b50172) [0x78deb6f51172]
4  0x78deb6f5152f tensorrt_llm::batch_manager::GuaranteedNoEvictScheduler::operator()(tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::BasePeftCacheManager const>, std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) const + 47
5  0x78deb6f5259f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x2b5159f) [0x78deb6f5259f]
6  0x78deb6f4dfa1 tensorrt_llm::batch_manager::CapacityScheduler::operator()(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::BasePeftCacheManager const>, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager const>) const + 97
7  0x78deb6fe32f9 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 649
8  0x78deb7021297 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 455
9  0x78deb7027755 tensorrt_llm::executor::Executor::Impl::executionLoop() + 1365
10 0x78dfa8308253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x78dfa8308253]
11 0x78dfa7e6bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x78dfa7e6bac3]
12 0x78dfa7efca04 clone + 68
```
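For reference, this is roughly the block we added to `tensorrt_llm/config.pbtxt` to get past the first issue. It uses the standard Triton `parameters` text-proto format; the `0.5` value is just the fraction we picked to unblock ourselves, not a recommended setting:

```
# As we understand it, this sets the fraction of the KV-cache memory pool
# reserved for the cross-attention KV cache of the encoder-decoder model.
# The value 0.5 is only the placeholder we chose to get past the error.
parameters: {
  key: "cross_kv_cache_fraction"
  value: {
    string_value: "0.5"
  }
}
```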
What is this encoder error: `Encountered an error in forwardAsync function: GenericLlmRequest::getEncoderInputLen - Do not have encoder length!`? We set the encoder input length when building the TRT-LLM engine with the following flag: `--max_encoder_input_len 8200` ... Is there another parameter we have to populate when sending requests with the `tensorrt_llm_bls` endpoint?
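In case it helps with debugging: one way to check which inputs the BLS model actually declares (and whether some encoder-length field is among them) is Triton's standard model-config route. This is just a generic KServe v2 query, assuming the default HTTP port 8000 and `jq` for readability, not something from the mllama example itself:

```
# List the input tensors declared by the tensorrt_llm_bls model, so they can
# be compared against the fields sent in the generate request.
curl -s localhost:8000/v2/models/tensorrt_llm_bls/config | jq -r '.input[].name'
```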