
Missing lookAheadRuntimeConfig in Triton Server with TensorRT-LLM backend HTTP Request

shaylapid opened this issue 8 months ago · 2 comments

System Info

  • CPU architecture: x86_64
  • GPU NVIDIA H100 80GB
  • TensorRT-LLM backend tag: v0.17.0
  • Container used: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
  • OS Debian GNU/Linux 11 (bullseye)

Who can help?

No response

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Build the model:

Start the container:

docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v </path/to/git/tensorrtllm_backend>:/tensorrtllm_backend \
    -v </path/to/engines>:/model/engine \
    -v </path/to/hf-checkpoint>:/model/src \
    nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3

Quantize the model:

cd /tensorrtllm_backend/tensorrt_llm/examples/quantization;
python quantize.py \
    --model_dir /model/src  \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir /model/build

Build:

trtllm-build \
    --checkpoint_dir /model/build \
    --output_dir /model/engine \
    --gpt_attention_plugin auto \
    --gemm_plugin fp8 \
    --gemm_swiglu_plugin fp8 \
    --low_latency_gemm_swiglu_plugin fp8 \
    --remove_input_padding enable \
    --context_fmha enable \
    --max_beam_width 1 \
    --max_num_tokens 1000 \
    --max_seq_len 250 \
    --max_input_len 200 \
    --max_batch_size 4 \
    --use_fused_mlp enable \
    --use_fp8_context_fmha enable \
    --use_paged_context_fmha enable \
    --speculative_decoding_mode lookahead_decoding \
    --max_draft_len 15

Adapt the model repo:

Add the following to config.pbtxt:

parameters: {
  key: "decoding_mode"
  value: {
    string_value: "lookahead"
  }
}

Run with Triton Server:

Start the container:

docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v <path/to/model>:/models \
    nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3

Start tritonserver:

tritonserver --model-repository=/models
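
Before sending any requests I sanity-check that the server and model are up. This is just a quick check against the standard KServe v2 readiness endpoints Triton exposes (assuming the default HTTP port 8000 and the model name used in the request below):

import requests

# Quick readiness check using Triton's standard KServe v2 HTTP endpoints
# (default HTTP port 8000; the model name matches the inference request below).
server_ready = requests.get("http://localhost:8000/v2/health/ready")
model_ready = requests.get("http://localhost:8000/v2/models/tensorrt_llm_2beam/ready")
print("server ready:", server_ready.status_code == 200)
print("model ready:", model_ready.status_code == 200)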

Run Inference

Use the following Python script:

import requests

response = requests.post(
    "http://localhost:8000/v2/models/tensorrt_llm_2beam/infer",
    json={
        "inputs": [
            {
                "name": "input_ids",
                "shape": [1, 4],
                "datatype": "INT32",
                "data": [[750, 23811, 31792, 4555]],  # "def hello_world():"
            },
            {
                "name": "input_lengths",
                "shape": [1, 1],
                "datatype": "INT32",
                "data": [[4]],
            },
            {
                "name": "request_output_len",
                "shape": [1, 1],
                "datatype": "INT32",
                "data": [[20]],
            },
        ]
    },
)
try:
    response.raise_for_status()
    print(response.json())
except requests.exceptions.RequestException as e:
    print(response.json()["error"])

Expected behavior

Successfully infer and print the generated tokens ("output_ids" in response.json()).
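
For reference, this is roughly how I would pull the generated tokens out of a successful response, assuming the standard KServe v2 JSON layout where each output appears under "outputs" with its name, shape, and flattened data:

# Sketch of reading "output_ids" from a successful v2 inference response
# (uses the `response` object from the script above).
result = response.json()
output_ids = next(o for o in result["outputs"] if o["name"] == "output_ids")
print("shape:", output_ids["shape"])
print("tokens:", output_ids["data"])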

Actual behavior

response.raise_for_status() raises an HTTPError (status 500), and the server reports the following error:

Executor failed process requestId 15 due to the following error: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: requests[bi].lookaheadRuntimeConfig (/workspace/tensorrt_llm/cpp/tensorrt_llm/runtime/gptDecoder.cpp:218)
1  0x7f0e8b6bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 95
2  0x7f0e8ba92022 tensorrt_llm::runtime::GptDecoder<__half>::setup(tensorrt_llm::runtime::SamplingConfig const&, unsigned long, std::shared_ptr<tensorrt_llm::runtime::ITensor const> const&, std::optional<tensorrt_llm::runtime::DecodingOutput> const&, std::optional<std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const> const&) + 3074
3  0x7f0e8baa5d0e tensorrt_llm::runtime::GptDecoderBatched::newRequests(std::vector<int, std::allocator > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator<tensorrt_llm::runtime::SamplingConfig> > const&, tensorrt_llm::runtime::ModelConfig const&) + 590
4  0x7f0e8c4f87f5 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::shared_ptr<tensorrt_llm::batch_manager::RuntimeBuffers>&) + 1717
5  0x7f0e8c4fbf40 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1792
6  0x7f0e8c594189 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 457
7  0x7f0e8c5a07df tensorrt_llm::executor::Executor::Impl::executionLoop() + 1247
8  0x7f1130391db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f1130391db4]
9  0x7f113012fa94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7f113012fa94]
10 0x7f11301bca34 __clone + 68

additional notes

It seems that Triton expects to receive a lookaheadRuntimeConfig per request, which I assume should carry the parameters window_size, ngram_size, and verification_set_size in some form. However, I couldn't find a reference for how to pass them at inference time, nor for how to declare them as inputs in the model repository's config.pbtxt.
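
For illustration only, this is the kind of per-request extension I would expect to need; the input names below are purely my guess and are not taken from any documentation I could find:

# Hypothetical extra inputs for the request above -- the names are my guess,
# not documented anywhere I could find. Values are placeholders.
lookahead_inputs = [
    {"name": "lookahead_window_size", "shape": [1, 1], "datatype": "INT32", "data": [[4]]},
    {"name": "lookahead_ngram_size", "shape": [1, 1], "datatype": "INT32", "data": [[4]]},
    {"name": "lookahead_verification_set_size", "shape": [1, 1], "datatype": "INT32", "data": [[4]]},
]
# e.g. payload["inputs"].extend(lookahead_inputs) before posting the request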

shaylapid · Feb 18 '25