
Encoder Only Executor not working

Open · MahmoudAshraf97 opened this issue 8 months ago • 3 comments

Hello, when creating a new executor with an encoder-only engine and enqueuing a request, the result is always:

[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: std::bad_cast

Steps to reproduce:

  1. Build a Whisper encoder engine using the official script, then run the following code:
import tensorrt_llm.bindings.executor as trtllm
import torch
engine_path = "engine_path"
executor = trtllm.Executor(
    model_path=engine_path + "/encoder/",
    model_type=trtllm.ModelType.ENCODER_ONLY,
    executor_config=trtllm.ExecutorConfig(),
)

request = trtllm.Request(
    input_token_ids=[1],
    # `input` holds the preprocessed mel-spectrogram features for the audio clip
    encoder_input_features=input.T.contiguous(),
    encoder_output_length=1500,
    max_tokens=1,
)

executor.enqueue_request(request)
[TensorRT-LLM][INFO] Allocating buffers for encoder output
[TensorRT-LLM][INFO] Changing state of request and setting encoder output to skip encoder run
[TensorRT-LLM][ERROR] ICudaEngine::getTensorDataType: Error Code 3: Internal Error (Given invalid tensor name: logits. Get valid tensor names with getIOTensorName())
[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: std::bad_cast
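
For reference, `input` in the snippet above is the preprocessed mel-spectrogram features tensor. A dummy tensor of the right shape reproduces the same error; the 80 mel bins, 3000 frames, fp16 dtype, and the (n_mels, n_frames) layout below are assumptions that depend on the Whisper checkpoint and the build options used:

import torch

# Hypothetical stand-in for the real features produced by Whisper's feature extractor.
# Shape and dtype are assumptions; match them to how the encoder engine was built.
n_mels, n_frames = 80, 3000
input = torch.randn(n_mels, n_frames, dtype=torch.float16, device="cuda")
# input.T.contiguous() is then passed as encoder_input_features, i.e. (n_frames, n_mels).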

MahmoudAshraf97 · Mar 13 '25

Even if I change the output tensor name from encoder_output to logits and rebuild the engine, the error is still the same.

MahmoudAshraf97 · Mar 13 '25

@MahmoudAshraf97 Hi, I suggest using the custom-defined executor as shown here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py#L149. I'm not sure whether trtllm.Executor is compatible with the Whisper encoder; trtllm.ModelType.ENCODER_ONLY may have hardcoded logic for models like BERT, which take input_ids and output discrete token ids.
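
In case a concrete picture helps, below is a rough sketch of what that session-based path looks like, loosely adapted from the linked run.py. The engine file path and the tensor names (input_features, input_lengths, encoder_output) are assumptions and depend on how the encoder engine was built, so check the engine's I/O tensor names for your build:

import tensorrt as trt
import torch
from tensorrt_llm.runtime import Session, TensorInfo

# Load the serialized encoder engine directly; the path is hypothetical.
with open("engine_path/encoder/rank0.engine", "rb") as f:
    session = Session.from_serialized_engine(f.read())

# Dummy inputs; real values come from Whisper's feature extractor.
mel = torch.randn(1, 80, 3000, dtype=torch.float16, device="cuda")
lengths = torch.tensor([3000], dtype=torch.int32, device="cuda")

inputs = {"input_features": mel, "input_lengths": lengths}
output_info = session.infer_shapes([
    TensorInfo("input_features", trt.DataType.HALF, tuple(mel.shape)),
    TensorInfo("input_lengths", trt.DataType.INT32, tuple(lengths.shape)),
])
# Allocate output buffers from the inferred shapes (fp16 output is an assumption).
outputs = {t.name: torch.empty(tuple(t.shape), dtype=torch.float16, device="cuda")
           for t in output_info}

ok = session.run(inputs=inputs, outputs=outputs,
                 stream=torch.cuda.current_stream().cuda_stream)
assert ok, "encoder engine execution failed"
torch.cuda.synchronize()
encoder_output = outputs["encoder_output"]  # tensor name is an assumption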

yuekaizhang · Mar 14 '25

Hi @yuekaizhang, the main reason for using the executor API is the batching manager: it lets me submit individual requests in a similar manner to the LLM API, which is much faster than Triton server. With the custom-defined executor that isn't available, since I would have to handle batching and distribution of the results myself, as in the sketch below. Maybe this should be considered a feature request, since many encoder-only models would benefit from it, such as embedding models and encoder-only speech models, and the code is already there; it isn't a completely separate implementation.
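
To illustrate what "handling batching myself" means, here is a minimal sketch. It is purely illustrative and not TensorRT-LLM API: requests are collected in a queue, grouped into micro-batches, run through a single encoder call, and handed back through futures. run_encoder_batch is a hypothetical stand-in for the session call:

import queue
import threading
from concurrent.futures import Future

MAX_BATCH = 8   # largest micro-batch to form
WAIT_S = 0.01   # how long to wait for more requests before running a partial batch

_pending: queue.Queue = queue.Queue()

def run_encoder_batch(batch):
    # Hypothetical stand-in for running the encoder engine on a whole batch;
    # the identity keeps this sketch self-contained and runnable.
    return list(batch)

def submit(features) -> Future:
    """Enqueue one request and return a Future that will hold its encoder output."""
    fut = Future()
    _pending.put((features, fut))
    return fut

def _batching_loop():
    while True:
        items = [_pending.get()]  # block until at least one request arrives
        try:
            while len(items) < MAX_BATCH:
                items.append(_pending.get(timeout=WAIT_S))
        except queue.Empty:
            pass  # run whatever was collected
        feats, futs = zip(*items)
        for fut, out in zip(futs, run_encoder_batch(feats)):
            fut.set_result(out)

threading.Thread(target=_batching_loop, daemon=True).start()

The executor API already handles this kind of scheduling for decoder and encoder-decoder models, which is why exposing it for encoder-only engines would be valuable.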

MahmoudAshraf97 · Mar 14 '25