
Encoder Only Executor not working

Open · MahmoudAshraf97 opened this issue 8 months ago • 3 comments

Hello, when creating a new executor with an encoder-only engine and enqueuing a request, the result is always:

[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: std::bad_cast

Steps to reproduce:

  1. Build a Whisper encoder engine using the official script, then run the following code:
import tensorrt_llm.bindings.executor as trtllm
import torch
engine_path = "engine_path"
executor = trtllm.Executor(
    model_path=engine_path + "/encoder/",
    model_type=trtllm.ModelType.ENCODER_ONLY,
    executor_config=trtllm.ExecutorConfig(),
)

request = trtllm.Request(
    input_token_ids=[1],
    # `input` holds the preprocessed mel-spectrogram features for the audio clip
    encoder_input_features=input.T.contiguous(),
    encoder_output_length=1500,
    max_tokens=1,
)

executor.enqueue_request(request)
[TensorRT-LLM][INFO] Allocating buffers for encoder output
[TensorRT-LLM][INFO] Changing state of request and setting encoder output to skip encoder run
[TensorRT-LLM][ERROR] ICudaEngine::getTensorDataType: Error Code 3: Internal Error (Given invalid tensor name: logits. Get valid tensor names with getIOTensorName())
[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: std::bad_cast
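
For reference, `input` in the snippet above is the preprocessed mel-spectrogram features tensor. A dummy tensor of the right shape reproduces the same error; the 80 mel bins, 3000 frames, fp16 dtype, and the (n_mels, n_frames) layout below are assumptions that depend on the Whisper checkpoint and the build options used:

import torch

# Hypothetical stand-in for the real features produced by Whisper's feature extractor.
# Shape and dtype are assumptions; match them to how the encoder engine was built.
n_mels, n_frames = 80, 3000
input = torch.randn(n_mels, n_frames, dtype=torch.float16, device="cuda")
# input.T.contiguous() is then passed as encoder_input_features, i.e. (n_frames, n_mels).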

MahmoudAshraf97 · Mar 13 '25

Even if I change the output tensor name from encoder_output to logits and rebuild the engine, the error is still the same.

MahmoudAshraf97 · Mar 13 '25

@MahmoudAshraf97 Hi, I suggest using the custom-defined executor as shown here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py#L149. I'm not sure whether trtllm.Executor is compatible with the Whisper encoder; trtllm.ModelType.ENCODER_ONLY may have hardcoded logic for models like BERT, which take input_ids and output discrete token ids.
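
In case a concrete picture helps, below is a rough sketch of what that session-based path looks like, loosely adapted from the linked run.py. The engine file path and the tensor names (input_features, input_lengths, encoder_output) are assumptions and depend on how the encoder engine was built, so check the engine's I/O tensor names for your build:

import tensorrt as trt
import torch
from tensorrt_llm.runtime import Session, TensorInfo

# Load the serialized encoder engine directly; the path is hypothetical.
with open("engine_path/encoder/rank0.engine", "rb") as f:
    session = Session.from_serialized_engine(f.read())

# Dummy inputs; real values come from Whisper's feature extractor.
mel = torch.randn(1, 80, 3000, dtype=torch.float16, device="cuda")
lengths = torch.tensor([3000], dtype=torch.int32, device="cuda")

inputs = {"input_features": mel, "input_lengths": lengths}
output_info = session.infer_shapes([
    TensorInfo("input_features", trt.DataType.HALF, tuple(mel.shape)),
    TensorInfo("input_lengths", trt.DataType.INT32, tuple(lengths.shape)),
])
# Allocate output buffers from the inferred shapes (fp16 output is an assumption).
outputs = {t.name: torch.empty(tuple(t.shape), dtype=torch.float16, device="cuda")
           for t in output_info}

ok = session.run(inputs=inputs, outputs=outputs,
                 stream=torch.cuda.current_stream().cuda_stream)
assert ok, "encoder engine execution failed"
torch.cuda.synchronize()
encoder_output = outputs["encoder_output"]  # tensor name is an assumption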

yuekaizhang · Mar 14 '25

Hi @yuekaizhang, the main reason for using the executor API is the batching manager: it lets me submit individual requests in a similar manner to the LLM API, which is much faster than Triton server. With the custom-defined executor that isn't available, since I would have to handle batching and distribution of the results myself, as in the sketch below. Maybe this should be considered a feature request, since many encoder-only models would benefit from it, such as embedding models and encoder-only speech models, and the code is already there; it isn't a completely separate implementation.
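
To illustrate what "handling batching myself" means, here is a minimal sketch. It is purely illustrative and not TensorRT-LLM API: requests are collected in a queue, grouped into micro-batches, run through a single encoder call, and handed back through futures. run_encoder_batch is a hypothetical stand-in for the session call:

import queue
import threading
from concurrent.futures import Future

MAX_BATCH = 8   # largest micro-batch to form
WAIT_S = 0.01   # how long to wait for more requests before running a partial batch

_pending: queue.Queue = queue.Queue()

def run_encoder_batch(batch):
    # Hypothetical stand-in for running the encoder engine on a whole batch;
    # the identity keeps this sketch self-contained and runnable.
    return list(batch)

def submit(features) -> Future:
    """Enqueue one request and return a Future that will hold its encoder output."""
    fut = Future()
    _pending.put((features, fut))
    return fut

def _batching_loop():
    while True:
        items = [_pending.get()]  # block until at least one request arrives
        try:
            while len(items) < MAX_BATCH:
                items.append(_pending.get(timeout=WAIT_S))
        except queue.Empty:
            pass  # run whatever was collected
        feats, futs = zip(*items)
        for fut, out in zip(futs, run_encoder_batch(feats)):
            fut.set_result(out)

threading.Thread(target=_batching_loop, daemon=True).start()

The executor API already handles this kind of scheduling for decoder and encoder-decoder models, which is why exposing it for encoder-only engines would be valuable.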

MahmoudAshraf97 · Mar 14 '25