TensorRT-LLM
Encoder Only Executor not working
Hello, when creating a new executor using an encoder engine and enqueuing a request, the result is always:
[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: std::bad_cast
Steps to reproduce:
- Build a Whisper encoder using the official script and run the following code:
import tensorrt_llm.bindings.executor as trtllm
import torch

engine_path = "engine_path"

# Placeholder mel-spectrogram features; the shape [n_mels, n_frames] for a
# 30 s Whisper input is an assumption, replace with real features.
input_features = torch.randn(128, 3000, dtype=torch.float16)

executor = trtllm.Executor(
    model_path=engine_path + "/encoder/",
    model_type=trtllm.ModelType.ENCODER_ONLY,
    executor_config=trtllm.ExecutorConfig(),
)
request = trtllm.Request(
    input_token_ids=[1],
    encoder_input_features=input_features.T.contiguous(),
    encoder_output_length=1500,  # Whisper downsamples 3000 frames to 1500
    max_tokens=1,
)
executor.enqueue_request(request)
The log output is:
[TensorRT-LLM][INFO] Allocating buffers for encoder output
[TensorRT-LLM][INFO] Changing state of request and setting encoder output to skip encoder run
[TensorRT-LLM][ERROR] ICudaEngine::getTensorDataType: Error Code 3: Internal Error (Given invalid tensor name: logits. Get valid tensor names with getIOTensorName())
[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: std::bad_cast
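For anyone reproducing this: the valid names the error message refers to can be listed directly from the engine. A minimal sketch, assuming the serialized engine is at engine_path/encoder/rank0.engine (the filename depends on your build):

import tensorrt as trt

# List the engine's actual I/O tensor names, as the error message suggests.
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("engine_path/encoder/rank0.engine", "rb") as f:  # assumed path
    engine = runtime.deserialize_cuda_engine(f.read())
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name, engine.get_tensor_mode(name), engine.get_tensor_dtype(name))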
Even if I change the output tensor name from encoder_output to logits and rebuild the engine, the error is still the same.
@MahmoudAshraf97 Hi, I suggest using the custom-defined executor as shown here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py#L149. I'm not sure if trtllm.Executor is compatible with the Whisper encoder. The trtllm.ModelType.ENCODER_ONLY may have hardcoded logic for models like BERT that accept input_ids and output discrete ids.
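For reference, the encoder path in run.py boils down to driving a raw tensorrt_llm.runtime.Session directly. A minimal sketch, assuming the engine file is rank0.engine and the I/O tensors are named input_features, input_lengths, and encoder_output (the names depend on your build, so inspect the engine first):

import tensorrt as trt
import torch
from tensorrt_llm._utils import trt_dtype_to_torch
from tensorrt_llm.runtime import Session, TensorInfo

with open("engine_path/encoder/rank0.engine", "rb") as f:
    session = Session.from_serialized_engine(f.read())

# Assumed input layout: [batch, n_mels, n_frames] mel features.
mel = torch.randn(1, 128, 3000, dtype=torch.float16, device="cuda")
input_lengths = torch.tensor([3000], dtype=torch.int32, device="cuda")
inputs = {"input_features": mel, "input_lengths": input_lengths}

# Ask the engine for output shapes/dtypes, then allocate buffers for them.
output_info = session.infer_shapes([
    TensorInfo("input_features", trt.DataType.HALF, mel.shape),
    TensorInfo("input_lengths", trt.DataType.INT32, input_lengths.shape),
])
outputs = {
    t.name: torch.empty(tuple(t.shape), dtype=trt_dtype_to_torch(t.dtype), device="cuda")
    for t in output_info
}

stream = torch.cuda.current_stream()
assert session.run(inputs, outputs, stream.cuda_stream)
stream.synchronize()
encoder_output = outputs["encoder_output"]

This runs one request at a time, which is exactly the batching limitation raised in the reply below.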
Hi @yuekaizhang, the main reason for using the Executor API is its batching manager: it lets me submit individual requests much like the LLM API does, which is much faster than the Triton server. If I use the custom-defined executor, that is not available, since I would have to handle batching and distributing the results myself. Maybe this should be considered a feature request: many encoder-only models would benefit from it, such as embedding models and encoder-only speech models, and the code is already there; it is not as if a completely separate implementation were needed.
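For comparison, this is the enqueue/await pattern the Executor API already provides for decoder engines, which is what would be valuable to have for encoder-only engines too. A sketch, assuming a decoder engine at engine_path/decoder/:

import tensorrt_llm.bindings.executor as trtllm

executor = trtllm.Executor(
    model_path="engine_path/decoder/",
    model_type=trtllm.ModelType.DECODER_ONLY,
    executor_config=trtllm.ExecutorConfig(),
)

# Enqueue requests individually; the in-flight batching manager groups them.
request_ids = [
    executor.enqueue_request(trtllm.Request(input_token_ids=ids, max_tokens=16))
    for ids in ([1, 2, 3], [4, 5], [6])
]

# Collect results as they finish, in whatever order the runtime produces them.
pending = set(request_ids)
while pending:
    for response in executor.await_responses():
        if response.has_error():
            raise RuntimeError(response.error_msg)
        if response.result.is_final:
            pending.discard(response.request_id)
            print(response.request_id, response.result.output_token_ids)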