llama_index
[Bug]: [03/27/2024-11:09:02] [TRT] [E] 6: The engine plan file is generated on an incompatible device, expecting compute 8.0 got compute 8.9, please rebuild.
Bug Description
I just built an engine with TensorRT-LLM version 0.8.0 and get this error when running a LocalTensorRTLLM object:
[03/27/2024-11:09:02] [TRT] [E] 6: The engine plan file is generated on an incompatible device, expecting compute 8.0 got compute 8.9, please rebuild.
[03/27/2024-11:09:02] [TRT] [E] 2: [engine.cpp::deserializeEngine::1148] Error Code 2: Internal Error (Assertion engine->deserialize(start, size, allocator, runtime) failed. )
Traceback (most recent call last):
  File "/opt/tritonserver/ai.py", line 76, in <module>
    llm = LocalTensorRTLLM(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/nvidia_tensorrt/base.py", line 175, in __init__
    decoder = tensorrt_llm.runtime.GenerationSession(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 492, in __init__
    self.runtime = _Runtime(engine_buffer, mapping)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 155, in __init__
    self.__prepare(mapping, engine_buffer)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 176, in __prepare
    assert self.engine is not None
AssertionError
Exception ignored in: <function _Runtime.__del__ at 0x7efff8123370>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 282, in __del__
    cudart.cudaFree(self.address)  # FIXME: cudaFree is None??
AttributeError: '_Runtime' object has no attribute 'address'
I'm still in the same container I built the engine in.
Is there a version of llama-index that supports this TensorRT-LLM version?
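For context on the error itself: compute capability 8.0 corresponds to Ampere data-center GPUs such as the A100, while 8.9 corresponds to Ada Lovelace GPUs (e.g. L4, L40, RTX 40 series), so the engine plan appears to have been built for a different GPU model than the one deserializing it. A quick way to confirm what the current device reports (a minimal sketch, assuming PyTorch is available in the container, as it is in the TensorRT-LLM images):

import torch

# A serialized engine plan only loads on a GPU whose (major, minor)
# compute capability matches the one it was built for.
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")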
Version
0.10.23
Steps to Reproduce
Instantiate a LocalTensorRTLLM object:
from llama_index.llms.nvidia_tensorrt import LocalTensorRTLLM

llm = LocalTensorRTLLM(
    model_path="./model",
    engine_name="rank0.engine",
    tokenizer_dir="Qwen1.5-72B-Chat-GPTQ-Int4",
    completion_to_prompt=completion_to_prompt,
    verbose=True,
    max_new_tokens=640,
    temperature=0,
)
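Note that completion_to_prompt above is a user-supplied callable that is not shown in the report. A minimal, purely hypothetical sketch of one for Qwen's ChatML-style chat format (an assumption; check the template your tokenizer actually expects) might look like:

def completion_to_prompt(completion: str) -> str:
    # Hypothetical ChatML-style wrapper for Qwen chat models (assumed format).
    return (
        "<|im_start|>user\n"
        f"{completion}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )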
Relevant Logs/Tracebacks
No response
Thank you for providing detailed information about the issue you're encountering. The error indicates that the serialized engine plan was built for a device with compute capability 8.0, but the GPU it is being loaded on reports compute capability 8.9, so TensorRT refuses to deserialize it. To better assist you, could you please provide the following additional information:
- The specific GPU model you're using.
- The exact version of the NVIDIA driver installed on your system.
- The version of CUDA Toolkit you're using.
This information will help us understand the compatibility between your system and the TensorRT-LLM version you're using.
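All three can be collected from inside the container with standard tooling (a sketch, assuming nvidia-smi and PyTorch are present):

import subprocess
import torch

# GPU model and installed driver version (nvidia-smi ships with the driver)
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip())

# CUDA version this PyTorch build was compiled against
print("CUDA (torch build):", torch.version.cuda)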