[Bug]: [03/27/2024-11:09:02] [TRT] [E] 6: The engine plan file is generated on an incompatible device, expecting compute 8.0 got compute 8.9, please rebuild.

Open teis-e opened this issue 10 months ago • 1 comment

Bug Description

I just built an engine with TensorRT-LLM version 0.8.0, and I get this error when running a LocalTensorRTLLM object:

[03/27/2024-11:09:02] [TRT] [E] 6: The engine plan file is generated on an incompatible device, expecting compute 8.0 got compute 8.9, please rebuild.
[03/27/2024-11:09:02] [TRT] [E] 2: [engine.cpp::deserializeEngine::1148] Error Code 2: Internal Error (Assertion engine->deserialize(start, size, allocator, runtime) failed. )
Traceback (most recent call last):
  File "/opt/tritonserver/ai.py", line 76, in <module>
    llm = LocalTensorRTLLM(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/nvidia_tensorrt/base.py", line 175, in __init__
    decoder = tensorrt_llm.runtime.GenerationSession(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 492, in __init__
    self.runtime = _Runtime(engine_buffer, mapping)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 155, in __init__
    self.__prepare(mapping, engine_buffer)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 176, in __prepare
    assert self.engine is not None
AssertionError
Exception ignored in: <function _Runtime.__del__ at 0x7efff8123370>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 282, in __del__
    cudart.cudaFree(self.address)  # FIXME: cudaFree is None??
AttributeError: '_Runtime' object has no attribute 'address'

I'm still in the same container I built the engine in.

Is there a LlamaIndex version where this TensorRT-LLM version is supported?
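The compute-capability numbers in the message map to GPU architectures: 8.0 is Ampere (e.g. A100), while 8.9 is Ada Lovelace (e.g. RTX 4090, L4, L40). A quick way to see what the loading device actually reports, assuming PyTorch is available in the container (a diagnostic sketch, not part of the original report):

import torch

# Report the compute capability of the first visible GPU.
# The failing engine was built for 8.0; a result of (8, 9) means the
# current device is an Ada Lovelace GPU and cannot load that plan file.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")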

Version

0.10.23

Steps to Reproduce

Instantiate a LocalTensorRTLLM object:

from llama_index.llms.nvidia_tensorrt import LocalTensorRTLLM

# Minimal prompt formatter; the original snippet referenced an undefined
# completion_to_prompt, so a trivial pass-through is shown here.
def completion_to_prompt(completion: str) -> str:
    return completion

llm = LocalTensorRTLLM(
    model_path="./model",
    engine_name="rank0.engine",
    tokenizer_dir="Qwen1.5-72B-Chat-GPTQ-Int4",
    completion_to_prompt=completion_to_prompt,
    verbose=True,
    max_new_tokens=640,
    temperature=0,
)
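The step that fails inside GenerationSession is the engine deserialization. It can be reproduced in isolation with the TensorRT Python API; a minimal sketch, assuming the plan file lives at ./model/rank0.engine as in the snippet above:

import tensorrt as trt

# Attempt to deserialize the plan file directly; this is the same step
# that raises the compute-capability error inside tensorrt_llm.
logger = trt.Logger(trt.Logger.ERROR)
with open("./model/rank0.engine", "rb") as f:
    engine_bytes = f.read()

runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(engine_bytes)
print("engine deserialized OK" if engine is not None else "deserialization failed")

If this succeeds, the problem lies elsewhere; if it fails with the same [TRT] [E] 6 error, the engine genuinely needs to be rebuilt on (or for) the current GPU architecture.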

Relevant Logs/Tracebacks

No response

teis-e · Mar 27 '24 11:03

Thank you for providing detailed information about the issue you're encountering. The engine plan file was serialized for a device with compute capability 8.0, but the GPU it is being deserialized on reports compute capability 8.9, so TensorRT refuses to load it. To better assist you, could you please provide the following additional information:

  1. The specific GPU model you're using.
  2. The exact version of the NVIDIA driver installed on your system.
  3. The version of CUDA Toolkit you're using.

This information will help us understand the compatibility between your system and the TensorRT-LLM version you're using.
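One way to collect all three items in a single script, assuming torch and the nvidia-smi CLI are available in the container (a convenience sketch, not part of the bot's reply):

import subprocess
import torch

# 1. GPU model
print("GPU:", torch.cuda.get_device_name(0))
# 2. NVIDIA driver version, as reported by nvidia-smi
driver = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
).decode().strip()
print("driver:", driver)
# 3. CUDA version PyTorch was built against
print("CUDA:", torch.version.cuda)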

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dosubot[bot] · Mar 27 '24 11:03