
user trtllm-serve error

SafeCool opened this issue on May 16, 2025

System Info

RuntimeError: Failed to import transformers.models.bert.modeling_bert because of the following error (look up to see its traceback): /usr/local/lib/python3.12/dist-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
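The undefined symbol demangles to a c10 (libtorch) constructor, which points at an ABI/version mismatch between flash-attn and the installed torch. A minimal check, assuming binutils' c++filt is on PATH (it is in the NGC PyTorch images):

import subprocess

# Demangle the missing symbol; it resolves to
# c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<...>),
# i.e. flash-attn was compiled against a different torch build than the
# one installed in this environment.
sym = "_ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"
print(subprocess.run(["c++filt", sym], capture_output=True, text=True).stdout)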

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

trtllm-serve error

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 5, in <module>
    from tensorrt_llm.commands.serve import main
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/__init__.py", line 33, in <module>
    import tensorrt_llm.models as models
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/__init__.py", line 16, in <module>
    from .bert.model import (BertForQuestionAnswering,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/bert/model.py", line 32, in <module>
    from .convert import (load_hf_bert_base, load_hf_bert_cls, load_hf_bert_qa,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/bert/convert.py", line 23, in <module>
    from transformers import (BertPreTrainedModel, RobertaPreTrainedModel)
  File "<frozen importlib._bootstrap>", line 1412, in _handle_fromlist
  File "/usr/local/lib/python3.12/dist-packages/transformers/utils/import_utils.py", line 1956, in __getattr__
    value = getattr(module, name)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/utils/import_utils.py", line 1955, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/utils/import_utils.py", line 1969, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.bert.modeling_bert because of the following error (look up to see its traceback):
/usr/local/lib/python3.12/dist-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

root@VM-50-46-tencentos:/mnt/data2/dyd/tenorrt_0506# pip list|grep transformers

Expected behavior

trtllm-serve should start and serve the model without import errors.

actual behavior

trtllm-serve fails at import time with the undefined-symbol RuntimeError shown above.

additional notes

None.

SafeCool avatar May 16 '25 07:05 SafeCool

Which release are you working with? Can you try with our main branch? If there's still an issue, please share steps to reproduce along with hardware details.
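For example, a quick sketch that prints the installed versions even when the import itself fails, since it only reads package metadata:

from importlib.metadata import version

# Reads wheel metadata only, so it works even though `import tensorrt_llm`
# currently dies with the undefined-symbol error.
for pkg in ("tensorrt_llm", "torch", "flash_attn", "transformers"):
    print(pkg, version(pkg))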

brb-nv avatar May 16 '25 23:05 brb-nv

I encountered the same issue.

Environment:

  • Base image: nvcr.io/nvidia/pytorch:25.04-py3
  • I ran the following steps:
[ -f /etc/pip/constraint.txt ] && : > /etc/pip/constraint.txt
pip uninstall -y tensorrt
pip install tensorrt-llm
cd TensorRT-LLM/examples/auto_deploy
python build_and_run_ad.py --config '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"}'

But I got the same error during execution.
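A quick sanity check (a sketch; flash_attn is my assumption about which extension trips first) to see whether pip install tensorrt-llm left the container with a mismatched torch/flash-attn pair:

import torch

# torch comes preinstalled in nvcr.io/nvidia/pytorch:25.04-py3; `pip install
# tensorrt-llm` can pull different torch/flash-attn wheels on top of it.
print(torch.__version__, torch.version.cuda)
print(torch.compiled_with_cxx11_abi())  # C++ ABI flag extensions must match

import flash_attn  # raises the undefined-symbol ImportError if mismatched
print(flash_attn.__version__)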

caoyicheng11 avatar May 22 '25 10:05 caoyicheng11

I'm also getting a similar error in the NGC container nvcr.io/nvidia/pytorch:25.04-py3 with tensorrt_llm-0.19.0:

In [1]: import tensorrt_llm
ImportError: /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs

kHarshit avatar May 22 '25 20:05 kHarshit

@lucaslie , do you think you can answer the question about auto deploy?

brb-nv avatar May 23 '25 00:05 brb-nv

Same problem here, but 0.20.0rc3 works. I think you should use an earlier PyTorch image, like nvcr.io/nvidia/pytorch:25.01-py3 (not tested).

bnuzhanyu avatar May 23 '25 08:05 bnuzhanyu

Thanks, I was able to resolve it by switching to nvcr.io/nvidia/pytorch:25.01-py3 with TensorRT-LLM version 0.20.0rc3. However, when I run python build_and_run_ad.py --config '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"}', I now get an AssertionError: Model Factory AutoModelForCausalLM not found.

Traceback (most recent call last):
  File "/data_8t_1/qby/build_and_run_ad.py", line 135, in <module>
    main()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/data_8t_1/qby/build_and_run_ad.py", line 97, in main
    llm = build_llm_from_config(config)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data_8t_1/qby/build_and_run_ad.py", line 60, in build_llm_from_config
    factory = ModelFactoryRegistry.get(config.model_factory)(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/auto_deploy/models/factory.py", line 154, in get
    assert cls.has(name), f"Model Factory {name} not found."
           ^^^^^^^^^^^^^
AssertionError: Model Factory AutoModelForCausalLM not found.

After investigating, I found that the ModelFactoryRegistry._registry only contains 'hf', but config.model_factory is set to 'AutoModelForCausalLM'.

@classmethod
def get(cls, name: str) -> Type[ModelFactory]:
    assert cls.has(name), f"Model Factory {name} not found."
    return cls._registry[name]

When I comment out the assertion and force it to return cls._registry['hf'], the code runs correctly. So I’m wondering: is 'AutoModelForCausalLM' supposed to be registered somewhere and wasn’t, or should I manually change the config to use 'hf' instead? Would appreciate any guidance on the correct usage here.

@classmethod
def get(cls, name: str) -> Type[ModelFactory]:
    # assert cls.has(name), f"Model Factory {name} not found."
    return cls._registry['hf']
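In the meantime, a less invasive workaround sketch than editing get() might be to alias the missing key (untested beyond 0.20.0rc3, and _registry is a private attribute, so this may break on other versions):

from tensorrt_llm._torch.auto_deploy.models.factory import ModelFactoryRegistry

# Reuse the registered "hf" factory under the name the config asks for,
# instead of patching get(). `_registry` is private, so this is
# version-specific (observed on 0.20.0rc3).
if not ModelFactoryRegistry.has("AutoModelForCausalLM"):
    ModelFactoryRegistry._registry["AutoModelForCausalLM"] = ModelFactoryRegistry.get("hf")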

caoyicheng11 avatar May 23 '25 13:05 caoyicheng11