TensorRT-LLM
Assertion failed: Unsupported architecture (/tensorrt_llm/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp:89)
System Info
- CPU: x86_64
- GPU: NVIDIA GeForce RTX 2080 Ti
- Memory Size: 12 GiB
- TensorRT-LLM branch: 0.10.0.dev2024042300
- TensorRT: 9.3.0.post12.dev1
- Ubuntu: 22.04
- CUDA: 12.1.0
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- huggingface-cli login --token xxxxx (with the actual token from Hugging Face)
- python3 convert_checkpoint.py --model_dir meta-llama/Llama-2-7b-chat-hf --output_dir llama-2-7b-ckpt
- trtllm-build --checkpoint_dir llama-2-7b-ckpt --gemm_plugin float16 --output_dir ./llama-2-7b-engine
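Between steps 2 and 3, the converted checkpoint can be sanity-checked before building. A minimal sketch, assuming the converter's default output layout where the checkpoint directory contains a config.json with an "architecture" field:

    # Hypothetical sanity check: confirm convert_checkpoint.py produced a checkpoint.
    import json, os

    ckpt_dir = "llama-2-7b-ckpt"
    print(os.listdir(ckpt_dir))                  # expect config.json plus rank*.safetensors
    with open(os.path.join(ckpt_dir, "config.json")) as f:
        print(json.load(f).get("architecture"))  # e.g. "LlamaForCausalLM" (assumed field name)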
Expected behavior
I am following the quick start guide; this step should compile the Llama 2 model into a TensorRT engine.
Actual behavior
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024042300
[04/28/2024-15:57:09] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set nccl_plugin to float16.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set lookup_plugin to None.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set lora_plugin to None.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set moe_plugin to float16.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set context_fmha to True.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set remove_input_padding to True.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set multi_block_mode to False.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set enable_xqa to True.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set multiple_profiles to False.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set paged_state to True.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set streamingllm to False.
[04/28/2024-15:57:09] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/28/2024-15:57:09] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[04/28/2024-15:57:10] [TRT-LLM] [I] Compute capability: (7, 5)
[04/28/2024-15:57:10] [TRT-LLM] [I] SM count: 68
[04/28/2024-15:57:10] [TRT-LLM] [I] SM clock: 2100 MHz
[04/28/2024-15:57:10] [TRT-LLM] [I] int4 TFLOPS: 584
[04/28/2024-15:57:10] [TRT-LLM] [I] int8 TFLOPS: 292
[04/28/2024-15:57:10] [TRT-LLM] [I] fp8 TFLOPS: 0
[04/28/2024-15:57:10] [TRT-LLM] [I] float16 TFLOPS: 146
[04/28/2024-15:57:10] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[04/28/2024-15:57:10] [TRT-LLM] [I] float32 TFLOPS: 18
[04/28/2024-15:57:10] [TRT-LLM] [I] Total Memory: 11 GiB
[04/28/2024-15:57:10] [TRT-LLM] [I] Memory clock: 7000 MHz
[04/28/2024-15:57:10] [TRT-LLM] [I] Memory bus width: 352
[04/28/2024-15:57:10] [TRT-LLM] [I] Memory bandwidth: 616 GB/s
[04/28/2024-15:57:10] [TRT-LLM] [I] NVLink is active: False
[04/28/2024-15:57:10] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[04/28/2024-15:57:10] [TRT-LLM] [I] PCIe link width: 16
[04/28/2024-15:57:10] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[04/28/2024-15:57:10] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 259, GPU 157 (MiB)
[04/28/2024-15:57:12] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +978, GPU +180, now: CPU 1373, GPU 337 (MiB)
[04/28/2024-15:57:12] [TRT-LLM] [I] Set nccl_plugin to None.
[04/28/2024-15:57:12] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/28/2024-15:57:12] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/28/2024-15:57:12] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: Unsupported architecture (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp:89)
1 0x7f120b494f3b tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82
2 0x7f120b497194 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x691194) [0x7f120b497194]
3 0x7f11deca9eef tensorrt_llm::plugins::GPTAttentionPluginCommon::initialize() + 415
4 0x7f11decd1e6d tensorrt_llm::plugins::GPTAttentionPlugin* tensorrt_llm::plugins::GPTAttentionPluginCommon::cloneImpl<tensorrt_llm::plugins::GPTAttentionPlugin>() const + 573
5 0x7f12dfbd1279 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0xae3279) [0x7f12dfbd1279]
6 0x7f12dfb1f02e /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0xa3102e) [0x7f12dfb1f02e]
7 0x7f128b6dfcef /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xdfcef) [0x7f128b6dfcef]
8 0x7f128b643443 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x43443) [0x7f128b643443]
9 0x5617f827010e /usr/bin/python3(+0x15a10e) [0x5617f827010e]
10 0x5617f8266a7b _PyObject_MakeTpCall + 603
11 0x5617f827eacb /usr/bin/python3(+0x168acb) [0x5617f827eacb]
12 0x5617f825ecfa _PyEval_EvalFrameDefault + 24906
13 0x5617f82709fc _PyFunction_Vectorcall + 124
14 0x5617f827f492 PyObject_Call + 290
15 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
16 0x5617f82709fc _PyFunction_Vectorcall + 124
17 0x5617f827f492 PyObject_Call + 290
18 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
19 0x5617f827e7f1 /usr/bin/python3(+0x1687f1) [0x5617f827e7f1]
20 0x5617f827f492 PyObject_Call + 290
21 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
22 0x5617f82709fc _PyFunction_Vectorcall + 124
23 0x5617f8265cbd _PyObject_FastCallDictTstate + 365
24 0x5617f827b86c _PyObject_Call_Prepend + 92
25 0x5617f8396700 /usr/bin/python3(+0x280700) [0x5617f8396700]
26 0x5617f8266a7b _PyObject_MakeTpCall + 603
27 0x5617f8260150 _PyEval_EvalFrameDefault + 30112
28 0x5617f827e7f1 /usr/bin/python3(+0x1687f1) [0x5617f827e7f1]
29 0x5617f827f492 PyObject_Call + 290
30 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
31 0x5617f82709fc _PyFunction_Vectorcall + 124
32 0x5617f8265cbd _PyObject_FastCallDictTstate + 365
33 0x5617f827b86c _PyObject_Call_Prepend + 92
34 0x5617f8396700 /usr/bin/python3(+0x280700) [0x5617f8396700]
35 0x5617f827f42b PyObject_Call + 187
36 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
37 0x5617f827e7f1 /usr/bin/python3(+0x1687f1) [0x5617f827e7f1]
38 0x5617f825a53c _PyEval_EvalFrameDefault + 6540
39 0x5617f827e7f1 /usr/bin/python3(+0x1687f1) [0x5617f827e7f1]
40 0x5617f827f492 PyObject_Call + 290
41 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
42 0x5617f827e7f1 /usr/bin/python3(+0x1687f1) [0x5617f827e7f1]
43 0x5617f827f492 PyObject_Call + 290
44 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
45 0x5617f82709fc _PyFunction_Vectorcall + 124
46 0x5617f8265cbd _PyObject_FastCallDictTstate + 365
47 0x5617f827b86c _PyObject_Call_Prepend + 92
48 0x5617f8396700 /usr/bin/python3(+0x280700) [0x5617f8396700]
49 0x5617f827f42b PyObject_Call + 187
50 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
51 0x5617f82709fc _PyFunction_Vectorcall + 124
52 0x5617f825926d _PyEval_EvalFrameDefault + 1725
53 0x5617f82709fc _PyFunction_Vectorcall + 124
54 0x5617f827f492 PyObject_Call + 290
55 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
56 0x5617f82709fc _PyFunction_Vectorcall + 124
57 0x5617f827f492 PyObject_Call + 290
58 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
59 0x5617f82709fc _PyFunction_Vectorcall + 124
60 0x5617f827f492 PyObject_Call + 290
61 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
62 0x5617f82709fc _PyFunction_Vectorcall + 124
63 0x5617f825926d _PyEval_EvalFrameDefault + 1725
64 0x5617f82559c6 /usr/bin/python3(+0x13f9c6) [0x5617f82559c6]
65 0x5617f834b256 PyEval_EvalCode + 134
66 0x5617f8376108 /usr/bin/python3(+0x260108) [0x5617f8376108]
67 0x5617f836f9cb /usr/bin/python3(+0x2599cb) [0x5617f836f9cb]
68 0x5617f8375e55 /usr/bin/python3(+0x25fe55) [0x5617f8375e55]
69 0x5617f8375338 _PyRun_SimpleFileObject + 424
70 0x5617f8374f83 _PyRun_AnyFileObject + 67
71 0x5617f8367a5e Py_RunMain + 702
72 0x5617f833e02d Py_BytesMain + 45
73 0x7f148ebe2d90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f148ebe2d90]
74 0x7f148ebe2e40 __libc_start_main + 128
75 0x5617f833df25 _start + 37
[e62a70965c65:02001] *** Process received signal ***
[e62a70965c65:02001] Signal: Aborted (6)
[e62a70965c65:02001] Signal code: (-6)
[e62a70965c65:02001] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f148ebfb520]
[e62a70965c65:02001] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f148ec4f9fc]
[e62a70965c65:02001] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f148ebfb476]
[e62a70965c65:02001] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f148ebe17f3]
[e62a70965c65:02001] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f13ec476b9e]
[e62a70965c65:02001] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f13ec48220c]
[e62a70965c65:02001] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f13ec4811e9]
[e62a70965c65:02001] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f13ec481959]
[e62a70965c65:02001] [ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f148e8eb884]
[e62a70965c65:02001] [ 9] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7f148e8ec2dd]
[e62a70965c65:02001] [10] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x691118)[0x7f120b497118]
[e62a70965c65:02001] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins24GPTAttentionPluginCommon10initializeEv+0x19f)[0x7f11deca9eef]
[e62a70965c65:02001] [12] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZNK12tensorrt_llm7plugins24GPTAttentionPluginCommon9cloneImplINS0_18GPTAttentionPluginEEEPT_v+0x23d)[0x7f11decd1e6d]
[e62a70965c65:02001] [13] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0xae3279)[0x7f12dfbd1279]
[e62a70965c65:02001] [14] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0xa3102e)[0x7f12dfb1f02e]
[e62a70965c65:02001] [15] /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xdfcef)[0x7f128b6dfcef]
[e62a70965c65:02001] [16] /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x43443)[0x7f128b643443]
[e62a70965c65:02001] [17] /usr/bin/python3(+0x15a10e)[0x5617f827010e]
[e62a70965c65:02001] [18] /usr/bin/python3(_PyObject_MakeTpCall+0x25b)[0x5617f8266a7b]
[e62a70965c65:02001] [19] /usr/bin/python3(+0x168acb)[0x5617f827eacb]
[e62a70965c65:02001] [20] /usr/bin/python3(_PyEval_EvalFrameDefault+0x614a)[0x5617f825ecfa]
[e62a70965c65:02001] [21] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x5617f82709fc]
[e62a70965c65:02001] [22] /usr/bin/python3(PyObject_Call+0x122)[0x5617f827f492]
[e62a70965c65:02001] [23] /usr/bin/python3(_PyEval_EvalFrameDefault+0x2a27)[0x5617f825b5d7]
[e62a70965c65:02001] [24] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x5617f82709fc]
[e62a70965c65:02001] [25] /usr/bin/python3(PyObject_Call+0x122)[0x5617f827f492]
[e62a70965c65:02001] [26] /usr/bin/python3(_PyEval_EvalFrameDefault+0x2a27)[0x5617f825b5d7]
[e62a70965c65:02001] [27] /usr/bin/python3(+0x1687f1)[0x5617f827e7f1]
[e62a70965c65:02001] [28] /usr/bin/python3(PyObject_Call+0x122)[0x5617f827f492]
[e62a70965c65:02001] [29] /usr/bin/python3(_PyEval_EvalFrameDefault+0x2a27)[0x5617f825b5d7]
[e62a70965c65:02001] *** End of error message ***
Aborted (core dumped)
Additional notes
None
Please set --context_fmha disable when building the engine, because the fused MHA kernel is not supported on Turing GPUs; the build will fall back to the unfused kernels.
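For reference, the rebuild from the reproduction steps above with that flag applied would look like the following; only --context_fmha disable is new, the checkpoint and output directories are unchanged:

    trtllm-build --checkpoint_dir llama-2-7b-ckpt \
                 --gemm_plugin float16 \
                 --context_fmha disable \
                 --output_dir ./llama-2-7b-engine

The engine should still build and run; the context phase just uses the unfused attention kernels, which is slower but functionally equivalent.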
Thanks, now it converts successfully! However, when I run run.py against the new engine, I get a new error:

root@81b5288ae872:/TensorRT-LLM/examples/llama# python3 ../run.py --engine_dir llama-2-7b-engine --max_output_len 100 --tokenizer_dir meta-llama/Llama-2-7b-chat-hf --input_text "How do I count to nine in French?"
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024043000
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024043000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
Traceback (most recent call last):
File "/TensorRT-LLM/examples/llama/../run.py", line 564, in
It looks like TRT-LLM cannot load your engine successfully. Could you try rebuilding the repo and running the end-to-end workflow again?
Also, could you share the engine build log? There might be an issue during the engine build.
Context FMHA doesn't support the GeForce 2080 Ti, whose SM version is sm_75:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp#L91
You may build the engine either with context FMHA disabled (--context_fmha disable) or on other supported hardware.
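To check up front whether a given GPU will hit this assertion, here is a minimal sketch, assuming PyTorch (which TensorRT-LLM already depends on) is importable; the sm_80 cutoff is an assumption based on Turing being rejected here:

    # Query GPU 0's compute capability; the fused context MHA kernels reject
    # Turing (sm_75), per the fmhaRunner.cpp check linked above.
    import torch

    major, minor = torch.cuda.get_device_capability(0)  # (7, 5) on a 2080 Ti
    print(f"GPU 0 is sm_{major}{minor}")
    if (major, minor) < (8, 0):  # assumed cutoff: pre-Ampere architectures
        print("Build with: trtllm-build ... --context_fmha disable")

On the 2080 Ti from this report it would print sm_75 and recommend the flag.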