TensorRT-LLM
gptSessionBenchmark Failed Because of " Assertion failed: d == a + length " with 0.7.1 Release in tritonserver:23.12-trtllm-python-py3 Image
I am trying to replicate the benchmark by following the official guide for Llama2-7b with the latest release 0.7.1 and the Triton server image 23.12-trtllm-python-py3 on a single H100 GPU.
Build engine command (followed the official guide):
python examples/llama/build.py \
--remove_input_padding \
--enable_context_fmha \
--parallel_build \
--output_dir examples/llama/out/7b/fp16_1gpu/ \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--world_size 1 \
--tp_size 1 \
--pp_size 1 \
--max_batch_size 64 \
--max_input_len 2048 \
--max_output_len 2048 \
--enable_fp8 \
--fp8_kv_cache \
--strongly_typed \
--n_layer 32 \
--n_head 32 \
--n_embd 4096 \
--inter_size 11008 \
--vocab_size 32000 \
--n_positions 4096 \
--hidden_act silu
Benchmark command:
./cpp/build/benchmarks/gptSessionBenchmark --model llama --engine_dir examples/llama/out/7b/fp16_1gpu/ --batch_size "1" --input_output_len "512, 200"
Error logs:
...
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:418)
1 0x7fff5fed512f /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x4512f) [0x7fff5fed512f]
2 0x7fff5ff41846 tensorrt_llm::plugins::GPTAttentionPluginCommon::GPTAttentionPluginCommon(void const*, unsigned long) + 870
3 0x7fff5ff588b3 tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(void const*, unsigned long) + 19
4 0x7fff5ff58932 tensorrt_llm::plugins::GPTAttentionPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 50
5 0x7fff1afef8a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7fff1afef8a6]
6 0x7fff1afe766e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7fff1afe766e]
7 0x7fff1af82217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7fff1af82217]
8 0x7fff1af8019e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7fff1af8019e]
9 0x7fff1af97c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7fff1af97c2b]
10 0x7fff1af9ae32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7fff1af9ae32]
11 0x7fff1af9b20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7fff1af9b20c]
12 0x7fff1afce9b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7fff1afce9b1]
13 0x7fff1afcf777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7fff1afcf777]
14 0x7fffa8713d22 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, nvinfer1::ILogger&) + 482
15 0x7fffa86d03fb tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 667
16 0x55555556c275 ./cpp/build/benchmarks/gptSessionBenchmark(+0x18275) [0x55555556c275]
17 0x7fff5fa3ad90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fff5fa3ad90]
18 0x7fff5fa3ae40 __libc_start_main + 128
19 0x55555556f765 ./cpp/build/benchmarks/gptSessionBenchmark(+0x1b765) [0x55555556f765]
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:418)
1 0x7fff5fed512f /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x4512f) [0x7fff5fed512f]
2 0x7fff5ff41846 tensorrt_llm::plugins::GPTAttentionPluginCommon::GPTAttentionPluginCommon(void const*, unsigned long) + 870
3 0x7fff5ff588b3 tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(void const*, unsigned long) + 19
4 0x7fff5ff58932 tensorrt_llm::plugins::GPTAttentionPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 50
5 0x7fff1afef8a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7fff1afef8a6]
6 0x7fff1afe766e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7fff1afe766e]
7 0x7fff1af82217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7fff1af82217]
8 0x7fff1af8019e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7fff1af8019e]
9 0x7fff1af97c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7fff1af97c2b]
10 0x7fff1af9ae32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7fff1af9ae32]
11 0x7fff1af9b20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7fff1af9b20c]
12 0x7fff1afce9b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7fff1afce9b1]
13 0x7fff1afcf777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7fff1afcf777]
14 0x7fffa8713d22 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, nvinfer1::ILogger&) + 482
15 0x7fffa86d03fb tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 667
16 0x55555556c275 ./cpp/build/benchmarks/gptSessionBenchmark(+0x18275) [0x55555556c275]
17 0x7fff5fa3ad90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fff5fa3ad90]
18 0x7fff5fa3ae40 __libc_start_main + 128
19 0x55555556f765 ./cpp/build/benchmarks/gptSessionBenchmark(+0x1b765) [0x55555556f765]
[28791db46ff1:44226:0:44226] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 44226) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x00000000010d475f createInferRuntime_INTERNAL() ???:0
2 0x000000000107dd42 getInferLibVersion() ???:0
3 0x00000000010808ec getInferLibVersion() ???:0
4 0x0000000001081e32 getInferLibVersion() ???:0
5 0x000000000108220c getInferLibVersion() ???:0
6 0x00000000010b59b1 createInferRuntime_INTERNAL() ???:0
7 0x00000000010b6777 createInferRuntime_INTERNAL() ???:0
8 0x0000000001fc9d22 tensorrt_llm::runtime::TllmRuntime::TllmRuntime() ???:0
9 0x0000000001f863fb tensorrt_llm::runtime::GptSession::GptSession() ???:0
10 0x0000000000018275 main() ???:0
11 0x0000000000029d90 __libc_init_first() ???:0
12 0x0000000000029e40 __libc_start_main() ???:0
13 0x000000000001b765 _start() ???:0
=================================
[28791db46ff1:44226] *** Process received signal ***
[28791db46ff1:44226] Signal: Segmentation fault (11)
[28791db46ff1:44226] Signal code: (-6)
[28791db46ff1:44226] Failing at address: 0xacc2
[28791db46ff1:44226] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fff5fa53520]
[28791db46ff1:44226] [ 1] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d475f)[0x7fff1afed75f]
[28791db46ff1:44226] [ 2] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107dd42)[0x7fff1af96d42]
[28791db46ff1:44226] [ 3] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10808ec)[0x7fff1af998ec]
[28791db46ff1:44226] [ 4] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32)[0x7fff1af9ae32]
[28791db46ff1:44226] [ 5] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c)[0x7fff1af9b20c]
[28791db46ff1:44226] [ 6] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1)[0x7fff1afce9b1]
[28791db46ff1:44226] [ 7] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777)[0x7fff1afcf777]
[28791db46ff1:44226] [ 8] /opt/tritonserver/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime11TllmRuntimeC2EPKvmRN8nvinfer17ILoggerE+0x1e2)[0x7fffa8713d22]
[28791db46ff1:44226] [ 9] /opt/tritonserver/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime10GptSessionC1ERKNS1_6ConfigERKNS0_14GptModelConfigERKNS0_11WorldConfigEPKvmSt10shared_ptrIN8nvinfer17ILoggerEE+0x29b)[0x7fffa86d03fb]
[28791db46ff1:44226] [10] ./cpp/build/benchmarks/gptSessionBenchmark(+0x18275)[0x55555556c275]
[28791db46ff1:44226] [11] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fff5fa3ad90]
[28791db46ff1:44226] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fff5fa3ae40]
[28791db46ff1:44226] [13] ./cpp/build/benchmarks/gptSessionBenchmark(+0x1b765)[0x55555556f765]
[28791db46ff1:44226] *** End of error message ***
Segmentation fault (core dumped)
This seems to be a different issue from #656 even though it is the same experiment.
From my experience, the following combination works well (sketched below):
- Build the engine with TRT-LLM 0.6.1
- Serve the engine with Triton Server 23.11-trtllm-python-py3
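For reference, a minimal sketch of that combination (the tag and image name are taken from this thread; verify them against the release pages before using):
git -C TensorRT-LLM checkout v0.6.1   # build the engine from the 0.6.1 release of TensorRT-LLM
docker pull nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3   # serve the engine with the matching Triton image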
Yes, the previous version works. This seems to be a new issue introduced in the 0.7.1 release.
@taozhang9527 Try recompiling TRT-LLM and the Triton trt_llm backend, then replace libtriton_tensorrtllm.so in the /opt/tritonserver/backends/tensorrtllm directory and delete the corresponding libnvinfer_plugin_tensorrt_llm.so* files, roughly as sketched below.
It works for me~
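A rough sketch of that workaround (the build output path below is hypothetical; point it at wherever your own rebuild places the library):
# copy the freshly rebuilt backend library over the one shipped in the container
cp /path/to/your/tensorrtllm_backend/build/libtriton_tensorrtllm.so /opt/tritonserver/backends/tensorrtllm/
# remove the stale plugin libraries so the rebuilt ones are picked up instead
rm /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so*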
This is often because the TRT-LLM versions of the engine and the backend are different. By default, 23.12-trtllm-python-py3 installs v0.7.0 instead of v0.7.1.
So, please try the suggestion of @wjj19950828.
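One quick way to compare the two versions (a sketch; it assumes the tensorrt_llm Python package is importable both in the environment used to build the engine and inside the serving container):
# run this in the engine-build environment and again inside the Triton container;
# the two printed versions must match exactly (e.g. both 0.7.0)
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"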
@wjj19950828 Which version of the tensorrtllm folder did you use for the replacement?
I am using the folder from 23.11-trtllm-python-py3 to replace the one in 23.12-trtllm-python-py3, and it fails with the same error. Here is what I have after the replacement:
root@xxxxx:/opt/tritonserver/backends/tensorrtllm# ll
total 683980
drwxrwxrwx 1 triton-server triton-server 267 Jan 5 18:30 ./
drwxrwxrwx 1 triton-server triton-server 33 Dec 15 20:22 ../
-rw-rw-rw- 1 root root 599044272 Dec 15 18:31 libnvinfer_plugin_tensorrt_llm.so
lrwxrwxrwx 1 root root 33 Nov 21 01:10 libnvinfer_plugin_tensorrt_llm.so.9 -> libnvinfer_plugin_tensorrt_llm.so
lrwxrwxrwx 1 root root 33 Nov 21 01:10 libnvinfer_plugin_tensorrt_llm.so.9.1.0 -> libnvinfer_plugin_tensorrt_llm.so
-rwxr-xr-x 1 root root 31755152 Dec 15 18:31 libth_common.so*
-rw-rw-rw- 1 root root 69590528 Nov 21 01:10 libtriton_tensorrtllm.so
@byshiue Is there any plan to fix this in the future? The work-around seems to be only a temporary solution here, right?
We have been discussing how to support compatibility between different versions, but we cannot provide a timeline now.
@taozhang9527 What settings worked for you? Do you know if the previous version worked, or did you copy the file (per @wjj19950828's suggestion)? I don't know where to get that file, as I can't find the path in the other issue you created.
Hi there, I'm facing the same issue and am fairly new to the whole build process. What is the recommended way to verify one's own TensorRT-LLM / triton-inference-server / tensorrtllm_backend versions? I'm building within the pre-built docker image tensorrt_llm/release:latest and then attempting to serve with nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3, which seem to be the versions in the left column of the frameworks matrix.
Separately, I checked out a specific commit with the 0.7.0 version tag and ran the build commands within these docker images to ensure that I'm on v0.7.0 (based on what the framework matrix states is the version for 23.12), but I still get these errors. If it helps, these are the commands I'm running:
docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus=all \
--volume $HOME/TensorRT-LLM:/code/tensorrt_llm \
--volume /models:/models \
--env "CCACHE_DIR=/code/tensorrt_llm/cpp/.ccache" \
--env "CCACHE_BASEDIR=/code/tensorrt_llm" \
--workdir /code/tensorrt_llm \
--hostname ip-172-31-43-170-release \
--name tensorrt_llm-release-ubuntu \
--tmpfs /tmp:exec \
tensorrt_llm/release:latest
# edit this to point to your model path
export HF_LLAMA_MODEL=/models/mymodel
export ENGINE_DIR=/models/trt_engines
# build the engine and put it in /models/trt_engines
python3 examples/llama/build.py \
--model_dir ${HF_LLAMA_MODEL} \
--dtype float16 \
--world_size 4 \
--tp_size 4 \
--pp_size 1 \
--parallel_build \
--use_gpt_attention_plugin float16 \
--use_gemm_plugin float16 \
--remove_input_padding \
--use_inflight_batching \
--paged_kv_cache \
--max_batch_size 8 \
--output_dir /models/trt_engines
# enter the triton-inference-server docker container
docker run \
--rm \
-it \
--net host \
--shm-size=2g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--gpus all \
-v ${HOME}/tensorrtllm_backend:/tensorrtllm_backend \
-v /models:/models \
nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3 bash
pip install protobuf sentencepiece
cd /tensorrtllm_backend
cp all_models/inflight_batcher_llm/ llama_ifb -r
export HF_LLAMA_MODEL=/models/mymodel
export ENGINE_DIR=/models/trt_engines
python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:8,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:8,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:8
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_DIR},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
# launch, but this line fails.
python3 scripts/launch_triton_server.py --world_size=4 --model_repo=llama_ifb
Hope this helps with reproducing the issue.
@wongjingping Could you include the scripts for cloning the repo, checking out the branch, and building the docker image, to make sure we can reproduce your issue?
Hi @byshiue, sorry, I realised I made a mistake by not checking out the right tag in the TensorRT-LLM repository and ended up building off the wrong branch. Checking out v0.7.0 for TensorRT-LLM worked.
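For anyone hitting the same mismatch, a minimal sketch of checking out the matching tag before building (the tag must match the TRT-LLM version bundled with your Triton image; per the discussion above, 23.12-trtllm-python-py3 ships v0.7.0):
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout v0.7.0                      # build the engine from the same release as the serving backend
git submodule update --init --recursive  # pull in the matching submodules before building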
According to https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-24-01.html, 24.01 should have v0.7.1 TensorRT-LLM compatibility.
Be sure to use the v0.7.2 TensorRT-LLM backend.
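If you go that route, a one-line sketch for pulling the newer image (tag assumed from the release notes linked above; confirm it on NGC before relying on it):
docker pull nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3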