
gptSessionBenchmark Failed Because of "Assertion failed: d == a + length" with 0.7.1 Release in tritonserver:23.12-trtllm-python-py3 Image

Open taozhang9527 opened this issue 1 year ago • 11 comments

I am trying to replicate the benchmark by following the official guide for Llama2-7B with the latest release, 0.7.1, and the Triton Server image 23.12-trtllm-python-py3 on a single H100 GPU.

Engine build command (following the official guide):

python examples/llama/build.py \
	--remove_input_padding \
	--enable_context_fmha \
	--parallel_build \
	--output_dir examples/llama/out/7b/fp16_1gpu/ \
	--dtype float16 \
	--use_gpt_attention_plugin float16 \
	--world_size 1 \
	--tp_size 1 \
	--pp_size 1 \
	--max_batch_size 64 \
	--max_input_len 2048 \
	--max_output_len 2048 \
	--enable_fp8 \
	--fp8_kv_cache \
	--strongly_typed \
	--n_layer 32 \
	--n_head 32 \
	--n_embd 4096 \
	--inter_size 11008 \
	--vocab_size 32000 \
	--n_positions 4096 \
	--hidden_act silu

Benchmark command: ./cpp/build/benchmarks/gptSessionBenchmark --model llama --engine_dir examples/llama/out/7b/fp16_1gpu/ --batch_size "1" --input_output_len "512, 200"

Error logs:

...
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:418)
1       0x7fff5fed512f /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x4512f) [0x7fff5fed512f]
2       0x7fff5ff41846 tensorrt_llm::plugins::GPTAttentionPluginCommon::GPTAttentionPluginCommon(void const*, unsigned long) + 870
3       0x7fff5ff588b3 tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(void const*, unsigned long) + 19
4       0x7fff5ff58932 tensorrt_llm::plugins::GPTAttentionPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 50
5       0x7fff1afef8a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7fff1afef8a6]
6       0x7fff1afe766e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7fff1afe766e]
7       0x7fff1af82217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7fff1af82217]
8       0x7fff1af8019e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7fff1af8019e]
9       0x7fff1af97c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7fff1af97c2b]
10      0x7fff1af9ae32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7fff1af9ae32]
11      0x7fff1af9b20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7fff1af9b20c]
12      0x7fff1afce9b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7fff1afce9b1]
13      0x7fff1afcf777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7fff1afcf777]
14      0x7fffa8713d22 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, nvinfer1::ILogger&) + 482
15      0x7fffa86d03fb tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 667
16      0x55555556c275 ./cpp/build/benchmarks/gptSessionBenchmark(+0x18275) [0x55555556c275]
17      0x7fff5fa3ad90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fff5fa3ad90]
18      0x7fff5fa3ae40 __libc_start_main + 128
19      0x55555556f765 ./cpp/build/benchmarks/gptSessionBenchmark(+0x1b765) [0x55555556f765]

[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:418)
1       0x7fff5fed512f /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so.9(+0x4512f) [0x7fff5fed512f]
2       0x7fff5ff41846 tensorrt_llm::plugins::GPTAttentionPluginCommon::GPTAttentionPluginCommon(void const*, unsigned long) + 870
3       0x7fff5ff588b3 tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(void const*, unsigned long) + 19
4       0x7fff5ff58932 tensorrt_llm::plugins::GPTAttentionPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 50
5       0x7fff1afef8a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7fff1afef8a6]
6       0x7fff1afe766e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7fff1afe766e]
7       0x7fff1af82217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7fff1af82217]
8       0x7fff1af8019e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7fff1af8019e]
9       0x7fff1af97c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7fff1af97c2b]
10      0x7fff1af9ae32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7fff1af9ae32]
11      0x7fff1af9b20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7fff1af9b20c]
12      0x7fff1afce9b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7fff1afce9b1]
13      0x7fff1afcf777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7fff1afcf777]
14      0x7fffa8713d22 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, nvinfer1::ILogger&) + 482
15      0x7fffa86d03fb tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 667
16      0x55555556c275 ./cpp/build/benchmarks/gptSessionBenchmark(+0x18275) [0x55555556c275]
17      0x7fff5fa3ad90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fff5fa3ad90]
18      0x7fff5fa3ae40 __libc_start_main + 128
19      0x55555556f765 ./cpp/build/benchmarks/gptSessionBenchmark(+0x1b765) [0x55555556f765]
[28791db46ff1:44226:0:44226] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:  44226) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000010d475f createInferRuntime_INTERNAL()  ???:0
 2 0x000000000107dd42 getInferLibVersion()  ???:0
 3 0x00000000010808ec getInferLibVersion()  ???:0
 4 0x0000000001081e32 getInferLibVersion()  ???:0
 5 0x000000000108220c getInferLibVersion()  ???:0
 6 0x00000000010b59b1 createInferRuntime_INTERNAL()  ???:0
 7 0x00000000010b6777 createInferRuntime_INTERNAL()  ???:0
 8 0x0000000001fc9d22 tensorrt_llm::runtime::TllmRuntime::TllmRuntime()  ???:0
 9 0x0000000001f863fb tensorrt_llm::runtime::GptSession::GptSession()  ???:0
10 0x0000000000018275 main()  ???:0
11 0x0000000000029d90 __libc_init_first()  ???:0
12 0x0000000000029e40 __libc_start_main()  ???:0
13 0x000000000001b765 _start()  ???:0
=================================

=================================
[28791db46ff1:44226] *** Process received signal ***
[28791db46ff1:44226] Signal: Segmentation fault (11)
[28791db46ff1:44226] Signal code:  (-6)
[28791db46ff1:44226] Failing at address: 0xacc2
[28791db46ff1:44226] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fff5fa53520]
[28791db46ff1:44226] [ 1] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d475f)[0x7fff1afed75f]
[28791db46ff1:44226] [ 2] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107dd42)[0x7fff1af96d42]
[28791db46ff1:44226] [ 3] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10808ec)[0x7fff1af998ec]
[28791db46ff1:44226] [ 4] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32)[0x7fff1af9ae32]
[28791db46ff1:44226] [ 5] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c)[0x7fff1af9b20c]
[28791db46ff1:44226] [ 6] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1)[0x7fff1afce9b1]
[28791db46ff1:44226] [ 7] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777)[0x7fff1afcf777]
[28791db46ff1:44226] [ 8] /opt/tritonserver/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime11TllmRuntimeC2EPKvmRN8nvinfer17ILoggerE+0x1e2)[0x7fffa8713d22]
[28791db46ff1:44226] [ 9] /opt/tritonserver/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime10GptSessionC1ERKNS1_6ConfigERKNS0_14GptModelConfigERKNS0_11WorldConfigEPKvmSt10shared_ptrIN8nvinfer17ILoggerEE+0x29b)[0x7fffa86d03fb]
[28791db46ff1:44226] [10] ./cpp/build/benchmarks/gptSessionBenchmark(+0x18275)[0x55555556c275]
[28791db46ff1:44226] [11] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fff5fa3ad90]
[28791db46ff1:44226] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fff5fa3ae40]
[28791db46ff1:44226] [13] ./cpp/build/benchmarks/gptSessionBenchmark(+0x1b765)[0x55555556f765]
[28791db46ff1:44226] *** End of error message ***
Segmentation fault (core dumped)

This seems to be a different issue from #656, even though it is the same experiment.

taozhang9527 avatar Jan 04 '24 23:01 taozhang9527

From my experience, the following combination works well (see the sketch after the list):

  • Build engine using trtllm 0.6.1
  • Triton Server 23.11-trtllm-python-py3 to serve the engine
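
For example (a sketch only; the v0.6.1 tag and the 23.11 image name are taken from the bullets above):

git clone -b v0.6.1 https://github.com/NVIDIA/TensorRT-LLM.git     # build the engine from this tag
docker pull nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3    # serve the resulting engine with this image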

wangyubo111 avatar Jan 05 '24 00:01 wangyubo111

Yes, the previous version works. This appears to be a new issue introduced in the 0.7.1 release.

taozhang9527 avatar Jan 05 '24 00:01 taozhang9527

@taozhang9527 Try recompiling TensorRT-LLM and the Triton TensorRT-LLM backend, replace libtriton_tensorrtllm.so in the /opt/tritonserver/backends/tensorrtllm directory, and delete the corresponding libnvinfer_plugin_tensorrt_llm.so* files.

It works for me.
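
Roughly, the replacement looks like this (a sketch only; the source paths are placeholders for wherever your rebuilt backend and plugin libraries end up):

# copy the rebuilt backend library over the one shipped in the image
cp /path/to/rebuilt/libtriton_tensorrtllm.so /opt/tritonserver/backends/tensorrtllm/
# delete the stale plugin libraries, as suggested above
rm /opt/tritonserver/backends/tensorrtllm/libnvinfer_plugin_tensorrt_llm.so*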

wjj19950828 avatar Jan 05 '24 03:01 wjj19950828

This often happens because the TensorRT-LLM versions used to build the engine and to run the backend are different. By default, 23.12-trtllm-python-py3 ships v0.7.0 rather than v0.7.1.

So please try wjj19950828's suggestion.
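
A quick way to confirm which version each side is on (a sketch; run it in both the engine-build container and the serving container) is to print the installed package version:

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

The engine-build environment and the backend should report the same version.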

byshiue avatar Jan 05 '24 03:01 byshiue

@wjj19950828 Which version of the tensorrtllm folder did you use for the replacement? I used the folder from 23.11-trtllm-python-py3 to replace the one in 23.12-trtllm-python-py3 and still get the same error. Here is what I have after the replacement:

root@xxxxx:/opt/tritonserver/backends/tensorrtllm# ll
total 683980
drwxrwxrwx 1 triton-server triton-server       267 Jan  5 18:30 ./
drwxrwxrwx 1 triton-server triton-server        33 Dec 15 20:22 ../
-rw-rw-rw- 1 root          root          599044272 Dec 15 18:31 libnvinfer_plugin_tensorrt_llm.so
lrwxrwxrwx 1 root          root                 33 Nov 21 01:10 libnvinfer_plugin_tensorrt_llm.so.9 -> libnvinfer_plugin_tensorrt_llm.so
lrwxrwxrwx 1 root          root                 33 Nov 21 01:10 libnvinfer_plugin_tensorrt_llm.so.9.1.0 -> libnvinfer_plugin_tensorrt_llm.so
-rwxr-xr-x 1 root          root           31755152 Dec 15 18:31 libth_common.so*
-rw-rw-rw- 1 root          root           69590528 Nov 21 01:10 libtriton_tensorrtllm.so

@byshiue Is there any plan to fix this? The workaround seems to be only a temporary solution, right?

taozhang9527 avatar Jan 05 '24 18:01 taozhang9527

We have been discussing how to support compatibility between different versions, but we cannot provide a timeline yet.

byshiue avatar Jan 08 '24 03:01 byshiue

@taozhang9527 What settings worked for you? Do you know whether the previous version worked, or did you copy the file (per @wjj19950828's suggestion)? I don't know where to get that file, as I can't find the path mentioned in the other issue you created.

rbgo404 avatar Jan 11 '24 19:01 rbgo404

Hi there, I'm facing the same issue and am fairly new to the whole build process. What is the recommended way to verify one's own TensorRT-LLM / Triton Inference Server / tensorrtllm_backend versions? I'm building within the pre-built Docker image tensorrt_llm/release:latest and then attempting to serve with nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3, which seems to match the left column of the framework matrix.

Separately, I checked out the commit tagged v0.7.0 and ran the build commands within these Docker images to ensure that I'm on v0.7.0 (which the framework matrix lists for 23.12), but I still get these errors. If it helps, these are the commands I'm running:

docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864  \
    --gpus=all \
    --volume $HOME/TensorRT-LLM:/code/tensorrt_llm \
    --volume /models:/models \
    --env "CCACHE_DIR=/code/tensorrt_llm/cpp/.ccache" \
    --env "CCACHE_BASEDIR=/code/tensorrt_llm" \
    --workdir /code/tensorrt_llm \
    --hostname ip-172-31-43-170-release \
    --name tensorrt_llm-release-ubuntu \
    --tmpfs /tmp:exec \
    tensorrt_llm/release:latest

# edit this to point to your model path
export HF_LLAMA_MODEL=/models/mymodel
export ENGINE_DIR=/models/trt_engines

# build the engine and put it in /models/trt_engines
python3 examples/llama/build.py \
    --model_dir ${HF_LLAMA_MODEL} \
    --dtype float16 \
    --world_size 4 \
    --tp_size 4 \
    --pp_size 1 \
    --parallel_build \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --remove_input_padding \
    --use_inflight_batching \
    --paged_kv_cache \
    --max_batch_size 8 \
    --output_dir /models/trt_engines

# enter the triton-inference-server docker container
docker run \
  --rm \
  -it \
  --net host \
  --shm-size=2g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --gpus all \
  -v ${HOME}/tensorrtllm_backend:/tensorrtllm_backend \
  -v /models:/models \
  nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3 bash

pip install protobuf sentencepiece

cd /tensorrtllm_backend
cp all_models/inflight_batcher_llm/ llama_ifb -r
export HF_LLAMA_MODEL=/models/mymodel
export ENGINE_DIR=/models/trt_engines

python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:8,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:8,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:8
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_DIR},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600

# launch, but this line fails.
python3 scripts/launch_triton_server.py --world_size=4 --model_repo=llama_ifb

Hope this helps with reproducing

wongjingping avatar Jan 12 '24 10:01 wongjingping

@wongjingping Could you include the scripts you used to clone the repo, check out the branch, and build the Docker image so that we can reproduce your issue?

byshiue avatar Jan 15 '24 01:01 byshiue

Hi @byshiue, sorry, I realised I made a mistake: I did not check out the right tag in the TensorRT-LLM repository and ended up building from the wrong branch. Checking out v0.7.0 for TensorRT-LLM worked.
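
For anyone hitting the same mistake, the checkout I should have done before building was roughly this (tag name as listed on the releases page):

cd ~/TensorRT-LLM
git fetch --tags
git checkout v0.7.0
git submodule update --init --recursive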

wongjingping avatar Jan 26 '24 06:01 wongjingping

According to the release notes (https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-24-01.html), 24.01 should be compatible with TensorRT-LLM v0.7.1.

Be sure to use the v0.7.2 TensorRT-LLM backend (see the sketch below).
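
For example (a sketch; the 24.01 image tag and the backend tag follow from the notes above):

docker pull nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
git clone -b v0.7.2 https://github.com/triton-inference-server/tensorrtllm_backend.git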

enochlev avatar Feb 09 '24 17:02 enochlev