
When a model has layers with and without the GPT attention plugin enabled, GptSession raises an error

Open · llan-ml opened this issue 1 year ago · 1 comment

System Info

TensorRT-LLM: latest main branch built in the triton-trtllm container (23.12) GPU: V100

Who can help?

@byshiue

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. Copy the official Attention class and modify it slightly to disable the GPT attention plugin, as follows:
class CustomAttention(Attention):
    def forward(...):
        ...
        # Force the non-plugin code path:
        if False and default_net().plugin_config.gpt_attention_plugin:
        ...
  2. Modify the GPT-NeoX decoder layer so that only the last layer uses CustomAttention (plugin disabled):
class GPTNeoXDecoderLayer(Module):
    def __init__(...):
        ...
        if layer_idx == config.num_hidden_layers - 1:
            attn_cls = CustomAttention
        else:
            attn_cls = Attention
        self.attention = attn_cls(
        ...
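The per-layer selection in steps 1–2 can be sketched as a standalone snippet (plain Python, no tensorrt_llm imports; the `Attention`/`CustomAttention` classes below are stand-ins for the real ones, and `pick_attention_classes` is a hypothetical helper added for illustration):

```python
# Stand-in for tensorrt_llm.layers.Attention (plugin path enabled).
class Attention:
    plugin_enabled = True

# Stand-in for the CustomAttention subclass above (plugin path disabled).
class CustomAttention(Attention):
    plugin_enabled = False

def pick_attention_classes(num_hidden_layers):
    # Last decoder layer gets CustomAttention, all earlier layers keep Attention,
    # mirroring the layer_idx check in GPTNeoXDecoderLayer.__init__ above.
    return [
        CustomAttention if layer_idx == num_hidden_layers - 1 else Attention
        for layer_idx in range(num_hidden_layers)
    ]

classes = pick_attention_classes(44)  # GPT-NeoX-20B has 44 hidden layers
print(classes[0].__name__)   # Attention
print(classes[-1].__name__)  # CustomAttention
```

The result is an engine in which 43 layers take the plugin code path and one layer takes the non-plugin path.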
  3. Build the engine:
trtllm-build --checkpoint_dir ./trt_ckpt/fp16/4p/ \
             --use_gemm_plugin float16 \
             --use_gpt_attention_plugin float16 \
             --max_batch_size 1 \
             --max_input_len 2048 \
             --max_output_len 1024 \
             --workers 4 \
             --output_dir ./trt_engines/fp16/4p_customattn/
  4. Run the engine with GptSession and with the Python session:
mpirun -n 4 --allow-run-as-root \
    python ../run.py --max_output_len=20 \
                     --engine_dir=./trt_engines/fp16/4p_customattn/ \
                     --tokenizer_dir=./gpt-neox-20b

mpirun -n 4 --allow-run-as-root \
    python ../run.py --max_output_len=20 \
                     --engine_dir=./trt_engines/fp16/4p_customattn/ \
                     --tokenizer_dir=./gpt-neox-20b \
                     --use_py_session

Expected behavior

The engine can be used with both GptSession and the Python session (--use_py_session).

Actual behavior

The engine builds successfully and runs normally with the Python session, producing the following output:

Input [Text 0]: "Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " chef in the kitchens of the Chateaux of the Loire, and then in the kitchen"

However, a RuntimeError is raised when GptSession is used:

[TensorRT-LLM][ERROR] 7: [shapeMachine.cpp::executeContinuation::887] Error Code 7: Internal Error (Dimensions with name past_key_len must be equal. Condition '==' violated: 32 != 44. Instruction: CHECK_EQUAL 32 44.)
[TensorRT-LLM][ERROR] 7: [shapeMachine.cpp::executeContinuation::887] Error Code 7: Internal Error (Dimensions with name past_key_len must be equal. Condition '==' violated: 32 != 44. Instruction: CHECK_EQUAL 32 44.)
[TensorRT-LLM][ERROR] 7: [shapeMachine.cpp::executeContinuation::887] Error Code 7: Internal Error (Dimensions with name past_key_len must be equal. Condition '==' violated: 32 != 44. Instruction: CHECK_EQUAL 32 44.)
[TensorRT-LLM][ERROR] 7: [shapeMachine.cpp::executeContinuation::887] Error Code 7: Internal Error (Dimensions with name past_key_len must be equal. Condition '==' violated: 32 != 44. Instruction: CHECK_EQUAL 32 44.)
Traceback (most recent call last):
  File "/home/ma-user/work/TensorRT-LLM/examples/gptneox/../run.py", line 496, in <module>
    main(args)
  File "/home/ma-user/work/TensorRT-LLM/examples/gptneox/../run.py", line 374, in main
Traceback (most recent call last):
  File "/home/ma-user/work/TensorRT-LLM/examples/gptneox/../run.py", line 496, in <module>
    outputs = runner.generate(
  File "/home/ma-user/.local/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 338, in generate
    self.session.generate(generation_output, generation_input,
RuntimeError: Invalid input shape (/home/ma-user/work/TensorRT-LLM/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:180)
1       0x7f6b8a7ec6a3 /home/ma-user/.local/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x9c6a3) [0x7f6b8a7ec6a3]
2       0x7f6b8a88f1f0 tensorrt_llm::runtime::GptSession::executeGenerationStep(int, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<int, std::allocator<int> > const&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, std::vector<bool, std::allocator<bool> >&) + 992
3       0x7f6b8a891148 tensorrt_llm::runtime::GptSession::generateBatched(std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, tensorrt_llm::runtime::SamplingConfig const&, std::function<void (int, bool)> const&) + 3368
4       0x7f6b8a8929f8 tensorrt_llm::runtime::GptSession::generate(tensorrt_llm::runtime::GenerationOutput&, tensorrt_llm::runtime::GenerationInput const&, tensorrt_llm::runtime::SamplingConfig const&) + 3080
5       0x7f6b8a824459 /home/ma-user/.local/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xd4459) [0x7f6b8a824459]
6       0x7f6b8a80c47e /home/ma-user/.local/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xbc47e) [0x7f6b8a80c47e]
7       0x5609c0fe310e python(+0x15a10e) [0x5609c0fe310e]
8       0x5609c0fd9a7b _PyObject_MakeTpCall + 603
9       0x5609c0ff1acb python(+0x168acb) [0x5609c0ff1acb]
10      0x5609c0fd1cfa _PyEval_EvalFrameDefault + 24906
11      0x5609c0ff17f1 python(+0x1687f1) [0x5609c0ff17f1]
12      0x5609c0ff2492 PyObject_Call + 290
13      0x5609c0fce5d7 _PyEval_EvalFrameDefault + 10791
14      0x5609c0fe39fc _PyFunction_Vectorcall + 124
15      0x5609c0fcc26d _PyEval_EvalFrameDefault + 1725
16      0x5609c0fc89c6 python(+0x13f9c6) [0x5609c0fc89c6]
17      0x5609c10be256 PyEval_EvalCode + 134
18      0x5609c10e9108 python(+0x260108) [0x5609c10e9108]
19      0x5609c10e29cb python(+0x2599cb) [0x5609c10e29cb]
20      0x5609c10e8e55 python(+0x25fe55) [0x5609c10e8e55]
21      0x5609c10e8338 _PyRun_SimpleFileObject + 424
22      0x5609c10e7f83 _PyRun_AnyFileObject + 67
23      0x5609c10daa5e Py_RunMain + 702
24      0x5609c10b102d Py_BytesMain + 45
25      0x7f6cedafdd90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6cedafdd90]
26      0x7f6cedafde40 __libc_start_main + 128
27      0x5609c10b0f25 _start + 37
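A plausible reading of the failure (an assumption on my part, not confirmed against the TensorRT-LLM sources): GptSession sizes the past key/value inputs assuming every attention layer takes the same code path, so the single non-plugin layer presents a past_key_len that disagrees with the other layers' profile (here 32 vs. 44), and TensorRT's shape machine rejects the execution. A minimal standalone illustration of that kind of cross-layer consistency check (`check_past_key_len` is a hypothetical function, not TensorRT-LLM code):

```python
def check_past_key_len(layer_dims):
    # All layers sharing a dimension named past_key_len must agree on its value,
    # analogous to TensorRT's CHECK_EQUAL on named dimensions.
    expected = layer_dims[0]
    for dim in layer_dims[1:]:
        if dim != expected:
            raise RuntimeError(
                f"Dimensions with name past_key_len must be equal. "
                f"Condition '==' violated: {expected} != {dim}."
            )

plugin_dims = [44] * 43  # dimension seen by the 43 plugin-enabled layers (assumed)
custom_dims = [32]       # dimension produced by the non-plugin layer (assumed)

try:
    check_past_key_len(plugin_dims + custom_dims)
except RuntimeError as exc:
    print(exc)
```

Under this reading, the fix would be to make all layers agree on the KV-cache layout, i.e. either enable or disable the GPT attention plugin for every layer uniformly.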

llan-ml avatar Jan 26 '24 15:01 llan-ml

@QiJune any updates on this?

poweiw avatar May 16 '25 20:05 poweiw

@llan-ml , Apologies for the very delayed response. Is this ticket still relevant? If so, could you try the latest version to see if the issue persists?

karljang avatar Oct 21 '25 05:10 karljang

Issue has not received an update in over 14 days. Adding stale label.

github-actions[bot] avatar Nov 05 '25 03:11 github-actions[bot]

Closing issue as stale; please feel free to open a new one if the problem persists.

karljang avatar Nov 14 '25 17:11 karljang