Question about the relation between `--max_input_len`/`--max_output_len` at build time and `--input_output_len` at benchmark time
System Info
NVIDIA H20 (97871 MiB) × 8
TensorRT-LLM 0.9.0
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Model: Llama2-70b-chat-hf; 8 GPUs, 1 node
```bash
python3 convert_checkpoint.py --model_dir /TensorRT-LLM/Llama2-70b-chat-hf \
--output_dir /TensorRT-LLM/examples/llama70b/tllm_checkpoint_8gpu_tp8 \
--dtype float16 \
--tp_size 8

trtllm-build --checkpoint_dir /TensorRT-LLM/examples/llama70b/tllm_checkpoint_8gpu_tp8 \
--output_dir /TensorRT-LLM/examples/llama/tmp/llama/70B/trt_engines/fp16/8gpu_tp8 \
--gemm_plugin float16 \
--max_batch_size 16 \
--max_input_len=4009 \
--max_output_len=4009
```
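To double-check which limits actually got baked into the engine, the build configuration can be inspected before benchmarking. A minimal sketch, assuming the 0.9.0 layout where `trtllm-build` writes a `config.json` with a `build_config` section into the engine directory (field names may differ across versions):

```python
import json

# Engine directory from the trtllm-build command above.
engine_dir = "/TensorRT-LLM/examples/llama/tmp/llama/70B/trt_engines/fp16/8gpu_tp8"

with open(f"{engine_dir}/config.json") as f:
    cfg = json.load(f)

# Assumed 0.9.0 layout: build-time limits live under "build_config".
build = cfg.get("build_config", cfg)
for key in ("max_batch_size", "max_input_len", "max_output_len"):
    print(key, "=", build.get(key))
```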
```bash
mpirun -n 8 --allow-run-as-root \
python3 /TensorRT-LLM/benchmarks/python/benchmark.py \
--model llama_70b \
--mode plugin \
--batch_size "1;4;8;16" \
--csv \
--engine_dir /TensorRT-LLM/examples/llama/tmp/llama/70B/trt_engines/fp16/8gpu_tp8 \
--input_output_len "2048,2048"
```
When building the engine with trtllm-build I set `--max_input_len=4009` and `--max_output_len=4009`, but running /TensorRT-LLM/benchmarks/python/benchmark.py with `--input_output_len "2048,2048"` fails:
```
Traceback (most recent call last):
  File "/TensorRT-LLM/benchmarks/python/benchmark.py", line 412, in <module>
    main(args)
  File "/TensorRT-LLM/benchmarks/python/benchmark.py", line 371, in main
    e.with_traceback())
TypeError: BaseException.with_traceback() takes exactly one argument (0 given)
[04/26/2024-04:57:33] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
[04/26/2024-04:57:33] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
Traceback (most recent call last):
  File "/TensorRT-LLM/benchmarks/python/benchmark.py", line 346, in main
    benchmarker.run(inputs, config)
  File "/TensorRT-LLM/benchmarks/python/gpt_benchmark.py", line 220, in run
    self.decoder.decode_batch(inputs[0],
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2803, in decode_batch
    return self.decode(input_ids,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 789, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2993, in decode
    return self.decode_regular(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2642, in decode_regular
    should_stop, next_step_tensors, tasks, context_lengths, host_context_lengths, attention_mask, context_logits, generation_logits, encoder_input_lengths = self.handle_per_step(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2334, in handle_per_step
    raise RuntimeError(f"Executing TRT engine failed step={step}!")
RuntimeError: Executing TRT engine failed step=0!
```
The build-time limit of 4009 is greater than 2048, so why does this error occur? `--input_output_len "2048,128"` works, but `"2048,2048"` fails.
What is the relationship between `max_input_len`, `max_output_len`, and `input_output_len`?
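My understanding (an assumption on my part, not confirmed by the docs): `--max_input_len` and `--max_output_len` fix the upper bounds of the engine's optimization profiles at build time, while `--input_output_len` only selects the concrete lengths each benchmark run uses, so a run should be valid whenever `input_len <= max_input_len` and `output_len <= max_output_len`. Under that rule both cases fit, as the check below shows:

```python
# Per-dimension bound check (the rule itself is my assumption about how the
# build-time limits constrain runtime shapes).
MAX_INPUT_LEN = MAX_OUTPUT_LEN = 4009  # from the trtllm-build flags above

for in_len, out_len in [(2048, 128), (2048, 2048)]:
    fits = in_len <= MAX_INPUT_LEN and out_len <= MAX_OUTPUT_LEN
    print(f"input={in_len} output={out_len} "
          f"total={in_len + out_len} -> within limits: {fits}")
```

Both cases pass, so a plain length-bound violation does not explain why "2048,2048" fails while "2048,128" succeeds.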
Expected behavior
The benchmark runs successfully, since both requested lengths (2048) are within the build-time limits (4009).
actual behavior
Engine execution fails at step 0 with the traceback above.
additional notes
What is the relationship between `max_input_len`, `max_output_len`, and `input_output_len`?
Same error.
Same error here, and it only occurs when batch=16.
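Since it only shows up at batch=16, a resource limit seems plausible, but a back-of-the-envelope KV-cache estimate (a sketch assuming Llama2-70B's published shape: 80 layers, 8 KV heads under GQA, head dim 128, fp16 cache) suggests memory alone is not the obvious culprit:

```python
# Rough fp16 KV-cache footprint for Llama2-70B at batch 16, 2048 in + 2048 out.
layers, kv_heads, head_dim = 80, 8, 128                 # assumed Llama2-70B config
bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K and V, 2 bytes each
batch, seq_len, tp = 16, 2048 + 2048, 8

total_gib = batch * seq_len * bytes_per_token / 2**30
print(f"KV cache: {total_gib:.0f} GiB total, {total_gib / tp:.1f} GiB per GPU")
# -> 20 GiB total, 2.5 GiB per GPU: small next to ~96 GiB H20s, so a plain
#    KV-cache OOM looks unlikely; the TRT "allInputDimensionsSpecified" error
#    may point at a shape/profile problem instead.
```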