Encounter CUDA error when increasing the length of input_ids
System Info
- GPU: A800
- GPU memory: 80 GB
- TensorRT-LLM: 0.8.0
- CUDA: 12.1
- OS: Ubuntu
Who can help?
@byshiue @kaiyux
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
It is difficult to reproduce this situation exactly, but I can describe it as follows:
I quantized a LLaMA model with AWQ and an INT8 KV cache, and built it with the following parameters (a rough sketch of the commands follows the list):
- max_batch_size : 1
- max_input_len: 256000
- max_output_len: 64
- max_num_tokens: 256000
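For context, the quantization and build steps look roughly like the sketch below. It follows the `examples/quantization` + `trtllm-build` workflow from the 0.8.0 release as I understand it; the model and output paths are placeholders, and some flag names may differ between versions.

```python
# Sketch of the quantization + build steps (paths are placeholders; flag names
# follow the 0.8.0 examples/quantization and trtllm-build workflow and may differ).
import subprocess

# AWQ weight quantization with an INT8 KV cache.
subprocess.run([
    "python", "examples/quantization/quantize.py",
    "--model_dir", "/models/llama-hf",            # placeholder HF checkpoint
    "--dtype", "float16",
    "--qformat", "int4_awq",
    "--kv_cache_dtype", "int8",
    "--output_dir", "/models/llama-awq-ckpt",
], check=True)

# Engine build with the limits listed above.
subprocess.run([
    "trtllm-build",
    "--checkpoint_dir", "/models/llama-awq-ckpt",
    "--output_dir", "/engines/llama-awq-int8kv",
    "--max_batch_size", "1",
    "--max_input_len", "256000",
    "--max_output_len", "64",
    "--max_num_tokens", "256000",
], check=True)
```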
When I run this built engine with input_ids of shape [1, 190000], it works well.
But when I increase the input_ids shape to [1, 200000] or more, it raises a CUDA error.
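The inference side is driven roughly as in the sketch below, which follows the `ModelRunner` pattern from `examples/run.py`; the engine path and the synthetic input_ids are placeholders, and argument names may vary between releases.

```python
# Rough sketch of how the engine is driven (ModelRunner pattern from examples/run.py;
# engine path and synthetic input_ids are placeholders).
import torch
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir(engine_dir="/engines/llama-awq-int8kv", rank=0)

seq_len = 200_000                                   # 190_000 works, 200_000 fails
input_ids = torch.randint(0, 32_000, (seq_len,), dtype=torch.int32)

output_ids = runner.generate(
    batch_input_ids=[input_ids],                    # batch size 1, as the engine was built
    max_new_tokens=64,                              # matches max_output_len
    end_id=2,
    pad_id=2,
)
print(output_ids.shape)
```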
Expected behavior
Works well with input_ids of shape [1, 200000+].
Actual behavior
Additional notes
If I build the model with the following parameters and run inference with 200000+ input_ids, it raises a different CUDA error (the full command is assembled after the list):
--gemm_plugin float16
--context_fmha_fp32_acc enable
--remove_input_padding disable
--multi_block_mode enable
--max_batch_size 1
--max_input_len 200100
--max_output_len 64
--max_beam_width 1
--max_num_tokens 201100
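Assembled into a single command, that configuration looks roughly like this (checkpoint and output paths are placeholders):

```python
# The second build configuration as one trtllm-build invocation
# (checkpoint and output paths are placeholders).
import subprocess

subprocess.run([
    "trtllm-build",
    "--checkpoint_dir", "/models/llama-awq-ckpt",
    "--output_dir", "/engines/llama-awq-200k",
    "--gemm_plugin", "float16",
    "--context_fmha_fp32_acc", "enable",
    "--remove_input_padding", "disable",
    "--multi_block_mode", "enable",
    "--max_batch_size", "1",
    "--max_input_len", "200100",
    "--max_output_len", "64",
    "--max_beam_width", "1",
    "--max_num_tokens", "201100",
], check=True)
```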
Any help would be appreciated.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
It looks like the issue is caused by a shape overflow in TensorRT. We are working on it.
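For a rough sense of where a 32-bit shape overflow could come from: TensorRT indexes tensor volumes with INT32, so at around 200k context tokens a per-token activation can cross 2^31 elements. The numbers below assume a LLaMA-7B-like FFN intermediate size of 11008; the actual overflowing tensor is not identified in this thread.

```python
# Back-of-the-envelope check of the suspected INT32 volume overflow.
# Assumption: a LLaMA-7B-like model with FFN intermediate size 11008; the actual
# overflowing tensor inside the engine is not identified in this thread.
INT32_MAX = 2**31 - 1        # TensorRT limits a tensor's element count to INT32
ffn_hidden = 11008           # LLaMA-7B MLP intermediate size (assumption)

for num_tokens in (190_000, 200_000):
    volume = num_tokens * ffn_hidden   # elements in a [num_tokens, ffn_hidden] activation
    print(num_tokens, volume, "overflows" if volume > INT32_MAX else "fits")
# 190000 -> 2,091,520,000 (fits); 200000 -> 2,201,600,000 (overflows)
```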
@1649759610 Do you still have this question? If not, we will close the issue soon.