
CUDA error when increasing the length of input_ids

Opened by 1649759610 · 10 months ago · 2 comments

System Info

  • GPU: A800
  • GPU memory: 80 GB
  • TensorRT-LLM: 0.8.0
  • CUDA: 12.1
  • OS: Ubuntu

Who can help?

@byshiue @kaiyux

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

It is difficult to reproduce this situation exactly, but I can describe it as follows:

I quantized a LLaMA model with AWQ and INT8 KV cache, and built the engine with the following parameters (a sketch of the corresponding build command appears after this list):

  • max_batch_size: 1
  • max_input_len: 256000
  • max_output_len: 64
  • max_num_tokens: 256000
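For context, the build invocation was along the following lines. This is a sketch rather than the exact command: the checkpoint and output paths are placeholders, and the checkpoint is assumed to have already been quantized with AWQ weights and an INT8 KV cache (e.g. via the quantization example in the repo).

    # Sketch of the build command (paths are hypothetical).
    # ./ckpt_awq_int8kv is assumed to hold the LLaMA checkpoint already
    # quantized with AWQ weights and an INT8 KV cache.
    trtllm-build \
        --checkpoint_dir ./ckpt_awq_int8kv \
        --output_dir ./engine_256k \
        --max_batch_size 1 \
        --max_input_len 256000 \
        --max_output_len 64 \
        --max_num_tokens 256000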

When I run the built engine with input_ids of shape [1, 190000], it works well. But when I increase the input_ids shape to [1, 200000] or beyond, it raises a CUDA error (see the attached screenshot).
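Inference is invoked roughly as below, using the stock examples/run.py script from the 0.8.0 release; long_input.npy is a hypothetical file holding a single row of ~200,000 pre-tokenized input IDs, so treat the exact flags as an assumption about my setup rather than a verified command.

    # Sketch of the inference invocation (file names are hypothetical).
    # long_input.npy contains one row of ~200,000 token IDs.
    python3 examples/run.py \
        --engine_dir ./engine_256k \
        --tokenizer_dir ./llama_hf \
        --input_file long_input.npy \
        --max_output_len 64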

Expected behavior

The engine should work correctly with input_ids of shape [1, 200000+].

actual behavior

A CUDA error is raised (see the attached screenshot).

additional notes

If I build the model with the following parameters and run inference with 200000+ input_ids, it raises a different CUDA error (the flags are assembled into a single command in the sketch after this list):

  • --gemm_plugin float16
  • --context_fmha_fp32_acc enable
  • --remove_input_padding disable
  • --multi_block_mode enable
  • --max_batch_size 1
  • --max_input_len 200100
  • --max_output_len 64
  • --max_beam_width 1
  • --max_num_tokens 201100
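Assembled into a single invocation, that flag set looks roughly like this; the checkpoint and output paths are again placeholders, not the actual directories used.

    # Sketch of the second build configuration (paths are hypothetical).
    trtllm-build \
        --checkpoint_dir ./ckpt_awq_int8kv \
        --output_dir ./engine_200k \
        --gemm_plugin float16 \
        --context_fmha_fp32_acc enable \
        --remove_input_padding disable \
        --multi_block_mode enable \
        --max_batch_size 1 \
        --max_input_len 200100 \
        --max_output_len 64 \
        --max_beam_width 1 \
        --max_num_tokens 201100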

A different CUDA error is raised (see the attached screenshot).

Any help would be appreciated.

1649759610 · Apr 08 '24 09:04