Encounter CUDA error when increasing the length of input_ids
System Info
- GPU: A800
- GPU memory: 80 GB
- TensorRT-LLM: 0.8.0
- CUDA: 12.1
- OS: Ubuntu
Who can help?
@byshiue @kaiyux
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
It is difficult to reproduce this situation, but I can describe it as follows:
I quantized a LLaMA model with AWQ and an INT8 KV cache, and built it with the following parameters (a rough build command is sketched after the list):
- max_batch_size : 1
- max_input_len: 256000
- max_output_len: 64
- max_num_tokens: 256000
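For reference, a build command along these lines should reproduce the engine. This is a minimal sketch: the checkpoint and output directories are placeholders, and the checkpoint is assumed to come from the AWQ + INT8 KV cache quantization flow mentioned above.

```bash
# Sketch of the engine build; --checkpoint_dir points at the AWQ/INT8-KV quantized
# checkpoint and --output_dir is where the engine is written (both are placeholders).
trtllm-build \
    --checkpoint_dir ./llama-awq-int8kv-ckpt \
    --output_dir ./llama-engine \
    --max_batch_size 1 \
    --max_input_len 256000 \
    --max_output_len 64 \
    --max_num_tokens 256000
```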
When I run the built engine with input_ids of shape [1, 190000], it works well.
But when I increase the input_ids shape to [1, 200000] or beyond, it raises a CUDA exception.
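For context, the engine is run roughly like this (a sketch based on the examples/run.py script shipped with TensorRT-LLM; the engine/tokenizer paths and the pre-tokenized input file are placeholders, and feeding the prompt through --input_file is an assumption about how the long sequence is passed in):

```bash
# Sketch of the inference call; input.npy holds the pre-tokenized prompt,
# e.g. an int32 array of shape [1, 190000] or [1, 200000] (paths are placeholders).
python examples/run.py \
    --engine_dir ./llama-engine \
    --tokenizer_dir ./llama-hf-model \
    --input_file ./input.npy \
    --max_output_len 64
```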
Expected behavior
It should work well with input_ids of shape [1, 200000+].
actual behavior
additional notes
If I build the model with the following parameters and run inference with 200000+ input_ids, it raises a different CUDA error:
--gemm_plugin float16
--context_fmha_fp32_acc enable
--remove_input_padding disable
--multi_block_mode enable
--max_batch_size 1
--max_input_len 200100
--max_output_len 64
--max_beam_width 1
--max_num_tokens 201100
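Assembled into a single command, that second configuration looks roughly like this (same placeholder checkpoint and output directories as in the sketch above):

```bash
# Second build configuration, assembled into one trtllm-build call (paths are placeholders).
trtllm-build \
    --checkpoint_dir ./llama-awq-int8kv-ckpt \
    --output_dir ./llama-engine-200k \
    --gemm_plugin float16 \
    --context_fmha_fp32_acc enable \
    --remove_input_padding disable \
    --multi_block_mode enable \
    --max_batch_size 1 \
    --max_input_len 200100 \
    --max_output_len 64 \
    --max_beam_width 1 \
    --max_num_tokens 201100
```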
Any help would be appreciated.