Encounter CUDA error when increasing the length of input_ids
System Info
- GPU: A800
- GPU memory: 80 GB
- TensorRT-LLM: 0.8.0
- CUDA: 12.1
- OS: Ubuntu
Who can help?
@byshiue @kaiyux
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
It is difficult to reproduce this situation exactly, but I can describe it as follows:
I quantized a LLaMA model with AWQ and an INT8 KV cache, and built it with the following parameters (a rough sketch of the commands follows the list):
- max_batch_size : 1
- max_input_len: 256000
- max_output_len: 64
- max_num_tokens: 256000
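For context, the quantization and build steps look roughly like the sketch below. It follows the `examples/quantization` + `trtllm-build` workflow from the 0.8.0 release as I understand it; the model and output paths are placeholders, and some flag names may differ between versions.

```python
# Sketch of the quantization + build steps (paths are placeholders; flag names
# follow the 0.8.0 examples/quantization and trtllm-build workflow and may differ).
import subprocess

# AWQ weight quantization with an INT8 KV cache.
subprocess.run([
    "python", "examples/quantization/quantize.py",
    "--model_dir", "/models/llama-hf",            # placeholder HF checkpoint
    "--dtype", "float16",
    "--qformat", "int4_awq",
    "--kv_cache_dtype", "int8",
    "--output_dir", "/models/llama-awq-ckpt",
], check=True)

# Engine build with the limits listed above.
subprocess.run([
    "trtllm-build",
    "--checkpoint_dir", "/models/llama-awq-ckpt",
    "--output_dir", "/engines/llama-awq-int8kv",
    "--max_batch_size", "1",
    "--max_input_len", "256000",
    "--max_output_len", "64",
    "--max_num_tokens", "256000",
], check=True)
```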
When I run this built engine with input_ids of shape [1, 190000], it works well.
But when I increase the input_ids shape to [1, 200000] or more, it raises a CUDA error.
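The inference side is driven roughly as in the sketch below, which follows the `ModelRunner` pattern from `examples/run.py`; the engine path and the synthetic input_ids are placeholders, and argument names may vary between releases.

```python
# Rough sketch of how the engine is driven (ModelRunner pattern from examples/run.py;
# engine path and synthetic input_ids are placeholders).
import torch
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir(engine_dir="/engines/llama-awq-int8kv", rank=0)

seq_len = 200_000                                   # 190_000 works, 200_000 fails
input_ids = torch.randint(0, 32_000, (seq_len,), dtype=torch.int32)

output_ids = runner.generate(
    batch_input_ids=[input_ids],                    # batch size 1, as the engine was built
    max_new_tokens=64,                              # matches max_output_len
    end_id=2,
    pad_id=2,
)
print(output_ids.shape)
```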
Expected behavior
Works well with input_ids of shape [1, 200000+].
Actual behavior
Additional notes
If I build the model with the following parameters and run inference with 200000+ input_ids, it raises a different CUDA error (the full command is assembled after the list):
--gemm_plugin float16
--context_fmha_fp32_acc enable
--remove_input_padding disable
--multi_block_mode enable
--max_batch_size 1
--max_input_len 200100
--max_output_len 64
--max_beam_width 1
--max_num_tokens 201100
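Assembled into a single command, that configuration looks roughly like this (checkpoint and output paths are placeholders):

```python
# The second build configuration as one trtllm-build invocation
# (checkpoint and output paths are placeholders).
import subprocess

subprocess.run([
    "trtllm-build",
    "--checkpoint_dir", "/models/llama-awq-ckpt",
    "--output_dir", "/engines/llama-awq-200k",
    "--gemm_plugin", "float16",
    "--context_fmha_fp32_acc", "enable",
    "--remove_input_padding", "disable",
    "--multi_block_mode", "enable",
    "--max_batch_size", "1",
    "--max_input_len", "200100",
    "--max_output_len", "64",
    "--max_beam_width", "1",
    "--max_num_tokens", "201100",
], check=True)
```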
Any help would be appreciated.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
It looks like the issue is caused by a shape overflow in TensorRT. We are working on it.
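For a rough sense of where a 32-bit shape overflow could come from: TensorRT indexes tensor volumes with INT32, so at around 200k context tokens a per-token activation can cross 2^31 elements. The numbers below assume a LLaMA-7B-like FFN intermediate size of 11008; the actual overflowing tensor is not identified in this thread.

```python
# Back-of-the-envelope check of the suspected INT32 volume overflow.
# Assumption: a LLaMA-7B-like model with FFN intermediate size 11008; the actual
# overflowing tensor inside the engine is not identified in this thread.
INT32_MAX = 2**31 - 1        # TensorRT limits a tensor's element count to INT32
ffn_hidden = 11008           # LLaMA-7B MLP intermediate size (assumption)

for num_tokens in (190_000, 200_000):
    volume = num_tokens * ffn_hidden   # elements in a [num_tokens, ffn_hidden] activation
    print(num_tokens, volume, "overflows" if volume > INT32_MAX else "fits")
# 190000 -> 2,091,520,000 (fits); 200000 -> 2,201,600,000 (overflows)
```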
@1649759610 Do you still have this question? If not, we will close the issue soon.