TensorRT-LLM
kv_cache_reuse breaking on AWQ-quantized model
System Info
- X86_64
- RAM: 30 GB
- GPU: A10G, VRAM: 23GB
- Lib: TensorRT-LLM v0.9.0
- Container Used: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
- Model used: Mistral 7B
Who can help?
@Tracin , @kaiyux , @byshiue
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Quantized with the following command:
python ../quantization/quantize.py --model_dir <model_fp16> \
--dtype float16 \
--qformat int4_awq \
--awq_block_size 128 \
--output_dir <model_repo> \
--calib_size 32
trtllm-build --checkpoint_dir <model_repo> \
--output_dir <engine_repo> \
--gemm_plugin float16 --use_paged_context_fmha enable --max_input_len 4000 --max_output_len 400 --max_batch_size 12
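Since kv cache reuse requires the paged KV cache and paged context FMHA settings to make it into the built engine, a quick sanity check is to dump any related flags from the engine's config.json. This is only a sketch; the exact key names and layout vary across TensorRT-LLM versions, so it just walks the JSON tree and prints anything mentioning "paged", "fmha", or "kv_cache" (the `<engine_repo>` path is the placeholder from the build command above):
```python
# Sketch: print build flags related to paged KV cache / context FMHA from the
# engine's config.json. Key names differ between TensorRT-LLM versions, so the
# whole JSON tree is walked instead of hard-coding paths.
import json

ENGINE_CONFIG = "<engine_repo>/config.json"  # placeholder path from trtllm-build

def walk(node, path=""):
    """Recursively print leaf values whose path mentions paged/fmha/kv_cache."""
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, f"{path}/{key}")
    elif isinstance(node, list):
        for idx, value in enumerate(node):
            walk(value, f"{path}/{idx}")
    elif any(token in path.lower() for token in ("paged", "fmha", "kv_cache")):
        print(f"{path} = {node}")

with open(ENGINE_CONFIG) as f:
    walk(json.load(f))
```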
Configured the Triton model repository with the following arguments:
python3 tools/fill_template.py -i test-model/preprocessing/config.pbtxt tokenizer_dir:${TOK_PATH},triton_max_batch_size:12,preprocessing_instance_count:1
python3 tools/fill_template.py -i test-model/postprocessing/config.pbtxt tokenizer_dir:${TOK_PATH},triton_max_batch_size:12,postprocessing_instance_count:1
python3 tools/fill_template.py -i test-model/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:12,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i test-model/ensemble/config.pbtxt triton_max_batch_size:12
python3 tools/fill_template.py -i test-model/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:12,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:52000,max_attention_window_size:4096,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
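Before launching tritonserver, it can help to confirm that the filled tensorrt_llm/config.pbtxt actually picked up the kv-cache-related parameters. A minimal sketch (the path matches the test-model repository used above):
```python
# Sketch: print the lines of the filled config.pbtxt that mention the
# kv-cache-related parameters set via fill_template.py above.
from pathlib import Path

CONFIG = Path("test-model/tensorrt_llm/config.pbtxt")
KEYS = (
    "enable_kv_cache_reuse",
    "max_tokens_in_paged_kv_cache",
    "max_attention_window_size",
    "kv_cache_free_gpu_mem_fraction",
)

lines = CONFIG.read_text().splitlines()
for i, line in enumerate(lines):
    if any(key in line for key in KEYS):
        # Print the matching line plus a little context, since the value often
        # sits on the following lines of the parameters block.
        print("\n".join(lines[i:i + 3]))
        print("---")
```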
How to get the error:
When tested with a semaphore of 10 (ensuring 10 requests are always pending at the server), we get the following error after a few successful predictions:
[TensorRT-LLM][ERROR] Encountered error for requestId 991505662: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: blockedTokens.size() <= blockIds.size() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:568)
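For reference, a client sketch that reproduces the load pattern described above (10 requests always in flight). It assumes the default tensorrtllm_backend ensemble input/output names (text_input, max_tokens, text_output) and Triton's default HTTP port; adjust these if your deployment differs:
```python
# Sketch: keep 10 requests pending at the server at all times, as in the test above.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient

URL = "localhost:8000"                        # assumption: default Triton HTTP endpoint
MODEL = "ensemble"
PROMPT = "Summarize the following text: ..."  # placeholder prompt
CONCURRENCY = 10
TOTAL_REQUESTS = 500

def one_request(i: int):
    # One client per call keeps the worker threads independent.
    client = httpclient.InferenceServerClient(url=URL)
    text = np.array([[f"{PROMPT} ({i})"]], dtype=object)
    max_tokens = np.array([[200]], dtype=np.int32)

    inputs = [
        httpclient.InferInput("text_input", [1, 1], "BYTES"),
        httpclient.InferInput("max_tokens", [1, 1], "INT32"),
    ]
    inputs[0].set_data_from_numpy(text)
    inputs[1].set_data_from_numpy(max_tokens)

    result = client.infer(MODEL, inputs)
    return result.as_numpy("text_output")

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    for i, output in enumerate(pool.map(one_request, range(TOTAL_REQUESTS))):
        print(i, None if output is None else output.flatten()[0][:80])
```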
Expected behavior
The model should continue to serve requests without any issues.
actual behavior
We get the following error from the Triton server:
[TensorRT-LLM][ERROR] Encountered error for requestId 991505662: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: blockedTokens.size() <= blockIds.size() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:568)
additional notes
Information which might help in debugging:
Requests get dropped and the server stops working only once the initial set of KV cache blocks is entirely full. The server is unable to evict the LRU KV cache blocks in paged attention, as it is supposed to do.
This is confirmed by the fact that the server runs without any issues when enable_kv_cache_reuse is set to off.
@Tracin Could you please take a look? Thanks.
BTW, @Bhuvanesh09 Could you please try the main branch? Or the 0.10.0 release branch?
@Bhuvanesh09 I think kv_cache_reuse is orthogonal to AWQ quantization. To narrow down the issue, could you try with a full-precision model?
@Bhuvanesh09 if you have no further questions, we will close this issue in one week.