
kv_cache_reuse breaking on AWQ quantized model

Open Bhuvanesh09 opened this issue 1 year ago • 2 comments

System Info

  • X86_64
  • RAM: 30 GB
  • GPU: A10G, VRAM: 23GB
  • Lib: TensorRT-LLM v0.9.0
  • Container Used: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
  • Model used: Mistral 7B

Who can help?

@Tracin , @kaiyux , @byshiue

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Quantized with the command:

python ../quantization/quantize.py --model_dir <model_fp16> \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --output_dir <model_repo> \
                                   --calib_size 32

trtllm-build --checkpoint_dir <model_repo> \
             --output_dir <engine_repo> \
             --gemm_plugin float16 \
             --use_paged_context_fmha enable \
             --max_input_len 4000 \
             --max_output_len 400 \
             --max_batch_size 12
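
The fill_template.py commands below assume a Triton model repository named test-model copied from the tensorrtllm_backend templates. A minimal sketch of that setup, assuming the standard tensorrtllm_backend layout; the branch name and placeholder paths are assumptions:

# Copy the inflight-batcher templates into a working repo (branch assumed to match the backend version)
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
cp -r all_models/inflight_batcher_llm test-model

# Variables referenced by the fill_template.py commands below (placeholders)
export TOK_PATH=<model_fp16>      # HF model directory containing the tokenizer
export ENGINE_PATH=<engine_repo>  # output directory of trtllm-build above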

Started the model with the following configuration:

python3 tools/fill_template.py -i test-model/preprocessing/config.pbtxt tokenizer_dir:${TOK_PATH},triton_max_batch_size:12,preprocessing_instance_count:1
python3 tools/fill_template.py -i test-model/postprocessing/config.pbtxt tokenizer_dir:${TOK_PATH},triton_max_batch_size:12,postprocessing_instance_count:1
python3 tools/fill_template.py -i test-model/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:12,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i test-model/ensemble/config.pbtxt triton_max_batch_size:12
python3 tools/fill_template.py -i test-model/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:12,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:52000,max_attention_window_size:4096,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
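
With the templates filled, the server is then launched; a minimal sketch, assuming the standard tensorrtllm_backend launch script:

# Start Triton with the filled-in model repository (single GPU, so world_size 1)
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=test-model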

How to get the error:

When tested with a semaphore of 10 (ensuring that 10 requests are always pending at the server), we get the following error after a few successful predictions:

[TensorRT-LLM][ERROR] Encountered error for requestId 991505662: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: blockedTokens.size() <= blockIds.size() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:568)
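
For reference, keeping ten requests in flight can be approximated with a simple concurrent curl loop; a sketch, assuming the default ensemble generate endpoint and its text_input/max_tokens field names, which may differ in your setup:

# Send 200 requests with at most 10 running concurrently against the ensemble generate endpoint
seq 1 200 | xargs -P 10 -I{} curl -s -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "Summarize the history of GPUs in detail.", "max_tokens": 400}'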


Expected behavior

The model should continue to serve requests without any issues.

actual behavior

We get the following error from the Triton server:

[TensorRT-LLM][ERROR] Encountered error for requestId 991505662: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: blockedTokens.size() <= blockIds.size() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:568)

additional notes

Information which might help in debugging:

Requests get dropped and the server stops working only once the initial KV cache is entirely full. The server appears unable to evict the least-recently-used KV cache blocks in paged attention as it is supposed to. This is supported by the fact that the server runs without any issues when enable_kv_cache_reuse is disabled.

Bhuvanesh09, Jul 03 '24 11:07

@Tracin Could you please take a look? Thanks.

BTW, @Bhuvanesh09 Could you please try the main branch? Or the 0.10.0 release branch?

QiJune, Jul 04 '24 02:07

@Bhuvanesh09 I think kv_cache_reuse is orthogonal to AWQ quantization. To narrow down the issue, could you try with a full-precision model?

Tracin, Jul 05 '24 04:07
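
For anyone trying Tracin's suggestion, a full-precision engine for comparison could be built roughly as follows; a sketch based on the examples/llama checkpoint-conversion flow (which also covers Mistral), with placeholder paths and the same runtime flags as the AWQ build above:

# Convert the HF checkpoint without quantization, then build with identical flags
python ../llama/convert_checkpoint.py --model_dir <model_fp16> \
                                      --dtype float16 \
                                      --output_dir <fp16_ckpt_repo>

trtllm-build --checkpoint_dir <fp16_ckpt_repo> \
             --output_dir <fp16_engine_repo> \
             --gemm_plugin float16 \
             --use_paged_context_fmha enable \
             --max_input_len 4000 \
             --max_output_len 400 \
             --max_batch_size 12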

@Bhuvanesh09 if you have no further questions, we will close this issue in one week.

hello-11, Nov 14 '24 06:11