[Feature Request] Support for kv_reuse with int8_kv_cache in FMHA

Open · StarrickLiu opened this issue 6 months ago · 1 comment

KV Cache Reuse and Int8 KV Cache Compatibility with Paged Context FMHA

In TensorRT-LLM v0.11, KV cache reuse and Int8 KV cache appear to be mutually exclusive: KV cache reuse requires enabling --use_paged_context_fmha, but paged context FMHA does not yet support Int8 KV cache.
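
To make the conflict concrete, here is a minimal sketch of the two constraints as a standalone pre-flight check. It is not TensorRT-LLM code: the `KvCacheOptions` class, the `find_conflicts` function, and the field name `enable_kv_cache_reuse` are hypothetical names chosen for illustration. Only `--use_paged_context_fmha` and the Int8 KV cache setting come from the behavior described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class KvCacheOptions:
    """Hypothetical container for the three settings discussed in this issue."""
    enable_kv_cache_reuse: bool   # assumed name for the KV cache reuse switch
    use_paged_context_fmha: bool  # mirrors the --use_paged_context_fmha flag
    int8_kv_cache: bool           # Int8-quantized KV cache


def find_conflicts(opts: KvCacheOptions) -> List[str]:
    """Return the constraints (as observed in v0.11) that these options violate."""
    problems = []
    # Constraint 1: KV cache reuse depends on paged context FMHA.
    if opts.enable_kv_cache_reuse and not opts.use_paged_context_fmha:
        problems.append("KV cache reuse requires paged context FMHA")
    # Constraint 2: paged context FMHA does not yet support Int8 KV cache.
    if opts.use_paged_context_fmha and opts.int8_kv_cache:
        problems.append("paged context FMHA does not support Int8 KV cache")
    return problems


if __name__ == "__main__":
    wanted = KvCacheOptions(
        enable_kv_cache_reuse=True,
        use_paged_context_fmha=True,
        int8_kv_cache=True,
    )
    for problem in find_conflicts(wanted):
        print("conflict:", problem)
```

With both optimizations requested, every way of setting the paged context FMHA flag violates one of the two constraints, which is why we currently have to pick one optimization.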

Request

Could you please look into supporting both KV cache reuse and Int8 KV cache with paged context FMHA? This would allow users to benefit from both optimizations simultaneously.

Additional Context

  1. FMHA is a closed-source operator, which means users cannot resolve this incompatibility issue themselves. We rely on the TensorRT-LLM team to address this limitation.

  2. If we remove the assertion in the TensorRT-LLM backend that prevents KV cache reuse and Int8 KV cache from being enabled together, the deployment crashes after processing approximately 3000 requests during stress testing (see the error log below).

Impact

This limitation significantly affects our large-scale deployments. Because KV cache reuse cannot be combined with Int8 KV cache, we have to choose between the two: either we give up the computation saved by reusing cached prefixes, or we give up the memory saved by the Int8 cache.

Thank you for your attention to this critical issue. We look forward to a solution that allows us to leverage both optimizations safely and effectively.

Error Log:

[TensorRT-LLM][DEBUG] MemoryPool: Requested to reserve 4 B (256 B aligned)
[TensorRT-LLM][DEBUG] MemoryPool: Requested to reserve 4 B (256 B aligned)
[TensorRT-LLM][DEBUG] MemoryPool: Requested to reserve 4 B (256 B aligned)
[TensorRT-LLM][DEBUG] MemoryPool: Requested to reserve 8 B (256 B aligned)
[TensorRT-LLM][DEBUG] MemoryPool: Requested to reserve 4 B (256 B aligned)
[TensorRT-LLM][DEBUG] MemoryPool: Requested to reserve 4 B (256 B aligned)
[TensorRT-LLM][DEBUG] MemoryPool: Requested to reserve 4 B (256 B aligned)
[TensorRT-LLM][DEBUG] MemoryPool: Requested to reserve 4 B (256 B aligned)
[TensorRT-LLM][DEBUG] MemoryPool: Requested to reserve 12316 B (12544 B aligned)
[TensorRT-LLM][DEBUG] MemoryPool: Requested to reserve 4 B (256 B aligned)
[TensorRT-LLM][DEBUG] MemoryPool: Requested to reserve 4 B (256 B aligned)
[TensorRT-LLM][DEBUG] MemoryPool: Requested to reserve 4 B (256 B aligned)
[TensorRT-LLM][DEBUG] MemoryPool: Requested to reserve 4 B (256 B aligned)
[TensorRT-LLM][DEBUG] Enqueuing 1 requests
 0# 0x00005573A628104D in /opt/tritonserver/bin/tritonserver
 1# 0x00007FF3B7D18520 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# pthread_kill in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# raise in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 5# 0x00007FF3B7FA1B9E in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007FF3B7FAD20C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007FF3B7FAC1E9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# __gxx_personality_v0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 9# 0x00007FF3B9C6B884 in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
10# _Unwind_Resume in /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
11# 0x00007FF2A888AD4F in /app/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.10
12# 0x00007FF2A88A70A6 in /app/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.10
13# 0x00007FF2A88ADEC2 in /app/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.10
14# tensorrt_llm::plugins::GPTAttentionPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) in /app/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.10
15# 0x00007FF275FCFA8C in /usr/local/tensorrt/lib/libnvinfer.so.10
16# 0x00007FF275F74657 in /usr/local/tensorrt/lib/libnvinfer.so.10
17# 0x00007FF275F760C1 in /usr/local/tensorrt/lib/libnvinfer.so.10
18# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) in /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so
19# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) in /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so
20# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) in /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so
21# tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) in /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so
22# tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) in /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so
23# tensorrt_llm::executor::Executor::Impl::executionLoop() in /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so
24# 0x00007FF3B7FDB253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
25# 0x00007FF3B7D6AAC3 in /usr/lib/x86_64-linux-gnu/libc.so.6
26# 0x00007FF3B7DFC850 in /usr/lib/x86_64-linux-gnu/libc.so.6

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
I0829 10:18:31.791832 15428 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.795030 15525 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.796618 15067 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.797395 15508 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.805726 15542 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.812080 15079 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.820024 15458 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.829598 16596 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.829779 16060 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.829986 16262 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.830455 16082 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.830639 15941 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.830752 16593 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.830858 16384 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.830951 15884 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.831156 16509 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.831303 16378 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.831582 16147 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.839674 15407 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.849488 15439 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.962322 15166 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.982569 15671 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.983360 15260 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.985603 15206 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.986189 15298 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.986575 15324 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.986885 15352 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.992544 15369 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:31.997749 15493 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:32.000468 15478 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:32.017681 15381 pb_stub.cc:2121]  Non-graceful termination detected. 
I0829 10:18:32.017732 15393 pb_stub.cc:2121]  Non-graceful termination detected. 
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node tensorrt-llm-test exited on signal 6 (Aborted).
--------------------------------------------------------------------------

StarrickLiu, Aug 29 '24 12:08