TensorRT-LLM
[Question] Any documentation for the "prefix caching" feature?
Hi, thanks for your great work. According to issue #620, prefix caching is supported in TensorRT-LLM. I'd like to know whether there is any documentation for this feature. Thanks!
Here is some documentation: https://github.com/NVIDIA/TensorRT-LLM/blob/3d56a445e8ebf888e78be638faf6beec0a78f3c2/docs/source/batch_manager.md?plain=1#L137.
Thanks. How can I use this flag during LLaMA export and inference? It seems that no KvCacheConfig is set in build.py or ../run.py.
@byshiue Hi, I tested here:
https://github.com/NVIDIA/TensorRT-LLM/blob/3d56a445e8ebf888e78be638faf6beec0a78f3c2/tensorrt_llm/runtime/model_runner_cpp.py#L166
and added enable_block_reuse=True when building the KvCacheConfig, but profiling shows the same results for enable_block_reuse=True and enable_block_reuse=False. I tested the same prompt 10 times, with warmup taken into account. The model is Yi-6B-Chat and the script I used is run.py.
Can you help take a look? Thanks!
Does prefix caching need to be combined with a specific KV chunking method? For example, if I want to add enable_block_reuse=True, should I also enable paged_kv_cache when building?
I have the same situation: profiling shows the same prefill time for enable_block_reuse=True and enable_block_reuse=False.
To enable enable_block_reuse, you need to enable paged_kv_cache and use_paged_context_fmha.
Do you know how to use system prompt caching with Qwen1.5?
It is the same on all models. You need to enable paged_kv_cache and use_paged_context_fmha.
Hi, thanks for your reply. I set these flags in a scenario with ~200 identical prefix tokens (100 differing tokens, and about 400 output tokens). The QPS seems unchanged (before: 29.366 QPS, after: 29.585 QPS). I use ModelRunnerCpp based on TensorRT-LLM 0.9.0. Could you please help take a look at the performance? Thanks!
```
--paged_kv_cache "enable" \
--use_paged_context_fmha "enable" \
```
Whole conversion scripts:
```bash
python convert_checkpoint.py \
    --model_dir ${hf_model_dir} \
    --output_dir ${tmp_dir} \
    --dtype float16

trtllm-build \
    --checkpoint_dir ${tmp_dir} \
    --output_dir ${trt_model_dir} \
    --remove_input_padding "enable" \
    --gpt_attention_plugin "float16" \
    --context_fmha "enable" \
    --gemm_plugin "float16" \
    --max_batch_size 256 \
    --max_input_len 1024 \
    --max_output_len 800 \
    --paged_kv_cache "enable" \
    --use_paged_context_fmha "enable"
```
Any follow-up?
@littletomatodonkey Could you also share the scripts to run inference?
I think the above is happening because we need to set enableBlockReuse to True explicitly, but the flag is not exposed for the user to toggle.
@byshiue can you confirm if this is the case?
Hi @ekagra-ranjan, I fixed the code in TensorRT-LLM as follows, but the QPS does not change. It seems this is a problem, but not the key point.
```python
session_config.kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=free_gpu_memory_fraction,
    max_attention_window=max_attention_window_size,
    sink_token_length=sink_token_length,
    enable_block_reuse=True)
```
@byshiue Hi, this is my inference code:
```python
import time

import numpy as np
import torch

import tensorrt_llm
from tensorrt_llm.runtime import ModelRunnerCpp as ModelRunner
from tensorrt_llm.runtime import SamplingConfig


class TRTModel090(object):
    def __init__(self, config, *inputs, **kwargs):
        super().__init__()
        self.config = config  # needed by get_sample_config below
        self.world_size = tensorrt_llm.mpi_world_size()
        self.runtime_rank = tensorrt_llm.mpi_rank()
        runner_kwargs = dict(engine_dir=config.name_or_path,
                             rank=self.runtime_rank)
        self.runner = ModelRunner.from_dir(**runner_kwargs)

    def get_sample_config(self, batch_size: int, **kwargs):
        # Merge model config defaults with per-call overrides.
        kwargs = {
            **self.config.__dict__,
            **kwargs,
        }
        sampling_fields = SamplingConfig.__annotations__
        if kwargs.get('do_sample', False):
            # One random seed per sequence, broadcast so all ranks agree.
            seed = np.random.randint(low=0, high=2**31, size=(batch_size,))
            if self.world_size > 1:
                tensorrt_llm._utils.mpi_comm().Bcast(seed, root=0)
            seed = torch.tensor(seed)
        else:
            # Greedy decoding: disable top-p / top-k sampling.
            if "top_p" in kwargs:
                kwargs["top_p"] = 0.0
            if "top_k" in kwargs:
                kwargs["top_k"] = 0
            seed = None
        sampling_config = SamplingConfig(
            **{k: v for k, v in kwargs.items() if k in sampling_fields},
            end_id=kwargs['eos_token_id'],
            pad_id=kwargs['pad_token_id'],
        )
        sampling_config.random_seed = seed
        return sampling_config

    def generate(self, *args, **kwargs):
        input_ids = kwargs['input_ids']
        attention_mask = kwargs['attention_mask'].type(torch.bool)
        # Strip padding so each request contains only real tokens.
        batch_input_ids = [input_ids[idx][attention_mask[idx]]
                           for idx in range(input_ids.size(0))]
        sampling_config = self.get_sample_config(input_ids.size(0), **kwargs)
        st = time.time()
        with torch.no_grad():
            outputs = self.runner.generate(
                batch_input_ids,
                sampling_config=sampling_config,
                max_attention_window_size=None,
                sink_token_length=None,
                output_sequence_lengths=True,
                return_dict=True)
        torch.cuda.synchronize()
        output_ids = outputs["output_ids"]
        seq_lens = outputs["sequence_lengths"]
        ct = time.time() - st
        print(f"cost time: {ct:.4f} s")
        # Drop the prompt tokens and any padding beyond the generated length.
        output_ids = [
            output_id[0][len(batch_input_ids[idx]):seq_len.item()]
            for idx, (output_id, seq_len) in enumerate(zip(output_ids, seq_lens))]
        outputs = {
            "input_ids": input_ids,
            "output_ids": output_ids,
        }
        return outputs
```
Hey, this looks like the right set of changes. Would you be able to re-try with a longer input and a shorter output length (ideally a gen length of 1, to isolate)? Reuse only applies to the context (prefill) phase, and it could be that your runtime is dominated by the generation phase, so any gains are not noticeable.
@schetlur-nv Thanks for your reply. I tried output len = 1, input len = 400 (with a shared prefix of 200), but the QPS seems the same when I turn on the prefix cache setting (140 QPS vs 139 QPS).
@schetlur-nv - I have an application where, given a context ABC and output_len of 3, I want to generate one token at a time and then call the LLM again with a fresh prompt containing the new token, 3 times, i.e.,
request(ABC) -> llm -> D
request(ABCD) -> llm -> E
request(ABCDE) -> llm -> F
This is a bit different from the normal execution where ABC is a single request with 3 output tokens. In my case, I am calling the LLM 3 times, each with a new request prompt. Without block reuse, the KV has to be computed again for each new request, but with block reuse the new request will reuse the previous ones.
But since you said that only the prefill phase is saved, it means that in the 1st step, when ABC was the request, the LLM computed the KV for D but only ABC is cached. So in the next request, when the request is ABCD, the KV of D has to be computed again along with that of E. So each LLM request has to generate 2 tokens in my case. Is this the correct understanding?
@ekagra-ranjan - that is the behavior you'd see, but not because D's KV cache isn't saved from step 1 to 2; it is because of another restriction - we currently only reuse precomputed KV caches at the granularity of a cache block, so a single token will not be reused. However, for a long enough ABC (at least one cache block), you should see a substantial improvement in time to first token going from step 1 (first computation) to steps 2 & 3 (reuse). The default KV cache page size is 128 tokens, so @littletomatodonkey's example should actually work. Let us take a look at what may be happening.
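As a rough illustration of the block-granularity point (assuming the default 128-token block size mentioned above), the number of prompt tokens actually eligible for reuse can be estimated like this; the helper name is made up for illustration:

```python
# Sketch of block-granularity reuse: only whole cache blocks of the matching
# prefix can be reused; the trailing partial block is recomputed in prefill.
def reusable_prefix_tokens(shared_prefix_len: int, block_size: int = 128) -> int:
    return (shared_prefix_len // block_size) * block_size

print(reusable_prefix_tokens(200))  # 128 -> one full block can be reused
print(reusable_prefix_tokens(100))  # 0   -> shorter than one block, nothing is reused
```

So a ~200-token shared prefix covers one full block, which is why the example above is still expected to benefit.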
Hi @schetlur-nv @ekagra-ranjan, prefix cache finally worked for me. GptManager or the executor must be used rather than the ModelRunner or ModelRunnerCpp interface, and enable_block_reuse also needs to be set. You may refer to: link
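For anyone landing here later, a minimal sketch of enabling block reuse through the executor's Python bindings might look roughly like the following; the module path, class names, and keyword arguments are assumptions based on recent TensorRT-LLM releases, so check them against your installed version:

```python
# Hedged sketch: enable KV cache block reuse via the executor API.
# Names below (module path, classes, kwargs) may differ across versions.
import tensorrt_llm.bindings.executor as trtllm

kv_cache_config = trtllm.KvCacheConfig(enable_block_reuse=True)
executor_config = trtllm.ExecutorConfig(kv_cache_config=kv_cache_config)
executor = trtllm.Executor("/path/to/engine_dir",  # hypothetical engine path
                           trtllm.ModelType.DECODER_ONLY,
                           executor_config)
```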
@schetlur-nv - so for the Draft Model approach in speculative sampling, where the draft model outputs max_draft_len tokens which then go to the Target model: if the KV cache block size = 128 and max_draft_len = 5, then the Target model will recompute the KV cache of tokens until the size of the KV cache reaches 128, right?
Let's assume the Target model always agrees with the Draft model, for simplicity. So if we were reusing cache at the "token level", the Target model would run 128/5 = 25.6 times, but now with "block level" reuse the Target model runs 5 + 10 + 15 + 20 + 25 + 30 + .... + 128/5 = 66.04 times, since nothing is going to be reused until we hit 128 tokens. Is this correct?
@littletomatodonkey How much throughput gain do you observe?
Hi @schetlur-nv - in addition to this, can you also share more detail on how KV cache reuse works, including the following, plus anything else of interest that is not mentioned in the doc:
- The doc does not mention that the cache is preserved at the "block level" rather than the "token level", i.e. each token is not guaranteed to be cached, only whole blocks of tokens. We only learn this from this GitHub issue.
- What is the cache eviction policy of the KV cache reuse approach?
Hi @renjie0, after using GptManager rather than the GptSession object, the prefix cache speedup finally works for me.
@schetlur-nv How can this be enabled through trtllm-build or launch_triton_server.py? Is this enabled by default?
edit: for future users, it's in config.pbtxt
Yes it's in the config file. Please refer to https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/kv_cache_reuse.md for more details.
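For reference, in the Triton tensorrtllm_backend this is controlled (as far as I remember, so please verify against your version of the backend) by a parameter in the tensorrt_llm model's config.pbtxt, along these lines:

```
parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "true"
  }
}
```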
@schetlur-nv, thank you very much. Regarding the info on that page, consider a prompt structured as:
"Please ask [user] how their day is going. Be sure to greet them by name".
If [user] changes every request, will the entire prompt after [user] be thrown out and regenerated? Or, supposing for the sake of example that blocks contain 2 tokens and each word is one token, would everything except "[user] how" be kept, and only that block reloaded?
The former - "Please ask" will be reused, and everything after it regenerated. This is because, mathematically, the activations for all tokens after [user] attend to [user] and therefore depend on it.
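To make that concrete, here is a toy sketch using the 2-tokens-per-block assumption from the question, with each word treated as one token and made-up names standing in for [user]:

```python
# Toy sketch: reuse covers only the longest common prefix, rounded down to
# whole blocks, because every later token's KV depends on the [user] token.
def reusable_tokens(cached_prompt, new_prompt, block_size=2):
    common = 0
    for cached_tok, new_tok in zip(cached_prompt, new_prompt):
        if cached_tok != new_tok:
            break
        common += 1
    return (common // block_size) * block_size

old = "Please ask Alice how their day is going".split()
new = "Please ask Bob how their day is going".split()
print(reusable_tokens(old, new))  # 2 -> only the "Please ask" block is reused
```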