
[Question] Any documentation for the "prefix caching" feature?

Open littletomatodonkey opened this issue 1 year ago • 3 comments

Hi, thanks for your great work. According to issue #620, prefix caching is supported in TensorRT-LLM. I'd like to know whether there is any documentation for this feature. Thanks!

littletomatodonkey avatar Feb 04 '24 05:02 littletomatodonkey

Here is some documentation: https://github.com/NVIDIA/TensorRT-LLM/blob/3d56a445e8ebf888e78be638faf6beec0a78f3c2/docs/source/batch_manager.md?plain=1#L137.

byshiue avatar Feb 04 '24 09:02 byshiue

Thanks. How can I use this flag during LLaMA export and inference? It seems that no kvCacheConfig is set in build.py or ../run.py.

littletomatodonkey avatar Feb 04 '24 13:02 littletomatodonkey

@byshiue Hi, I tested here:

https://github.com/NVIDIA/TensorRT-LLM/blob/3d56a445e8ebf888e78be638faf6beec0a78f3c2/tensorrt_llm/runtime/model_runner_cpp.py#L166

and added enable_block_reuse=True when building the KvCacheConfig, but the profiles show the same results with enable_block_reuse=True and enable_block_reuse=False.

I tested the same prompt 10 times, with warmup taken into account. The model is Yi-6B-Chat and the script I use is run.py.

Can you help take a look? Thanks!

littletomatodonkey avatar Feb 05 '24 10:02 littletomatodonkey

Does prefix caching need to be combined with a specific KV cache chunking method? For example, if I want to add enable_block_reuse=True, should I also enable paged_kv_cache when building?

shiqingzhangCSU avatar Mar 21 '24 11:03 shiqingzhangCSU

I have the same situation: the profiles show the same prefill time with enable_block_reuse=True and enable_block_reuse=False.

shiqingzhangCSU avatar Mar 22 '24 09:03 shiqingzhangCSU

To enable enable_block_reuse, you need to enable paged_kv_cache and use_paged_context_fmha.

byshiue avatar Mar 27 '24 08:03 byshiue

Do you know how to use system prompt caching with Qwen1.5?

MuyeMikeZhang avatar Apr 03 '24 09:04 MuyeMikeZhang

It is the same for all models: you need to enable paged_kv_cache and use_paged_context_fmha.

byshiue avatar Apr 10 '24 09:04 byshiue

Hi, thanks for your reply. I set these flags in a scenario with ~200 identical prefix tokens (100 differing tokens, and about 400 output tokens). The QPS seems unchanged (before: 29.366 QPS, after: 29.585 QPS). I use ModelRunnerCpp based on TensorRT-LLM 0.9.0. Could you please help take a look at the performance? Thanks!

--paged_kv_cache "enable" \
--use_paged_context_fmha "enable" \

Full conversion and build scripts:

python convert_checkpoint.py \
--model_dir ${hf_model_dir} \
--output_dir ${tmp_dir} \
--dtype float16

trtllm-build \
--checkpoint_dir ${tmp_dir} \
--output_dir ${trt_model_dir} \
--remove_input_padding "enable" \
--gpt_attention_plugin "float16" \
--context_fmha "enable" \
--gemm_plugin="float16" \
--max_batch_size 256 \
--max_input_len 1024 \
--max_output_len 800 \
--paged_kv_cache "enable" \
--use_paged_context_fmha "enable"

littletomatodonkey avatar Apr 12 '24 13:04 littletomatodonkey

Any follow-up?

renjie0 avatar Apr 16 '24 07:04 renjie0

@littletomatodonkey Could you also share the scripts to run inference?

byshiue avatar Apr 17 '24 00:04 byshiue

I think the above is happening because we need to set enableBlockReuse to True explicitly, but the flag is not exposed for the user to toggle in the Python bindings of the C++ runtime.

@byshiue can you confirm if this is the case?

ekagra-ranjan avatar Apr 17 '24 04:04 ekagra-ranjan

Hi @ekagra-ranjan, I fixed the code in TensorRT-LLM as follows, but the QPS does not change. It seems this is a problem, but not the key point.

        session_config.kv_cache_config = KvCacheConfig(
            free_gpu_memory_fraction=free_gpu_memory_fraction,
            max_attention_window=max_attention_window_size,
            sink_token_length=sink_token_length,
            enable_block_reuse=True)

littletomatodonkey avatar Apr 17 '24 06:04 littletomatodonkey

@byshiue Hi, this is my inference code.

import numpy as np
import torch


class TRTModel090(object):
    def __init__(self, config, *inputs, **kwargs):
        import tensorrt_llm
        from tensorrt_llm.runtime import ModelRunnerCpp as ModelRunner
        super().__init__()

        self.config = config  # used later by get_sample_config
        self.world_size = tensorrt_llm.mpi_world_size()
        self.runtime_rank = tensorrt_llm.mpi_rank()

        runner_kwargs = dict(engine_dir=config.name_or_path,
                             rank=self.runtime_rank)
        self.runner = ModelRunner.from_dir(**runner_kwargs)

    def get_sample_config(self, batch_size: int, **kwargs):
        import tensorrt_llm
        from tensorrt_llm.runtime import SamplingConfig
        kwargs = {
            **self.config.__dict__,
            **kwargs,
        }
        sampling_fields = SamplingConfig.__annotations__
        if kwargs.get('do_sample', False):
            seed = np.random.randint(
                low=0, high=2**31, size=(batch_size,))
            if self.world_size > 1:
                tensorrt_llm._utils.mpi_comm().Bcast(seed, root=0)
            seed = torch.tensor(seed)
        else:
            if "top_p" in kwargs:
                kwargs["top_p"] = 0.0
            if "top_k" in kwargs:
                kwargs["top_k"] = 0
            seed = None

        sampling_config = SamplingConfig(
            **{k: v for k, v in kwargs.items() if k in sampling_fields},
            end_id=kwargs['eos_token_id'],
            pad_id=kwargs['pad_token_id'],
        )
        sampling_config.random_seed = seed
        return sampling_config

    def generate(self, *args, **kwargs):
        input_ids = kwargs['input_ids']
        attention_mask = kwargs['attention_mask'].type(torch.bool)
        batch_input_ids = [input_ids[idx][attention_mask[idx]]
                           for idx in range(input_ids.size(0))]
        sampling_config = self.get_sample_config(input_ids.size(0), **kwargs)

        import time
        st = time.time()
        with torch.no_grad():
            outputs = self.runner.generate(
                batch_input_ids,
                sampling_config=sampling_config,
                max_attention_window_size=None,
                sink_token_length=None,
                output_sequence_lengths=True,
                return_dict=True)
            torch.cuda.synchronize()
        # return outputs
        output_ids = outputs["output_ids"]
        seq_lens = outputs["sequence_lengths"]
        ct = time.time() - st
        print(f"cost time: {ct :.4f} s")
        output_ids = [
            output_id[0][len(batch_input_ids[idx]):seq_len.item()] for idx, (output_id, seq_len) in enumerate(zip(output_ids, seq_lens))]
        outputs = {
            "input_ids": input_ids,
            "output_ids": output_ids,
        }
        return outputs

littletomatodonkey avatar Apr 17 '24 06:04 littletomatodonkey

Hey, this looks like the right set of changes. Would you be able to retry with a longer input and a shorter output length (ideally a generation length of 1, to isolate)? Reuse only applies to the context (prefill) phase, and it could be that your runtime is dominated by the generation phase, so any gains are not noticeable.
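
If it helps, here is a rough sketch of that experiment using the runner, inputs, and sampling config from your script above (max_new_tokens is assumed to be accepted by ModelRunnerCpp.generate in your version):

import time
import torch

def time_prefill(runner, batch_input_ids, sampling_config, repeats=2):
    # Isolate the context (prefill) phase by generating a single token and
    # running the exact same prompt repeatedly; if block reuse kicks in, the
    # second pass's prefill should be noticeably faster than the first.
    for attempt in range(repeats):
        start = time.time()
        runner.generate(batch_input_ids,              # identical prompt each pass
                        sampling_config=sampling_config,
                        max_new_tokens=1,             # assumed kwarg: gen length 1
                        output_sequence_lengths=True,
                        return_dict=True)
        torch.cuda.synchronize()
        print(f"pass {attempt}: {time.time() - start:.4f} s")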

schetlur-nv avatar Apr 19 '24 16:04 schetlur-nv

@schetlur-nv Thanks for your reply. I tried output len = 1 and input len = 400 (with the same 200-token prefix), but the QPS seems the same when I enable the prefix cache setting (140 QPS vs 139 QPS).

littletomatodonkey avatar Apr 19 '24 23:04 littletomatodonkey

@schetlur-nv - I have an application where, given a context ABC and output_len of 3, I want to generate one token at a time and then call the LLM again with a fresh prompt that includes the new token, 3 times in total, i.e.:

request(ABC) -> llm -> D
request(ABCD) -> llm -> E
request(ABCDE) -> llm -> F

This is a bit different from the normal execution, where ABC is a single request with 3 output tokens. In my case, I am calling the LLM 3 times, each with a new request prompt. Without block reuse, the KV cache has to be computed again for each new request; with block reuse, each new request will reuse the previous one's cache.

But since you said that only the prefill phase is saved, that means in the 1st step, when ABC was the request, the LLM computed the KV for D but only ABC was cached. So in the next request, when the request is ABCD, the KV of D has to be computed again along with that of E, and each LLM request effectively has to compute the KV for 2 tokens in my case. Is this a correct understanding?

ekagra-ranjan avatar Apr 19 '24 23:04 ekagra-ranjan

@ekagra-ranjan - that is the behavior you'd see, but not because D's KV cache isn't saved from step 1 to step 2; it is because of another restriction: we currently only reuse precomputed KV caches at the granularity of a cache block, so a single token will not be reused. However, for a long enough ABC (at least one cache block), you should see a substantial improvement in time to first token going from step 1 (first computation) to steps 2 and 3 (reuse). The default KV cache page size is 128 tokens, so @littletomatodonkey's example should actually work. Let us take a look at what may be happening.
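
To make the block-granularity point concrete, here is a toy illustration in plain Python (not TensorRT-LLM code; the 128-token block size is just the default mentioned above): only the portion of a shared prefix that fills whole cache blocks can be reused, and the remainder is recomputed during prefill.

BLOCK_SIZE = 128  # default KV cache page size

def reusable_tokens(shared_prefix_len: int, block_size: int = BLOCK_SIZE) -> int:
    # Only complete blocks of the shared prefix are eligible for reuse.
    return (shared_prefix_len // block_size) * block_size

for prefix_len in (5, 127, 128, 200, 300):
    reused = reusable_tokens(prefix_len)
    print(f"shared prefix {prefix_len:>3} tokens -> "
          f"{reused:>3} reused, {prefix_len - reused:>3} recomputed")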

schetlur-nv avatar Apr 22 '24 21:04 schetlur-nv

Hi @schetlur-nv @ekagra-ranjan, prefix caching finally worked for me. GptManager or the Executor must be used rather than the ModelRunner or ModelRunnerCpp interface, and enable_block_reuse also needs to be set. You may refer to: link
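
For anyone who lands here later, a rough sketch of what driving the engine through the executor Python bindings with block reuse enabled can look like (this is not the content of the link above; the engine path and token ids are hypothetical, and class/argument names follow the 0.9/0.10-era bindings, so they may differ in other releases):

import tensorrt_llm.bindings.executor as trtllm

# The engine itself must have been built with --paged_kv_cache and
# --use_paged_context_fmha (see the build command earlier in this thread).
# KV cache block reuse is then requested through the executor's KvCacheConfig.
kv_cache_config = trtllm.KvCacheConfig(enable_block_reuse=True,
                                       free_gpu_memory_fraction=0.9)
executor_config = trtllm.ExecutorConfig(kv_cache_config=kv_cache_config)

executor = trtllm.Executor("/path/to/engine_dir",       # hypothetical path
                           trtllm.ModelType.DECODER_ONLY,
                           executor_config)

# Requests that share a sufficiently long token prefix should reuse the
# cached blocks during their context (prefill) phase.
request = trtllm.Request(input_token_ids=[1, 2, 3, 4],  # dummy token ids
                         max_new_tokens=16)             # may be named max_tokens in newer versions
request_id = executor.enqueue_request(request)
for response in executor.await_responses(request_id):
    print(response.result.output_token_ids)

executor.shutdown()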

littletomatodonkey avatar Apr 23 '24 11:04 littletomatodonkey

@schetlur-nv - so for the draft-model approach in speculative sampling, where the draft model outputs max_draft_len tokens that then go to the target model: if the KV cache block size = 128 and max_draft_len = 5, then the target model will recompute the KV cache of tokens until the cache size reaches 128, right?

Let's assume the target model always agrees with the draft model, for simplicity. So if we were reusing the cache at the "token level", the target model would run 128/5 = 25.6 times, but now with "block level" reuse the target model runs 5 + 10 + 15 + 20 + 25 + 30 + .... + 128/5 = 66.04 times, since nothing is going to be reused unless we hit 128 tokens. Is this correct?

ekagra-ranjan avatar Apr 23 '24 14:04 ekagra-ranjan

@littletomatodonkey How much throughput gain do you observe?

renjie0 avatar Apr 24 '24 16:04 renjie0

Hi @schetlur-nv - in addition to this, can you also share more detail on how KV cache reuse works, including the following plus anything of interest that is not mentioned in the doc:

  • The doc does not mention that the cache is preserved at the "block level" rather than the "token level", i.e. each individual token is not guaranteed to be cached, only whole blocks of tokens. We only learn this from this GitHub issue.
  • What is the cache eviction policy of the KV cache reuse approach?

ekagra-ranjan avatar Apr 24 '24 20:04 ekagra-ranjan

Hi @renjie0, after using GptManager rather than a GptSession object, the prefix cache speedup finally works for me.

littletomatodonkey avatar May 02 '24 14:05 littletomatodonkey

@schetlur-nv How can this be enabled through trtllm-build or launch_triton_server.py? Is this enabled by default?

Edit: for future users, it's in config.pbtxt.

ethan-digi avatar May 16 '24 21:05 ethan-digi

Yes, it's in the config file. Please refer to https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/kv_cache_reuse.md for more details.

schetlur-nv avatar May 16 '24 22:05 schetlur-nv

@schetlur-nv, thank you very much. Regarding the info on that page, consider a prompt structured as:

"Please ask [user] how their day is going. Be sure to greet them by name".

If [user] changes with every request, will the entire prompt after [user] be thrown out and regenerated? Or, supposing for the sake of example that blocks contain 2 tokens and each word is one token, would everything except "[user] how" be kept and only that block reloaded?

ethan-digi avatar May 16 '24 23:05 ethan-digi

The former - "Please ask" will be reused, and everything after regenerated. This is because mathematically, activations for all tokens after [user] will attend to [user] and therefore will depend on it.

schetlur-nv avatar May 17 '24 20:05 schetlur-nv