
Generation with prefix cache is slower than without it?

Open vin136 opened this issue 11 months ago • 7 comments

I'm running the tutorial vllm/offline_inference_with_prefix.py and measuring the generation times; below is the same code with timing added.

```python
import time

from vllm import LLM, SamplingParams

prefix = (
    "You are an expert school principal, skilled in effectively managing "
    "faculty and staff. Draft 10-15 questions for a potential first grade "
    "Head Teacher for my K-12, all-girls', independent school that emphasizes "
    "community, joyful discovery, and life-long learning. The candidate is "
    "coming in for a first-round panel interview for a 8th grade Math "
    "teaching role. They have 5 years of previous teaching experience "
    "as an assistant teacher at a co-ed, public school with experience "
    "in middle school math teaching. Based on these information, fulfill "
    "the following paragraph: ")

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0)

if __name__ == '__main__':
    # Create an LLM.
    llm = LLM(model="facebook/opt-125m")

    generating_prompts = [prefix + prompt for prompt in prompts]

    # Generate texts from the prompts. The output is a list of RequestOutput
    # objects that contain the prompt, generated text, and other information.
    st = time.perf_counter()
    outputs = llm.generate(generating_prompts, sampling_params)
    end = time.perf_counter()
    print(f"without caching time:{end-st}")

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    print("-" * 80)

    # -1 since the last token can change when concatenating prompts.
    prefix_pos = len(llm.llm_engine.tokenizer.encode(prefix)) - 1

    # The llm.generate call will batch all prompts and send the batch at once
    # if resources allow. The prefix will only be cached after the first batch
    # is processed, so we need to call generate once to calculate the prefix
    # and cache it.
    outputs = llm.generate(generating_prompts[0],
                           sampling_params,
                           prefix_pos=[prefix_pos])

    # Subsequent batches can leverage the cached prefix.
    st = time.perf_counter()
    outputs = llm.generate(generating_prompts,
                           sampling_params,
                           prefix_pos=[prefix_pos] * len(generating_prompts))
    end = time.perf_counter()
    print(f"with caching time:{end-st}")

    # Print the outputs. You should see the same outputs as before.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Output:

```
with caching time:1.9611055543646216
without caching time:0.07439832389354706
```

VLLM: vllm==0.3.3

vin136 avatar Mar 02 '24 01:03 vin136

The automatic prefix caching commit seems to have been merged very recently and is labeled for the 0.3.4 release, so I assume some of the changes are not available in 0.3.3.

Update: I just tested that PR, https://github.com/vllm-project/vllm/pull/2762, and I can also confirm it's slower than the original. I think the PR mentions that performance is not optimized yet.
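
For reference, a minimal sketch of what using automatic prefix caching looks like, assuming a vLLM version that ships it (0.3.4 or later) where the `LLM` constructor accepts `enable_prefix_caching`; with this path there is no `prefix_pos` argument:

```python
# Minimal sketch, assuming vllm >= 0.3.4 with automatic prefix caching.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0.0)

# Shared prefixes are detected and reused at the KV-cache block level;
# no prefix_pos argument is passed.
outputs = llm.generate(["<shared prefix> Hello, my name is"], sampling_params)
```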

shixianc avatar Mar 03 '24 21:03 shixianc

Although I have not tested the original prefix cache, I am seeing something strange that may point to the benchmark data being skewed.

One would assume, like @vin136, that the second request onward would be fastest (due to caching). In my simple tests, the second request is always the slowest by a significant margin, at least for the master vLLM checkout I did today with automatic prefix caching enabled.

So my suggestion is to collect data from request No. 3 onward rather than No. 2 and see whether the new implementation is slower than the old one.
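
A rough way to time each request separately and discard the warm-up runs might look like the sketch below (assumes the `llm`, `generating_prompts`, and `sampling_params` objects from the script above):

```python
import time

# Time several identical requests back to back; treat the first two
# runs as warm-up and only compare the steady-state timings.
timings = []
for _ in range(5):
    st = time.perf_counter()
    llm.generate(generating_prompts, sampling_params)
    timings.append(time.perf_counter() - st)

print("all runs:", timings)
print("steady state (run 3 onward):", timings[2:])
```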

Qubitium avatar Mar 05 '24 04:03 Qubitium

Yup - we are working this week on optimizing the performance. The original PR focused on correctness.

Once we have the performance, we will focus on enabling it by default.

robertgshaw2-redhat avatar Mar 10 '24 22:03 robertgshaw2-redhat

@robertgshaw2-neuralmagic Do you have any idea why, with prefix caching on, the second request is actually slower by a significant margin? I am repeating the request 3 times serially, with a several-second delay between each request. The order of speed from fastest to slowest is 3, 1, 2, where 2 is the slowest by a huge margin. This seems counter-intuitive to new users who expect it to behave like normal caching.

Qubitium avatar Mar 11 '24 01:03 Qubitium

> @robertgshaw2-neuralmagic Do you have any idea why, with prefix caching on, the second request is actually slower by a significant margin? I am repeating the request 3 times serially, with a several-second delay between each request. The order of speed from fastest to slowest is 3, 1, 2, where 2 is the slowest by a huge margin. This seems counter-intuitive to new users who expect it to behave like normal caching.

It's hard to tell without a bit more info about the request pattern. Can you share a snippet of the client code?

In general, you should not expect to see any speedup at the moment. It's still experimental, and we are working on the performance of the eviction data structure.

robertgshaw2-redhat avatar Mar 11 '24 23:03 robertgshaw2-redhat

@robertgshaw2-neuralmagic Thanks, we're really looking forward to the optimization!

Also, could you clarify the expected behavior of this feature (see the sketch after this list)?

  1. Within the same batch, the first N tokens of the requests are shared.
  2. In a second batch, the first N tokens of the remaining requests are shared with the requests from the first batch.

Which of the above is the expected behavior? The main difference is whether we need to let the vLLM engine complete the prompt-processing phase for one request first, and only then send the rest of the common-prefix requests.
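
To make the two scenarios concrete, here is a sketch of the two submission patterns being asked about (names reused from the script above; which one the engine actually optimizes for is exactly the open question here):

```python
# Pattern 1: submit all common-prefix requests in a single batch.
outputs = llm.generate(generating_prompts, sampling_params)

# Pattern 2: submit one request first so its prompt phase finishes and the
# prefix is (presumably) cached, then send the remaining requests.
_ = llm.generate(generating_prompts[:1], sampling_params)
outputs = llm.generate(generating_prompts[1:], sampling_params)
```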

shixianc avatar Mar 12 '24 16:03 shixianc

Just to confirm: not only is it not optimized, it is also not enabled by default, right? If I run:

```python
import vllm

model_id = 'meta-llama/Llama-2-7b-hf'
llm = vllm.LLM(model=model_id)
cache_config = llm.llm_engine.cache_config
print(cache_config.__dict__.keys())
```

I get

```
dict_keys(['block_size', 'gpu_memory_utilization', 'swap_space_bytes', 'cache_dtype', 'sliding_window', 'num_gpu_blocks', 'num_cpu_blocks'])
```

The `enable_prefix_caching` argument is not there... and when I try to initialize an `LLM` object with the parameter, I get:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input [In [21]] llm2 = vllm.LLM(model=model_id, enable_prefix_caching=True)
...
TypeError: __init__() got an unexpected keyword argument 'enable_prefix_caching'
```
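
For comparison, on a release that ships automatic prefix caching (0.3.4 or later, assumed here), the same check would be expected to accept the flag; a sketch, not verified against every release:

```python
import vllm

# Assumes vllm >= 0.3.4, where LLM() accepts enable_prefix_caching.
llm = vllm.LLM(model='meta-llama/Llama-2-7b-hf', enable_prefix_caching=True)

# On these versions the cache config is expected to carry the flag as well.
print(llm.llm_engine.cache_config.__dict__.keys())
```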

thefirebanks avatar Mar 28 '24 06:03 thefirebanks

With the latest version (0.4.0), it seems we cannot enable prefix caching with Mistral-type models (sliding window attention):

```python
if enable_caching and sliding_window is not None:
    raise NotImplementedError(
        "Sliding window is not allowed with prefix caching enabled!")
```

Are there any workarounds, or insights into why this is the case?

vin136 avatar Apr 06 '24 04:04 vin136

@vin136 I think I saw this mentioned in a different issue, but for now you can go into your model config and manually change the sliding window size to null.
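
Concretely, the workaround amounts to setting `"sliding_window": null` in a local copy of the model's config.json before loading it. A small sketch (the path is illustrative, not a real location):

```python
import json

# Hypothetical path to a local copy of the model's config.json.
config_path = "/path/to/local/mistral-model/config.json"

with open(config_path) as f:
    config = json.load(f)

# Disable sliding window attention; None is serialized as null.
config["sliding_window"] = None

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```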

DavidPeleg6 avatar Apr 08 '24 09:04 DavidPeleg6

> Do you have any idea why, with prefix caching on, the second request is actually slower by a significant margin? I am repeating the request 3 times serially, with a several-second delay between each request. The order of speed from fastest to slowest is 3, 1, 2, where 2 is the slowest by a huge margin. This seems counter-intuitive to new users who expect it to behave like normal caching.

In my test, even though I sent the same request three times, the response times were about the same. What does your test code look like? @Qubitium

Maxppddcsz avatar Apr 08 '24 12:04 Maxppddcsz

> > Do you have any idea why, with prefix caching on, the second request is actually slower by a significant margin? I am repeating the request 3 times serially, with a several-second delay between each request. The order of speed from fastest to slowest is 3, 1, 2, where 2 is the slowest by a huge margin. This seems counter-intuitive to new users who expect it to behave like normal caching.
>
> In my test, even though I sent the same request three times, the response times were about the same. What does your test code look like? @Qubitium

I also ran into the same situation. Do you have any idea 🤔?

HillZhang1999 avatar May 07 '24 10:05 HillZhang1999