Question about efficient memory sharing (prefix sharing)
I have a question about the efficient memory sharing feature. Do different sequences that share the same system prompt but append different user-input texts share the computation and memory for that system prompt?
For example, here are two input sequences:
- <|system|>You are a kind robot. <|user|>How's the weather today.
- <|system|>You are a kind robot. <|user|>Tell me a story.
Would these two input sequences share the computation and memory for the common prefix "<|system|>You are a kind robot. <|user|>"?
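For concreteness, here is a minimal sketch of how much of these two prompts is actually a common token prefix. It assumes the Hugging Face `transformers` tokenizer API, and the model name is purely illustrative:

```python
from transformers import AutoTokenizer  # assumed dependency for this sketch

tok = AutoTokenizer.from_pretrained("gpt2")  # model name is illustrative

a = tok.encode("<|system|>You are a kind robot. <|user|>How's the weather today.")
b = tok.encode("<|system|>You are a kind robot. <|user|>Tell me a story.")

# Count how many leading token ids the two sequences have in common.
shared = 0
for x, y in zip(a, b):
    if x != y:
        break
    shared += 1
print(f"common prefix: {shared} tokens (of {len(a)} and {len(b)})")
```

Note that any reuse inside vLLM would happen at the granularity of KV-cache blocks, so only whole blocks covered by the common token prefix could be shared.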
Thanks for bringing this up! Indeed, prefix sharing is an excellent opportunity to save even more memory and compute. We evaluated this setting in our research paper. However, our current implementation of the PagedAttention kernel with query sequence length > 1 is buggy and slow, so we didn't include it in the original release. We plan to add this feature in the future.
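For background, the vLLM paper describes this kind of sharing at the granularity of PagedAttention's KV-cache blocks: sequences with a common prefix point their leading logical blocks at the same physical blocks, tracked with reference counts (and copy-on-write when a shared block is modified). Below is a toy sketch of that bookkeeping; the names are illustrative, not vLLM internals:

```python
# Toy model of prefix sharing in a paged KV cache (illustrative only).
class BlockManager:
    def __init__(self):
        self.ref_count = {}   # physical block id -> number of sequences using it
        self.next_block = 0

    def allocate(self):
        # Give a sequence a fresh physical block.
        bid = self.next_block
        self.next_block += 1
        self.ref_count[bid] = 1
        return bid

    def share(self, bid):
        # A second sequence reuses an existing block instead of recomputing it.
        self.ref_count[bid] += 1
        return bid

mgr = BlockManager()
# Sequence A fills two blocks with the shared system prompt's KV cache.
prefix_blocks = [mgr.allocate(), mgr.allocate()]
seq_a = prefix_blocks + [mgr.allocate()]       # A's own user text
seq_b = [mgr.share(b) for b in prefix_blocks]  # B reuses the prefix blocks
seq_b.append(mgr.allocate())                   # B's own user text
print(mgr.ref_count)  # prefix blocks have ref_count 2; the others 1
```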
+1 for this use case. This could be hugely impactful. Is this ticket the best way to track the status of this feature request?
Even with query sequence length 1, if we could mark all tokens from prefixes as persistent in the cache, it could bring some speedup to inference.
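A rough sketch of what "persistent" could mean at the block level: pinned blocks are excluded from eviction, so a hot system prompt's KV cache survives across requests. The `pin`/`evict_candidates` names are hypothetical, not vLLM APIs:

```python
# Hypothetical sketch: an allocator that never evicts pinned blocks.
class PinnedCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.pinned = set()

    def pin(self, block_ids):
        # Blocks holding a shared prefix are marked persistent.
        self.pinned.update(block_ids)

    def evict_candidates(self, in_use):
        # Eviction may reclaim any in-use block that is not pinned.
        return [b for b in in_use if b not in self.pinned]
```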
+1
I have implemented a simple version of the prefix cache feature, and it shows significant performance improvements in certain scenarios. Is this feature wanted? If so, I can prepare a detailed design plan for your review. Many places in the code will need changes, so I will start development after the review. @zhuohan123
Here are some performance test results:
- Compared to the baseline, the prefix cache increases throughput by 29%. At high load (15 QPS), time to first token drops, and average latency per request decreases by more than 60%.
- The benefit of the prefix cache depends on both the prefix length and the input length.

For each request, the prefix length is 200 tokens, the input length is 30, and the output length is 50.
| Load (QPS) | Method | Requests/s | Avg Latency per Req (s) | First Token Time (s) |
|---|---|---|---|---|
| 10 | Prefix Cache | 9.83 | 1.97 | 0.29 |
| 10 | Base | 9.80 | 2.87 | 0.45 |
| 15 | Prefix Cache | 14.30 | 2.98 | 0.39 |
| 15 | Base | 13.24 | 8.65 | 1.02 |
| 25 | Prefix Cache | 19.81 | 6.46 | 0.84 |
| 25 | Base | 14.08 | 13.67 | 4.74 |
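For readers wondering what such a cache might look like, one common approach (a sketch only, not @sleepcoo's actual implementation) is to key precomputed KV-cache blocks by the prefix's token ids:

```python
class PrefixCache:
    """Sketch: map a prefix's token ids to its precomputed KV-cache block ids."""

    def __init__(self):
        # A real implementation would likely hash per cache block; a tuple of
        # token ids keeps the sketch simple.
        self.table = {}

    def lookup(self, token_ids):
        # Return the cached blocks for this exact prefix, or None on a miss.
        return self.table.get(tuple(token_ids))

    def insert(self, token_ids, block_ids):
        self.table[tuple(token_ids)] = block_ids

cache = PrefixCache()
prefix = [101, 2023, 2003]               # token ids of the shared prefix
cache.insert(prefix, block_ids=[0, 1])
assert cache.lookup(prefix) == [0, 1]    # hit: skip prefill for these tokens
```

On a hit, prefill only has to run over the 30 new input tokens rather than all 230, which is consistent with the large drop in first-token time in the table above.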
Great work! One question: does the QPS in your table refer to the number of concurrent requests? In my understanding, "Requests/s" should be the QPS. If I am wrong, please correct me. Thank you!
The first column, QPS, is the offered load in requests per second. The "Requests/s" column is the throughput actually achieved under that load. For example, at an offered load of 25 QPS the base only serves 14.08 requests/s, meaning it is saturated, while the prefix cache sustains 19.81 requests/s.
@sleepcoo Any way I could be helpful here? I am interested in working on this too.
You can try the implementation at https://github.com/vllm-project/vllm/pull/1669; it's quite comprehensive. I've given up on my implementation 😞 @jadielam
Thanks for the pointer. This will save me some time.