Question about efficient memory sharing (prefix sharing)
I have a question about the efficient memory sharing feature. Do different sequences that share the same system prompt but append different user-input texts share the computation and memory for that system prompt?
For example, here are two input sequences:
- <|system|>You are a kind robot. <|user|>How's the weather today.
- <|system|>You are a kind robot. <|user|>Tell me a story.
Would these two input sequences share the computation and memory for the common prefix "<|system|>You are a kind robot. <|user|>"?
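For concreteness, here is a minimal sketch of how much of these two prompts is actually a common token prefix. It assumes the Hugging Face `transformers` tokenizer API, and the model name is purely illustrative:

```python
from transformers import AutoTokenizer  # assumed dependency for this sketch

tok = AutoTokenizer.from_pretrained("gpt2")  # model name is illustrative

a = tok.encode("<|system|>You are a kind robot. <|user|>How's the weather today.")
b = tok.encode("<|system|>You are a kind robot. <|user|>Tell me a story.")

# Count how many leading token ids the two sequences have in common.
shared = 0
for x, y in zip(a, b):
    if x != y:
        break
    shared += 1
print(f"common prefix: {shared} tokens (of {len(a)} and {len(b)})")
```

Note that any reuse inside vLLM would happen at the granularity of KV-cache blocks, so only whole blocks covered by the common token prefix could be shared.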
Thanks for bringing this up! Indeed, prefix sharing is an excellent opportunity to save even more memory and compute. We evaluated this setting in our research paper. However, our current implementation of the PagedAttention kernel with query sequence length > 1 is buggy and slow, so we didn't include it in the original release. We plan to add this feature in the future.
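For background, the vLLM paper describes this kind of sharing at the granularity of PagedAttention's KV-cache blocks: sequences with a common prefix point their leading logical blocks at the same physical blocks, tracked with reference counts (and copy-on-write when a shared block is modified). Below is a toy sketch of that bookkeeping; the names are illustrative, not vLLM internals:

```python
# Toy model of prefix sharing in a paged KV cache (illustrative only).
class BlockManager:
    def __init__(self):
        self.ref_count = {}   # physical block id -> number of sequences using it
        self.next_block = 0

    def allocate(self):
        # Give a sequence a fresh physical block.
        bid = self.next_block
        self.next_block += 1
        self.ref_count[bid] = 1
        return bid

    def share(self, bid):
        # A second sequence reuses an existing block instead of recomputing it.
        self.ref_count[bid] += 1
        return bid

mgr = BlockManager()
# Sequence A fills two blocks with the shared system prompt's KV cache.
prefix_blocks = [mgr.allocate(), mgr.allocate()]
seq_a = prefix_blocks + [mgr.allocate()]       # A's own user text
seq_b = [mgr.share(b) for b in prefix_blocks]  # B reuses the prefix blocks
seq_b.append(mgr.allocate())                   # B's own user text
print(mgr.ref_count)  # prefix blocks have ref_count 2; the others 1
```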
+1 for this use case. This could be hugely impactful. Is this ticket the best way to track the status of this feature request?
Even with query sequence length 1, if we could mark all tokens from prefixes as persistent in the cache, it could bring some speedup to inference.
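A rough sketch of what "persistent" could mean at the block level: pinned blocks are excluded from eviction, so a hot system prompt's KV cache survives across requests. The `pin`/`evict_candidates` names are hypothetical, not vLLM APIs:

```python
# Hypothetical sketch: an allocator that never evicts pinned blocks.
class PinnedCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.pinned = set()

    def pin(self, block_ids):
        # Blocks holding a shared prefix are marked persistent.
        self.pinned.update(block_ids)

    def evict_candidates(self, in_use):
        # Eviction may reclaim any in-use block that is not pinned.
        return [b for b in in_use if b not in self.pinned]
```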
+1
I have implemented a simple version of the prefix cache feature, and it shows significant performance improvements in certain scenarios. Is this feature wanted? If so, I can prepare a detailed design plan for your review. Many places in the code will need changes, so I will start development after the review. @zhuohan123
Here are some performance test results:
- Compared to the baseline, the prefix cache increases throughput by 29%. At high load (15 QPS), time to first token drops, and average latency per request decreases by more than 60%.
- The benefit of the prefix cache depends on both the prefix length and the input length.

For each request, the prefix length is 200 tokens, the input length is 30, and the output length is 50.
| Load (QPS) | Method | Requests/s | Avg Latency per Req (s) | First Token Time (s) |
|---|---|---|---|---|
| 10 | Prefix Cache | 9.83 | 1.97 | 0.29 |
| 10 | Base | 9.80 | 2.87 | 0.45 |
| 15 | Prefix Cache | 14.30 | 2.98 | 0.39 |
| 15 | Base | 13.24 | 8.65 | 1.02 |
| 25 | Prefix Cache | 19.81 | 6.46 | 0.84 |
| 25 | Base | 14.08 | 13.67 | 4.74 |
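For readers wondering what such a cache might look like, one common approach (a sketch only, not @sleepcoo's actual implementation) is to key precomputed KV-cache blocks by the prefix's token ids:

```python
class PrefixCache:
    """Sketch: map a prefix's token ids to its precomputed KV-cache block ids."""

    def __init__(self):
        # A real implementation would likely hash per cache block; a tuple of
        # token ids keeps the sketch simple.
        self.table = {}

    def lookup(self, token_ids):
        # Return the cached blocks for this exact prefix, or None on a miss.
        return self.table.get(tuple(token_ids))

    def insert(self, token_ids, block_ids):
        self.table[tuple(token_ids)] = block_ids

cache = PrefixCache()
prefix = [101, 2023, 2003]               # token ids of the shared prefix
cache.insert(prefix, block_ids=[0, 1])
assert cache.lookup(prefix) == [0, 1]    # hit: skip prefill for these tokens
```

On a hit, prefill only has to run over the 30 new input tokens rather than all 230, which is consistent with the large drop in first-token time in the table above.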
Great work! One question: does the QPS in your table refer to the number of concurrent requests? In my understanding, "Requests/s" should be the QPS. If I am wrong, please correct me. Thank you!
The first column, QPS, is the offered load in requests per second. The "Requests/s" column is the throughput actually achieved under that load. For example, at an offered load of 25 QPS the base only serves 14.08 requests/s, meaning it is saturated, while the prefix cache sustains 19.81 requests/s.
@sleepcoo Any way I could be helpful here? I am interested in working on this too.
You can try the implementation at https://github.com/vllm-project/vllm/pull/1669; it's quite comprehensive. I've given up on my implementation 😞 @jadielam
Thanks for the pointer. This will save me some time.