[Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching
Summary
Block Manager v2, unlike v1, did not account for LoRA or prompt adapters in the block hash used for prefix caching.
This PR injects the LoRA ID and prompt adapter ID into the block hash function so that prefix caching works correctly with LoRA and prompt adapters under Block Manager v2.
Detail
Block Manager v1 uses the following hash_of_block function to generate a content hash in prefix caching mode: https://github.com/vllm-project/vllm/blob/baa5467547a758af35f442af6edfbc0fb73c83ce/vllm/sequence.py#L460-L468
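Conceptually, the v1 hash covers the prefix token IDs up to and including the block together with the request's LoRA ID, so sequences using different LoRA adapters never share cached blocks. The following is a minimal sketch of that idea, not the exact linked implementation; the parameter names (e.g. `lora_int_id`) mirror attributes on vLLM's Sequence but are used here as assumptions:

```python
from typing import Tuple


def hash_of_block_v1_sketch(prefix_token_ids: Tuple[int, ...],
                            lora_int_id: int) -> int:
    """Simplified sketch of a v1-style content hash.

    The hash combines all tokens up to and including the block with the
    LoRA ID, so blocks produced under different LoRA adapters get
    different hashes and are not reused across adapters.
    """
    return hash((prefix_token_ids, lora_int_id))
```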
However, Block Manager v2 uses only the token IDs, as shown here: https://github.com/vllm-project/vllm/blob/baa5467547a758af35f442af6edfbc0fb73c83ce/vllm/core/block/prefix_caching_block.py#L855
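The fix is to feed the LoRA ID and prompt adapter ID into the v2 block hash alongside the token IDs. The sketch below illustrates one way to do that by passing extra hash keys into a v2-style `hash_block_tokens`; the `extra_keys` parameter and its wiring are assumptions for illustration, and the actual change in this PR may plumb the IDs differently:

```python
from typing import List, Optional, Tuple


def hash_block_tokens_sketch(is_first_block: bool,
                             prev_block_hash: Optional[int],
                             cur_block_token_ids: List[int],
                             extra_keys: Tuple[int, ...] = ()) -> int:
    """Sketch of a v2-style block hash extended with extra keys.

    `extra_keys` would carry (lora_int_id, prompt_adapter_id), so blocks
    generated under different adapters hash differently and the prefix
    cache never returns a block built for another adapter.
    """
    return hash((is_first_block, prev_block_hash,
                 *cur_block_token_ids, *extra_keys))


# Example: the same token IDs hash differently under different LoRA IDs,
# so the prefix cache keeps them separate.
h_base = hash_block_tokens_sketch(True, None, [1, 2, 3], extra_keys=(0, 0))
h_lora = hash_block_tokens_sketch(True, None, [1, 2, 3], extra_keys=(7, 0))
assert h_base != h_lora
```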