[router] LSH based prefix cache aware router
🚀 Feature Description and Motivation
Right now, we're using xxhash in https://github.com/aibrix/aibrix/pull/641 for our prefix cache-aware router. We might consider switching to a consistent hash + LSH-based approach, which could reduce accuracy a bit but would simplify scaling. Here are some related discussions: https://github.com/vllm-project/production-stack/issues/59#issuecomment-2656740442.
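To make the proposal concrete, here is a minimal sketch of what "consistent hash + LSH" routing could look like. This is not aibrix code; the pod names, `bucket_bits`, and the `blake2b`-based helper are illustrative (aibrix currently uses xxhash), and SimHash stands in for whatever LSH family we'd actually pick:

```python
import bisect
import hashlib

def _h64(s: str) -> int:
    """Stable 64-bit hash (illustrative helper; aibrix itself uses xxhash)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

def simhash(tokens, bits=64):
    """SimHash: prompts that share most of their tokens get hashes that
    differ in only a few bits, so they tend to land in the same bucket."""
    counts = [0] * bits
    for tok in tokens:
        h = _h64(tok)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

class HashRing:
    """Minimal consistent-hash ring: adding or removing a pod only remaps
    ~1/N of the keys, which is the scaling win over plain modulo hashing."""
    def __init__(self, pods, vnodes=64):
        self.ring = sorted((_h64(f"{p}#{v}"), p) for p in pods for v in range(vnodes))
        self.points = [pt for pt, _ in self.ring]

    def route(self, key: int) -> str:
        idx = bisect.bisect(self.points, key) % len(self.ring)
        return self.ring[idx][1]

def pick_pod(ring, prompt, bucket_bits=16):
    """Route on a coarse LSH bucket (top bits of the simhash) so that
    near-identical prompts usually map to the same ring key."""
    bucket = simhash(prompt.split()) >> (64 - bucket_bits)
    return ring.route(_h64(str(bucket)))

ring = HashRing(["pod-a", "pod-b", "pod-c"])
target = pick_pod(ring, "summarize this long document about consistent hashing")
```

Two requests sharing most of their prefix usually fall into the same bucket and hence the same pod; the trade-off, as noted above, is that this is approximate and can lose a bit of cache-hit accuracy compared with exact block hashing.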
Use Case
N/A
Proposed Solution
No response
@varungup90 @DwyaneShi can you spend some time on this issue?
It's just a proposal; I don't know whether it helps in the chat use case, but it works well for long-document QA. Ref https://github.com/vllm-project/production-stack/issues/59#issuecomment-2658633045
I will spend some time implementing this as an alternative to the decision-tree or composite-metrics-based algorithms.
Is our prefix cache implementation similar to CacheBlend or EPIC? I ask because I saw the text is chunked. I also see other approaches like KVShare: https://arxiv.org/abs/2410.18517. Have we tried those before?
@kerthcet A little bit different. Currently, the primary work still relies on vLLM's automatic prefix cache, without additional KV cache compression or reuse capabilities; it's more on the routing side, so I would say those are orthogonal. KVShare is also a different smart solution to share KV cache across layers. Most of the above solutions have some side effects, so I think they're not in scope yet.
v0.3.0 has enough routing strategies invented and improved:
- Preble (radix tree + prediction-based load awareness)
- Fairness
- Prefix Cache (hashing block) + heuristic load awareness
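For context, the "Prefix Cache (hashing block)" strategy can be sketched roughly as follows. This is a simplification, not the actual aibrix implementation: `BLOCK_SIZE`, the SHA-256 chaining, and the load tie-break are illustrative assumptions in the spirit of vLLM's automatic prefix caching:

```python
import hashlib

BLOCK_SIZE = 16  # tokens per hash block (illustrative; real defaults differ)

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chained block hashes: block i's hash depends on all tokens before it,
    so a match on hash i implies the whole prefix up to block i matches."""
    hashes, prev = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % block_size, block_size):
        h = hashlib.sha256(prev + str(token_ids[i:i + block_size]).encode()).digest()
        hashes.append(h)
        prev = h
    return hashes

def route(token_ids, pod_blocks, pod_load):
    """Pick the pod whose cached block hashes share the longest prefix with
    the request; break ties (and the cold-start case) by lowest load."""
    req = block_hashes(token_ids)

    def matched(pod):
        n = 0
        for h in req:
            if h not in pod_blocks[pod]:
                break
            n += 1
        return n

    return max(pod_blocks, key=lambda p: (matched(p), -pod_load[p]))

pods = {"pod-a": set(), "pod-b": set(block_hashes(list(range(32))))}
load = {"pod-a": 1, "pod-b": 5}
route(list(range(32)), pods, load)  # → "pod-b": two matching blocks beat the colder pod
```

The heuristic load awareness enters through the tie-break: a pod with cached blocks wins, but among cache-cold pods the least-loaded one is chosen.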
Due to limited bandwidth, I will move CHWBL (Consistent Hashing with Bounded Loads) to the next release.
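For anyone picking CHWBL up later, the core idea fits in a short sketch. This is a hedged illustration, not the planned aibrix code; `c`, `vnodes`, and the `blake2b` helper are assumptions:

```python
import bisect
import hashlib
import math

def _h64(s: str) -> int:
    """Stable 64-bit hash (illustrative helper)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

class CHWBL:
    """Consistent Hashing with Bounded Loads: walk clockwise from the key's
    ring position and skip any pod whose load already hits the cap
    ceil(c * (total_load + 1) / num_pods), so hot keys spill over to ring
    neighbors instead of overloading a single pod."""
    def __init__(self, pods, c=1.25, vnodes=8):
        self.c = c
        self.load = {p: 0 for p in pods}
        self.ring = sorted((_h64(f"{p}#{v}"), p) for p in pods for v in range(vnodes))
        self.points = [pt for pt, _ in self.ring]

    def assign(self, key: str) -> str:
        cap = math.ceil(self.c * (sum(self.load.values()) + 1) / len(self.load))
        start = bisect.bisect(self.points, _h64(key))
        for step in range(len(self.ring)):
            pod = self.ring[(start + step) % len(self.ring)][1]
            if self.load[pod] + 1 <= cap:
                self.load[pod] += 1
                return pod
        raise RuntimeError("all pods at capacity")
```

The appeal for prefix-cache routing is that the same prefix keeps hitting the same pod (cache affinity) until that pod is overloaded, at which point requests deterministically spill to the next pod on the ring.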