[router] LSH based prefix cache aware router

Open gaocegege opened this issue 10 months ago • 6 comments

🚀 Feature Description and Motivation

Right now, we're using xxhash in https://github.com/aibrix/aibrix/pull/641 for our prefix cache-aware router. We might consider switching to a consistent hash + LSH-based approach, which could reduce accuracy a bit but would simplify scaling. Here are some related discussions: https://github.com/vllm-project/production-stack/issues/59#issuecomment-2656740442.
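
To make the idea concrete, here is a minimal sketch of one possible reading of "consistent hash + LSH", not the design from the linked discussion or anything in AIBrix. It assumes a SimHash signature over the first tokens of the prompt, with the top bits of that signature used as a bucket key on a consistent hash ring. All names (`h64`, `simhash`, `HashRing`, `route`, the pod names) are hypothetical, and `blake2b` stands in for xxhash only to keep the example stdlib-only.

```python
import bisect
import hashlib


def h64(data: str) -> int:
    """Stable 64-bit hash (blake2b is a stand-in for xxhash here)."""
    return int.from_bytes(hashlib.blake2b(data.encode(), digest_size=8).digest(), "big")


def simhash(tokens: list[str], shingle: int = 4, bits: int = 64) -> int:
    """SimHash signature: similar token sequences tend to differ in few bits."""
    counts = [0] * bits
    n = max(1, len(tokens) - shingle + 1)
    for i in range(n):
        hv = h64(" ".join(tokens[i:i + shingle]))
        for b in range(bits):
            counts[b] += 1 if (hv >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, pods, vnodes: int = 64):
        self._ring = sorted((h64(f"{pod}#{v}"), pod) for pod in pods for v in range(vnodes))
        self._keys = [k for k, _ in self._ring]

    def lookup(self, key: int) -> str:
        # First virtual node clockwise from the key, wrapping around the ring.
        idx = bisect.bisect_right(self._keys, key) % len(self._keys)
        return self._ring[idx][1]


def route(tokens: list[str], ring: HashRing, prefix_len: int = 256, bucket_bits: int = 16) -> str:
    """Pick a pod from the LSH bucket of the prompt prefix (assumed scheme)."""
    sig = simhash(tokens[:prefix_len])
    bucket = sig >> (64 - bucket_bits)  # keep only the top bits as the bucket id
    return ring.lookup(h64(f"bucket:{bucket}"))


ring = HashRing(["pod-a", "pod-b", "pod-c"])
prompt = "summarize the following document about prefix caching".split()
print(route(prompt, ring))
```

The accuracy trade-off shows up in `bucket_bits`: fewer bits let more near-duplicate prefixes share a pod but also merge unrelated prompts, while the ring keeps most buckets on the same pod when pods are added or removed.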

Use Case

N/A

Proposed Solution

No response

gaocegege avatar Feb 14 '25 04:02 gaocegege

@varungup90 @DwyaneShi can you spend some time on this issue?

Jeffwan avatar Feb 14 '25 18:02 Jeffwan

It's just a proposal; I don't know if it helps in the chat use case. But it works well with long document QA. Ref https://github.com/vllm-project/production-stack/issues/59#issuecomment-2658633045

gaocegege avatar Feb 15 '25 05:02 gaocegege

I will spend some time implementing this as an alternative to the decision-tree or composite-metrics-based algorithms.

Jeffwan avatar Apr 07 '25 14:04 Jeffwan

Is our prefix cache implementation similar to cacheblend or epic? I ask because I saw that the text is chunked. I also saw other approaches like kvshare: https://arxiv.org/abs/2410.18517. Have we tried them before?

kerthcet avatar Apr 17 '25 03:04 kerthcet

@kerthcet It's a little bit different. Currently, the primary work is still on vLLM's automatic prefix cache, without additional KV cache compression or reuse capabilities. Our work is more on the routing side, so I would say those are orthogonal. kvshare is also a different, smart solution that shares KV cache across layers. Most of the above solutions have some side effects, so I think they're not in scope yet.

Jeffwan avatar Apr 20 '25 05:04 Jeffwan

v0.3.0 already has enough routing strategies added and improved:

  • Preble (Radix Tree + Prediction Based Load aware)
  • Fairness
  • Prefix Cache (Hashing Block) + heuristic Load aware

Due to limited bandwidth, I will move CHWBL (Consistent Hashing with Bounded Loads) to the next release.
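
For readers unfamiliar with CHWBL, a rough sketch of the general technique follows. It is not the planned AIBrix implementation. It assumes the usual formulation: each pod may hold at most `ceil(c * (total + 1) / n)` in-flight requests, and a request walks clockwise from its ring position until it finds a pod under that bound. The class name, `load_factor`, and the pod names are hypothetical.

```python
import bisect
import hashlib
import math


def h64(data: str) -> int:
    """Stable 64-bit hash (stdlib stand-in for xxhash)."""
    return int.from_bytes(hashlib.blake2b(data.encode(), digest_size=8).digest(), "big")


class BoundedLoadRing:
    """Consistent hashing with bounded loads over a set of pods."""

    def __init__(self, pods, vnodes: int = 64, load_factor: float = 1.25):
        self.load = {pod: 0 for pod in pods}  # in-flight requests per pod
        self.load_factor = load_factor        # the "c" in the bound, must be >= 1
        self._ring = sorted((h64(f"{pod}#{v}"), pod) for pod in pods for v in range(vnodes))
        self._keys = [k for k, _ in self._ring]

    def _capacity(self) -> int:
        total = sum(self.load.values()) + 1   # include the request being placed
        return math.ceil(self.load_factor * total / len(self.load))

    def acquire(self, key: str) -> str:
        """Assign a request to the first pod clockwise that is under the bound."""
        cap = self._capacity()
        start = bisect.bisect_right(self._keys, h64(key))
        for step in range(len(self._ring)):
            pod = self._ring[(start + step) % len(self._ring)][1]
            if self.load[pod] < cap:
                self.load[pod] += 1
                return pod
        # Not reachable when load_factor >= 1: some pod is always under the bound.
        raise RuntimeError("no pod under the load bound")

    def release(self, pod: str) -> None:
        """Call when a request finishes."""
        self.load[pod] -= 1


ring = BoundedLoadRing(["pod-a", "pod-b", "pod-c", "pod-d"])
for _ in range(100):
    ring.acquire("same-hot-prefix")  # a single hot key spills over to neighbors
print(ring.load)
```

With a prefix hash as the key, requests sharing a prefix stick to one pod while it has headroom and overflow to the next pods on the ring once it is saturated, which is the load-aware behavior the heuristic strategies above approximate.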

Jeffwan avatar May 09 '25 19:05 Jeffwan