Jiaxin Shan
Introducing tokenization also brings some complexity in tokenizer management (if we want every model to use its own tokenizer). We need to weigh the benefits and at least make this...
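To illustrate what that management overhead could look like, here is a minimal sketch of a per-model tokenizer cache; the `TokenizerRegistry` name and the lazy-loading policy are my assumptions, not an existing component:

```python
# Hypothetical sketch: lazily load and cache one tokenizer per model,
# so each model can use its own tokenizer without reloading per request.
from transformers import AutoTokenizer

class TokenizerRegistry:
    def __init__(self):
        self._tokenizers = {}  # model name -> tokenizer instance

    def get(self, model_name: str):
        # Load on first use, then reuse; keeping these in sync with the
        # deployed models is the management complexity mentioned above.
        if model_name not in self._tokenizers:
            self._tokenizers[model_name] = AutoTokenizer.from_pretrained(model_name)
        return self._tokenizers[model_name]

registry = TokenizerRegistry()
token_ids = registry.get("deepseek-ai/deepseek-llm-7b-chat").encode("hello world")
```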
@gaocegege Originally, I thought the token-based solution could be more aligned with the "page" tokens in vLLM, and chunk-by-chunk alignment would be tidier compared to two different...
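A minimal sketch of the chunk-by-chunk idea, assuming a fixed block size in the spirit of vLLM's KV-cache blocks; the `block_hashes` helper, the chained-hash scheme, and the block size of 16 are illustrative assumptions:

```python
import hashlib
from typing import List

BLOCK_SIZE = 16  # illustrative; aligned with a vLLM-style KV-cache block size

def block_hashes(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash token IDs chunk by chunk, chaining each block's hash with the
    previous one so each hash identifies the whole prefix up to that block."""
    hashes, prev = [], ""
    full = len(token_ids) - len(token_ids) % block_size  # drop the partial tail
    for i in range(0, full, block_size):
        chunk = token_ids[i:i + block_size]
        prev = hashlib.sha256((prev + ",".join(map(str, chunk))).encode()).hexdigest()
        hashes.append(prev)
    return hashes
```

Chaining the hashes means two requests share a hash exactly when they share the entire token prefix up to that block, which is what makes chunk-by-chunk alignment with the cache pages tidy.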
@varungup90 @DwyaneShi can you spend some time on this issue?
I will spend some time implementing this as an alternative to decision-tree or composite-metrics-based algorithms.
@kerthcet A little bit different. Currently, the primary work still relies on vLLM's automatic prefix caching, without an additional KV cache compressor or reuse capabilities. It is more on the routing side; I...
v0.3.0 has enough routing strategies invented and improved:
- Preble (radix tree + prediction-based load awareness)
- Fairness
- Prefix Cache (hashing block) + heuristic load awareness

Due to...
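As a rough illustration of the "Prefix Cache + heuristic load aware" combination from the list above, here is a sketch of one plausible routing rule; the `route` function, the `pod_cache`/`pod_load` structures, and the tiebreak heuristic are my assumptions, not the actual implementation:

```python
from typing import Dict, List, Set

def route(prefix_hashes: List[str],
          pod_cache: Dict[str, Set[str]],
          pod_load: Dict[str, int]) -> str:
    """Pick the pod caching the longest matching block-hash prefix;
    break ties (including zero matches) by routing to the least-loaded pod."""
    def match_len(pod: str) -> int:
        n = 0
        for h in prefix_hashes:
            if h not in pod_cache[pod]:
                break
            n += 1
        return n
    # Prefer more matched prefix blocks, then lower load.
    return max(pod_load, key=lambda p: (match_len(p), -pod_load[p]))

# Example: pod "a" caches the first block, pod "b" caches nothing but is idle.
print(route(["h1", "h2"], {"a": {"h1"}, "b": set()}, {"a": 5, "b": 0}))  # -> "a"
```

The load tiebreak keeps cold prefixes from piling onto a single hot pod while still exploiting cache hits when they exist.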
@a-mccarthy I will work on the blog once the enhancement PR is merged.
@rashansmith thanks for the feedback. The code PR has been merged and I am starting to draft the blog post today.
I made a draft and updated the PR here. It still needs some diagrams and content; I will try to finish it soon.
@rashansmith Sorry for the delay. I added the diagrams and polished some paragraphs, so this should be in good shape for review now. Please take a look.