Ke Bao

60 comments by Ke Bao

> The idea of "occupied blocks can't be evicted" breaks the relaxed FCFS scheduling and may lead to starvation.

@lzhangzz Preemption logic can be applied to solve the starvation problem....
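For illustration, a minimal sketch of such preemption logic (plain Python with hypothetical `Scheduler`, `waiting`, and `running` names, not the actual TurboMind scheduler): when the head of the FCFS queue cannot get enough blocks, the newest running requests are preempted and their blocks released, so long-lived cache holders cannot starve newcomers.

```python
from collections import deque

class Scheduler:
    """Toy FCFS scheduler with preemption (illustrative names only)."""

    def __init__(self, free_blocks: int):
        self.free_blocks = free_blocks
        self.waiting: deque = deque()   # (request_id, blocks_needed), FCFS order
        self.running: list = []         # scheduled requests, oldest first

    def schedule(self):
        # Only requests already running before this pass may be preempted,
        # so a single pass cannot ping-pong between two requests.
        preemptable = list(self.running)
        while self.waiting:
            req = self.waiting[0]
            # Free blocks by preempting the newest running requests first.
            while req[1] > self.free_blocks and preemptable:
                victim = preemptable.pop()
                self.running.remove(victim)
                self.free_blocks += victim[1]
                self.waiting.append(victim)   # victim retries in a later pass
            if req[1] > self.free_blocks:
                break                         # head still does not fit
            self.waiting.popleft()
            self.free_blocks -= req[1]
            self.running.append(req)
```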

@lzhangzz - The first one is expected. Currently we only cache and reuse fully computed blocks to avoid write conflicts. In this design, block reuse may have one iteration...
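A minimal sketch of the "only computed blocks are cached" rule described above (hypothetical `Block`/`ReusePool` names, not TurboMind's internals): a block becomes visible to other sequences only once all of its slots are written, so readers can never race with the writer.

```python
from dataclasses import dataclass

BLOCK_SIZE = 64

@dataclass
class Block:
    block_id: int
    num_computed: int = 0   # KV slots already written for this block
    ref_count: int = 0

    @property
    def is_full(self) -> bool:
        return self.num_computed == BLOCK_SIZE

class ReusePool:
    """Full blocks only: partially written blocks stay private to their writer."""

    def __init__(self):
        self._pool: dict = {}   # hash key -> fully computed Block

    def publish(self, key, block: Block) -> None:
        if block.is_full:       # never expose a block that is still being written
            self._pool[key] = block

    def lookup(self, key):
        block = self._pool.get(key)
        if block is not None:
            block.ref_count += 1   # shared, read-only from now on
        return block
```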

@lvhan028 1. Even though we do the check at the server level, the kernel crash issue still needs to be fixed.

```
RuntimeError: [TM][ERROR] CUDA runtime error: an illegal memory access...
```

> Will this proposal conflict with turbomind's stateful inference?

No conflict, I think. In stateful inference, the history cache is prioritized. The block hash will involve the entire prefix up to...

A sequence is composed of blocks, and the block is the smallest unit of reuse, so the smallest unit for cache management and prefix matching should be the block, not the token. So we...
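As a sketch of block-granularity matching (illustrative code, assuming a hypothetical `match_prefix` helper): the matched length is always a multiple of the block size, because a partially matching block cannot be reused.

```python
def match_prefix(cached: list, prompt: list, block_size: int) -> int:
    """Return how many prompt tokens are covered by fully matching blocks."""
    num_blocks = min(len(cached), len(prompt)) // block_size
    matched = 0
    for b in range(num_blocks):
        lo, hi = b * block_size, (b + 1) * block_size
        if cached[lo:hi] != prompt[lo:hi]:
            break
        matched += 1
    return matched * block_size

shared = list(range(100))              # a 100-token shared system prompt
a = shared + [1000, 1001, 1002]
b = shared + [2000, 2001, 2002]
print(match_prefix(a, b, 64))          # -> 64: token 100 falls mid-block
```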

For vllm, we tested prefix caching and found an almost 20% performance improvement on the ShareGPT dataset with manually added system prompts. So this feature may have great benefits...

In the newest Turbomind engine, the smallest `block_size` is `64`. The length of the prefix (system prompts) is usually `100~200` tokens. If we did block-level reuse (like the `Block Trie` mentioned above),...
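The arithmetic behind this concern, as a quick check (nothing engine-specific, just whole-block rounding):

```python
def reusable(prefix_len: int, block_size: int) -> int:
    # Only whole blocks can be reused; the tail partial block is recomputed.
    return (prefix_len // block_size) * block_size

for block_size in (64, 16):
    for prefix_len in (100, 150, 200):
        print(f"block_size={block_size}, prefix={prefix_len} "
              f"-> {reusable(prefix_len, block_size)} tokens reusable")
# block_size=64: 100 -> 64, 150 -> 128, 200 -> 192 (up to 63 tokens wasted)
# block_size=16: 100 -> 96, 150 -> 144, 200 -> 192 (at most 15 tokens wasted)
```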

> No need for that. It's possible to reduce the smallest block size to 16 in the future.

@lzhangzz When is this change planned for release?

> No need for that. It's possible to reduce the smallest block size to 16 in the future.

And what are the side effects of reducing the smallest block size? Will...

@grimoire We compared the prefix cache implementations of other projects:

- [vllm](https://github.com/vllm-project/vllm/issues/2614) - Hash Table
  - compute hash key for each block: `hash(prefix tokens, tokens in this block)`
  - block...
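For reference, a minimal sketch of the vllm-style hash key quoted above, `hash(prefix tokens, tokens in this block)`: chaining each block's key through the previous one means equal keys imply the entire prefix through that block matches. This is illustrative Python, not vllm's actual code.

```python
def block_keys(tokens: list, block_size: int) -> list:
    """One key per full block; each key folds in every preceding block."""
    keys, prev_key = [], None
    full_len = len(tokens) - len(tokens) % block_size
    for start in range(0, full_len, block_size):
        block = tuple(tokens[start:start + block_size])
        prev_key = hash((prev_key, block))  # hash(prefix tokens, tokens in this block)
        keys.append(prev_key)
    return keys

# cache maps block key -> physical block id; a lookup hit means the whole
# prefix through that block is already resident.
cache: dict = {}
```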