Ke Bao

60 comments by Ke Bao

> The idea of "occupied blocks can't be evicted" breaks the relaxed FCFS scheduling and may lead to starvation.

@lzhangzz Preemption logic can be applied to solve the starvation problem....
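For illustration, a minimal sketch of such preemption logic (plain Python with hypothetical `Scheduler`, `waiting`, and `running` names, not the actual TurboMind scheduler): when the head of the FCFS queue cannot get enough blocks, the newest running requests are preempted and their blocks released, so long-lived cache holders cannot starve newcomers.

```python
from collections import deque

class Scheduler:
    """Toy FCFS scheduler with preemption (illustrative names only)."""

    def __init__(self, free_blocks: int):
        self.free_blocks = free_blocks
        self.waiting: deque = deque()   # (request_id, blocks_needed), FCFS order
        self.running: list = []         # scheduled requests, oldest first

    def schedule(self):
        # Only requests already running before this pass may be preempted,
        # so a single pass cannot ping-pong between two requests.
        preemptable = list(self.running)
        while self.waiting:
            req = self.waiting[0]
            # Free blocks by preempting the newest running requests first.
            while req[1] > self.free_blocks and preemptable:
                victim = preemptable.pop()
                self.running.remove(victim)
                self.free_blocks += victim[1]
                self.waiting.append(victim)   # victim retries in a later pass
            if req[1] > self.free_blocks:
                break                         # head still does not fit
            self.waiting.popleft()
            self.free_blocks -= req[1]
            self.running.append(req)
```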

@lzhangzz - The first one is expected. Currently we only cache and reuse fully computed blocks to avoid write conflicts. In this design, block reuse may have one iteration...
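A minimal sketch of the "only computed blocks are cached" rule described above (hypothetical `Block`/`ReusePool` names, not TurboMind's internals): a block becomes visible to other sequences only once all of its slots are written, so readers can never race with the writer.

```python
from dataclasses import dataclass

BLOCK_SIZE = 64

@dataclass
class Block:
    block_id: int
    num_computed: int = 0   # KV slots already written for this block
    ref_count: int = 0

    @property
    def is_full(self) -> bool:
        return self.num_computed == BLOCK_SIZE

class ReusePool:
    """Full blocks only: partially written blocks stay private to their writer."""

    def __init__(self):
        self._pool: dict = {}   # hash key -> fully computed Block

    def publish(self, key, block: Block) -> None:
        if block.is_full:       # never expose a block that is still being written
            self._pool[key] = block

    def lookup(self, key):
        block = self._pool.get(key)
        if block is not None:
            block.ref_count += 1   # shared, read-only from now on
        return block
```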

@lvhan028 1. Even though we do the check at the server level, the kernel crash issue still needs to be fixed.

```
RuntimeError: [TM][ERROR] CUDA runtime error: an illegal memory access...
```

> Will this proposal conflict with turbomind's stateful inference?

No conflict, I think. In stateful inference, the history cache is prioritized. The block hash will involve the entire prefix up to...

A sequence is composed of blocks, and the block is the smallest unit of reuse, so the smallest unit for cache management and prefix matching should be the block, not the token. So we...
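As a sketch of block-granularity matching (illustrative code, assuming a hypothetical `match_prefix` helper): the matched length is always a multiple of the block size, because a partially matching block cannot be reused.

```python
def match_prefix(cached: list, prompt: list, block_size: int) -> int:
    """Return how many prompt tokens are covered by fully matching blocks."""
    num_blocks = min(len(cached), len(prompt)) // block_size
    matched = 0
    for b in range(num_blocks):
        lo, hi = b * block_size, (b + 1) * block_size
        if cached[lo:hi] != prompt[lo:hi]:
            break
        matched += 1
    return matched * block_size

shared = list(range(100))              # a 100-token shared system prompt
a = shared + [1000, 1001, 1002]
b = shared + [2000, 2001, 2002]
print(match_prefix(a, b, 64))          # -> 64: token 100 falls mid-block
```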

For vllm, we tested prefix caching and found an almost 20% performance improvement on the ShareGPT dataset with manually added system prompts. So this feature may have great benefits...

In the newest Turbomind engine, the smallest `block_size` is `64`. The length of the prefix (system prompts) is usually `100~200` tokens. If we did block-level reuse (like the `Block Trie` mentioned above),...
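The arithmetic behind this concern, as a quick check (nothing engine-specific, just whole-block rounding):

```python
def reusable(prefix_len: int, block_size: int) -> int:
    # Only whole blocks can be reused; the tail partial block is recomputed.
    return (prefix_len // block_size) * block_size

for block_size in (64, 16):
    for prefix_len in (100, 150, 200):
        print(f"block_size={block_size}, prefix={prefix_len} "
              f"-> {reusable(prefix_len, block_size)} tokens reusable")
# block_size=64: 100 -> 64, 150 -> 128, 200 -> 192 (up to 63 tokens wasted)
# block_size=16: 100 -> 96, 150 -> 144, 200 -> 192 (at most 15 tokens wasted)
```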

> No need for that. It's possible to reduce the smallest block size to 16 in the future.

@lzhangzz When is this change planned for release?

> No need for that. It's possible to reduce the smallest block size to 16 in the future.

And what are the side effects of reducing the smallest block size? Will...

@grimoire We compared the prefix cache implementations of other projects:

- [vllm](https://github.com/vllm-project/vllm/issues/2614) - Hash Table
  - compute hash key for each block: `hash(prefix tokens, tokens in this block)`
  - block...
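For reference, a minimal sketch of the vllm-style hash key quoted above, `hash(prefix tokens, tokens in this block)`: chaining each block's key through the previous one means equal keys imply the entire prefix through that block matches. This is illustrative Python, not vllm's actual code.

```python
def block_keys(tokens: list, block_size: int) -> list:
    """One key per full block; each key folds in every preceding block."""
    keys, prev_key = [], None
    full_len = len(tokens) - len(tokens) % block_size
    for start in range(0, full_len, block_size):
        block = tuple(tokens[start:start + block_size])
        prev_key = hash((prev_key, block))  # hash(prefix tokens, tokens in this block)
        keys.append(prev_key)
    return keys

# cache maps block key -> physical block id; a lookup hit means the whole
# prefix through that block is already resident.
cache: dict = {}
```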