Li Zhang

72 comments of Li Zhang

> In the current setting, only blocks with `use_count = 1` (only the block trie holds the `use_count`) can be evicted. That means no sequence is using this block. After eviction,...

And `use_count` only counts active sequences; sequences in interactive mode can still reference the same invalidated block. When re-computation happens later, those sequences will allocate and refill the shared blocks...
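
To make the counting rule concrete, here is a minimal Python sketch of the scheme described above. The names (`Block`, `BlockTrieSketch`, `evict_one`) are hypothetical and not the actual lmdeploy `BlockTrie` code; the point is only the invariant that the trie holds one reference and eviction requires `use_count == 1`.

```python
# Minimal sketch (hypothetical names, not the lmdeploy implementation).
# The block trie holds one reference to every cached block; each active
# sequence mapping the block adds one more. A block is evictable only
# when use_count == 1, i.e. only the trie still references it.
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    use_count: int = 1  # the trie's own reference
    valid: bool = True  # False after eviction


class BlockTrieSketch:
    def __init__(self) -> None:
        self.blocks: dict[int, Block] = {}

    def acquire(self, block_id: int) -> Block:
        """An active sequence starts using a cached block."""
        block = self.blocks[block_id]
        block.use_count += 1
        return block

    def release(self, block: Block) -> None:
        """An active sequence drops the block. Interactive-mode
        sequences may still remember the block id after this."""
        block.use_count -= 1

    def evict_one(self) -> Block | None:
        """Evict a block no active sequence is using.

        An interactive-mode sequence holding the id of an evicted block
        must detect `valid == False` and re-allocate & refill it when
        re-computation happens later.
        """
        for block in self.blocks.values():
            if block.use_count == 1:
                block.valid = False
                return block
        return None
```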

> preempt logic can be applied to solve the starvation problem. C has higher priority, so the blocks in later A_x will be preempted.

Preemption won't work because `BlockTrie` holds...

We still have some unresolved issues:

1. With a batch of sequences sharing previously unseen (or evicted) prefixes, neither computation nor cache blocks are shared.
2. When something in the...

@irexyc Please help to check that this does not break VLM when embedding inputs are present.

> Hi @lzhangzz @lvhan028 And do we have any plans to support [token attention](https://github.com/ModelTC/lightllm/blob/main/docs/TokenAttention.md) in TurboMind in the near future? Thanks.

No, it makes no sense to me.

TurboMind's stateful inference is built on top of block-level caching with an LRU eviction policy, so there is no conflict with prefix caching. The caching mechanism is implemented by...
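
For illustration, a minimal sketch of block-level caching with LRU eviction. `LRUBlockCache` and its methods are assumptions made for this example, not TurboMind's actual (C++) implementation.

```python
# Minimal sketch of block-level caching with an LRU eviction policy
# (illustrative only; TurboMind's real implementation is in C++).
from __future__ import annotations

from collections import OrderedDict


class LRUBlockCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # maps a block key (e.g. a token-id tuple) to a block id
        self.blocks: OrderedDict[tuple, int] = OrderedDict()

    def get(self, key: tuple) -> int | None:
        """Look up a cached block and mark it most-recently-used."""
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)
        return self.blocks[key]

    def put(self, key: tuple, block_id: int) -> None:
        """Insert a block, evicting the least-recently-used if full."""
        if key in self.blocks:
            self.blocks.move_to_end(key)
        elif len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the LRU block
        self.blocks[key] = block_id
```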

> Maybe we need to figure out a new solution to reuse the last partially matched block.

No need for that. It's possible to reduce the smallest block size to...
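
A quick sketch of why the smallest block size bounds reuse: matching happens at whole-block granularity, so the tail of a matched prefix that does not fill a block must be recomputed, and shrinking the block size shrinks that tail. `reusable_tokens` is a hypothetical helper written for this example, not part of lmdeploy.

```python
# Hypothetical helper: tokens covered by fully matched cache blocks.
# Only whole blocks are reused; the remainder is recomputed.
def reusable_tokens(matched_prefix_len: int, block_size: int) -> int:
    return (matched_prefix_len // block_size) * block_size


# With 100 matched tokens and 64-token blocks, only 64 tokens are
# reused and 36 recomputed; with 16-token blocks, 96 are reused and
# only 4 recomputed.
assert reusable_tokens(100, 64) == 64
assert reusable_tokens(100, 16) == 96
```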

> When is this change planned for release?

Likely in May.

> Will it affect the performance?

There may be a slight degradation in performance.