Li Zhang

72 comments of Li Zhang

> In the current setting, only blocks with `use_count = 1` (only the block trie holds the `use_count`) can be evicted. That means no sequence is using this block. After eviction,...

And `use_count` only counts active sequences; sequences in interactive mode can still reference the same invalidated block. When re-computation happens later, those sequences will allocate and refill the shared blocks...
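
To make the counting rule concrete, here is a minimal Python sketch of the scheme described above. The names (`Block`, `BlockTrieSketch`, `evict_one`) are hypothetical and not the actual lmdeploy `BlockTrie` code; the point is only the invariant that the trie holds one reference and eviction requires `use_count == 1`.

```python
# Minimal sketch (hypothetical names, not the lmdeploy implementation).
# The block trie holds one reference to every cached block; each active
# sequence mapping the block adds one more. A block is evictable only
# when use_count == 1, i.e. only the trie still references it.
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    use_count: int = 1  # the trie's own reference
    valid: bool = True  # False after eviction


class BlockTrieSketch:
    def __init__(self) -> None:
        self.blocks: dict[int, Block] = {}

    def acquire(self, block_id: int) -> Block:
        """An active sequence starts using a cached block."""
        block = self.blocks[block_id]
        block.use_count += 1
        return block

    def release(self, block: Block) -> None:
        """An active sequence drops the block. Interactive-mode
        sequences may still remember the block id after this."""
        block.use_count -= 1

    def evict_one(self) -> Block | None:
        """Evict a block no active sequence is using.

        An interactive-mode sequence holding the id of an evicted block
        must detect `valid == False` and re-allocate & refill it when
        re-computation happens later.
        """
        for block in self.blocks.values():
            if block.use_count == 1:
                block.valid = False
                return block
        return None
```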

> preempt logic can be applied to solve the starvation problem. C has higher priority, so the blocks in later A_x will be preempted.

Preemption won't work because `BlockTrie` holds...

We still have some unresolved issues:

1. With a batch of sequences sharing previously unseen (or evicted) prefixes, neither computation nor cache blocks are shared.
2. When something in the...

@irexyc Please help to check that this does not break VLM when embedding inputs are present.

> Hi @lzhangzz @lvhan028 And do we have any plans to support [token attention](https://github.com/ModelTC/lightllm/blob/main/docs/TokenAttention.md) in TurboMind in the near future? Thanks.

No, it makes no sense to me.

TurboMind's stateful inference is built on top of block-level caching with an LRU eviction policy, so there is no conflict with prefix caching. The caching mechanism is implemented by...
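
For illustration, a minimal sketch of block-level caching with LRU eviction. `LRUBlockCache` and its methods are assumptions made for this example, not TurboMind's actual (C++) implementation.

```python
# Minimal sketch of block-level caching with an LRU eviction policy
# (illustrative only; TurboMind's real implementation is in C++).
from __future__ import annotations

from collections import OrderedDict


class LRUBlockCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # maps a block key (e.g. a token-id tuple) to a block id
        self.blocks: OrderedDict[tuple, int] = OrderedDict()

    def get(self, key: tuple) -> int | None:
        """Look up a cached block and mark it most-recently-used."""
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)
        return self.blocks[key]

    def put(self, key: tuple, block_id: int) -> None:
        """Insert a block, evicting the least-recently-used if full."""
        if key in self.blocks:
            self.blocks.move_to_end(key)
        elif len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the LRU block
        self.blocks[key] = block_id
```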

> Maybe we need to figure out a new solution to reuse the last partially matched block.

No need for that. It's possible to reduce the smallest block size to...
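
A quick sketch of why the smallest block size bounds reuse: matching happens at whole-block granularity, so the tail of a matched prefix that does not fill a block must be recomputed, and shrinking the block size shrinks that tail. `reusable_tokens` is a hypothetical helper written for this example, not part of lmdeploy.

```python
# Hypothetical helper: tokens covered by fully matched cache blocks.
# Only whole blocks are reused; the remainder is recomputed.
def reusable_tokens(matched_prefix_len: int, block_size: int) -> int:
    return (matched_prefix_len // block_size) * block_size


# With 100 matched tokens and 64-token blocks, only 64 tokens are
# reused and 36 recomputed; with 16-token blocks, 96 are reused and
# only 4 recomputed.
assert reusable_tokens(100, 64) == 64
assert reusable_tokens(100, 16) == 96
```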

> When is this change planned for release?

Likely in May.

> Will it affect the performance?

There may be a slight degradation in performance.