
[Performance] How much throughput gain does Interactive Inference Mode contribute?

Open nullxjx opened this issue 1 year ago • 1 comment

📚 The doc issue

I see that one of the key features of lmdeploy is the "Interactive Inference Mode". As far as I know, most existing LLM serving frameworks only keep the KV cache within a single dialogue; they usually do not retain the cache across multiple dialogues. Keeping the KV cache across dialogues avoids a lot of repetitive self-attention computation and can therefore improve system throughput (say, tokens per second). So I'm quite interested in how much throughput gain the Interactive Inference Mode contributes.
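To make the question concrete, here is a minimal back-of-envelope sketch (plain Python, not lmdeploy code; the per-turn token counts are hypothetical) of the work a cross-dialogue KV cache saves: without it, every turn re-prefills the entire conversation history, while with it only the new prompt tokens go through prefill.

```python
# Sketch: prefill token counts over a multi-turn conversation,
# with and without a KV cache that persists across turns.

# Hypothetical per-turn token counts for a 5-turn conversation:
# (user prompt tokens, model reply tokens)
turns = [(200, 150), (80, 120), (60, 200), (90, 100), (70, 180)]

history = 0           # tokens already in the conversation
prefill_no_cache = 0  # tokens prefilled if the cache is dropped each turn
prefill_cached = 0    # tokens prefilled if the KV cache is kept alive

for prompt, reply in turns:
    # Without a cross-turn cache, the full history plus the new prompt
    # must be run through prefill again.
    prefill_no_cache += history + prompt
    # With the cache, only the new prompt tokens need prefill.
    prefill_cached += prompt
    history += prompt + reply

print(f"prefill tokens, no cross-turn cache: {prefill_no_cache}")
print(f"prefill tokens, cached:              {prefill_cached}")
print(f"saving: {1 - prefill_cached / prefill_no_cache:.0%}")
```

With these assumed token counts, the cached mode prefills about 84% fewer tokens over five turns; the actual end-to-end throughput gain depends on how much of the workload is prefill versus decode.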

Suggest a potential alternative/fix

No response

nullxjx avatar Oct 17 '23 02:10 nullxjx

I'm also interested in this. Glad if someone could help answer.

franklyd avatar Jan 10 '24 14:01 franklyd