
[Performance] How much throughput gain does Interactive Inference Mode contribute?

Open nullxjx opened this issue 1 year ago • 1 comment

📚 The doc issue

I see that one of the key features of lmdeploy is the "Interactive Inference Mode". As far as I know, most existing LLM serving frameworks only keep the KV cache within a single dialogue; they usually do not retain the cache across multiple dialogues. Keeping the KV cache across dialogues avoids a lot of repetitive self-attention computation and can therefore improve system throughput (say, tokens per second). So I'm quite interested in how much throughput gain the Interactive Inference Mode contributes.
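To make the question concrete, here is a minimal back-of-envelope sketch (plain Python, not lmdeploy code; the per-turn token counts are hypothetical) of the work a cross-dialogue KV cache saves: without it, every turn re-prefills the entire conversation history, while with it only the new prompt tokens go through prefill.

```python
# Sketch: prefill token counts over a multi-turn conversation,
# with and without a KV cache that persists across turns.

# Hypothetical per-turn token counts for a 5-turn conversation:
# (user prompt tokens, model reply tokens)
turns = [(200, 150), (80, 120), (60, 200), (90, 100), (70, 180)]

history = 0           # tokens already in the conversation
prefill_no_cache = 0  # tokens prefilled if the cache is dropped each turn
prefill_cached = 0    # tokens prefilled if the KV cache is kept alive

for prompt, reply in turns:
    # Without a cross-turn cache, the full history plus the new prompt
    # must be run through prefill again.
    prefill_no_cache += history + prompt
    # With the cache, only the new prompt tokens need prefill.
    prefill_cached += prompt
    history += prompt + reply

print(f"prefill tokens, no cross-turn cache: {prefill_no_cache}")
print(f"prefill tokens, cached:              {prefill_cached}")
print(f"saving: {1 - prefill_cached / prefill_no_cache:.0%}")
```

With these assumed token counts, the cached mode prefills about 84% fewer tokens over five turns; the actual end-to-end throughput gain depends on how much of the workload is prefill versus decode.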

Suggest a potential alternative/fix

No response

nullxjx avatar Oct 17 '23 02:10 nullxjx

I'm also interested in this. Glad if someone could help answer.

franklyd avatar Jan 10 '24 14:01 franklyd