
Add prefix cache stats to usage

ispobock opened this pull request 1 year ago • 10 comments

Motivation

https://github.com/InternLM/lmdeploy/issues/1942

Modification

Log prefix cache statistics.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

ispobock avatar Jul 13 '24 09:07 ispobock

@zhulinJulia24 The unit-test and pr_ete_test jobs failed because of "No space left on device". Could you help fix this? Thanks.

zhyncs avatar Jul 13 '24 10:07 zhyncs

> @zhulinJulia24 The unit-test and pr_ete_test jobs failed because of "No space left on device". Could you help fix this? Thanks.

cc @zhulinJulia24

zhyncs avatar Jul 15 '24 08:07 zhyncs

/usr/bin/ld: final link failed: No space left on device

zhyncs avatar Jul 15 '24 08:07 zhyncs

@lvhan028 @AllentDan Could you help review this PR? Do you have any suggestions for the API changes mentioned in https://github.com/InternLM/lmdeploy/pull/2018#discussion_r1677097726?

ispobock avatar Jul 21 '24 03:07 ispobock

Hi @ispobock, I don't think adding logs meets @josephrocca's requirement. LMDeploy returns usage to the client when generation is completed, including "prompt_tokens", "completion_tokens", and "total_tokens". Perhaps we can extend the output tensors by adding statistics about prefix caching:

std::unordered_map<std::string, ft::Tensor> output_tensors = std::unordered_map<std::string, ft::Tensor>{
      {"output_ids",
       ft::Tensor{ft::MEMORY_CPU,
                  ft::TYPE_UINT32,
                  std::vector<size_t>{request_batch_size, beam_width, (size_t)instance_->session_len},
                  d_output_ids_}},
      {"sequence_length",
       ft::Tensor{ft::MEMORY_CPU,
                  ft::TYPE_UINT32,
                  std::vector<size_t>{request_batch_size, beam_width},
                  d_sequence_lengths_}}};
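
As a minimal sketch of that extension (the "prefix_cached_tokens" key and the d_prefix_cached_tokens_ buffer are hypothetical names, not taken from this PR), one more CPU tensor could be appended to the same map:

// Hypothetical sketch: expose the number of prompt tokens served from the
// prefix cache as an extra output tensor, one count per request in the batch.
output_tensors.insert(
    {"prefix_cached_tokens",
     ft::Tensor{ft::MEMORY_CPU,
                ft::TYPE_UINT32,
                std::vector<size_t>{request_batch_size},
                d_prefix_cached_tokens_}});

The serving layer could then copy that count into the response's usage object, which is what the example below reports as prefix_cached_tokens.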

lvhan028 avatar Jul 22 '24 10:07 lvhan028

@lvhan028 @josephrocca The prefix_cached_tokens field has been added to usage (see the example below). Please help check and review.

server:

lmdeploy serve api_server /workdir/llama2_13b_chat --server-name 0.0.0.0 --server-port 23333 --tp 1 --enable-prefix-caching

request:

curl -X POST http://0.0.0.0:23333/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/workdir/llama2_13b_chat",
"prompt": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you do not know the answer to a question, please do not share false information. Please introduce yourself.",
"max_tokens": 50,
"temperature": 1
}'

response:

{"id":"4","object":"text_completion","created":1721867570,"model":"/workdir/llama2_13b_chat","choices":[{"index":0,"text":"\n\n Hello! I'm just an AI, I'm here to help answer any questions you may have. My name is LLaMA, and I'm here to assist you in a helpful, respectful, and honest manner.","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":117,"total_tokens":168,"completion_tokens":51,"prefix_cached_tokens":64}}

ispobock avatar Jul 25 '24 00:07 ispobock

That's great @ispobock! Thank you! I use stream:true for my use case - is the same usage data added to the final event stream JSON line too, like with OpenAI's API?

josephrocca avatar Jul 25 '24 13:07 josephrocca

@josephrocca Usage has been added to the final stream response, please check again.
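
For illustration only (not captured output; the values simply mirror the non-streaming example above), the final server-sent event would then carry the same usage object before the [DONE] sentinel:

data: {"id":"4","object":"text_completion","created":1721867570,"model":"/workdir/llama2_13b_chat","choices":[{"index":0,"text":"","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":117,"total_tokens":168,"completion_tokens":51,"prefix_cached_tokens":64}}

data: [DONE]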

ispobock avatar Jul 25 '24 15:07 ispobock

Hi all, given we are at a breaking point (#2090), I am going to postpone this PR until #2090 is merged. I am sorry for the inconvenience.

lvhan028 avatar Aug 09 '24 06:08 lvhan028

> Hi all, given we are at a breaking point (#2090), I am going to postpone this PR until #2090 is merged. I am sorry for the inconvenience.

Is there a plan to release this?

akai-shuuichi avatar Sep 03 '24 09:09 akai-shuuichi

Hi @lvhan028, sorry for the delay. This PR is ready for review.

ispobock avatar Nov 26 '24 02:11 ispobock