Add prefix cache stats to usage
Motivation
https://github.com/InternLM/lmdeploy/issues/1942
Modification
Log prefix cache statistics.
Checklist
- Pre-commit or other linting tools are used to fix the potential lint issues.
- The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
- If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
- The documentation has been modified accordingly, like docstring or example tutorials.
@zhulinJulia24 The unit-test and pr_ete_test jobs failed because of "No space left on device". Could you help fix this? Thanks.
cc @zhulinJulia24
/usr/bin/ld: final link failed: No space left on device
@lvhan028 @AllentDan Could you help review this PR? Do you have any suggestions for the API changes mentioned in https://github.com/InternLM/lmdeploy/pull/2018#discussion_r1677097726?
Hi, @ispobock
I don't think adding logs meets @josephrocca's requirement.
LMDeploy returns usage to the client when generation completes, including "prompt_tokens", "completion_tokens", and "total_tokens".
Perhaps we can extend the output tensors by adding statistics about prefix caching:
std::unordered_map<std::string, ft::Tensor> output_tensors = std::unordered_map<std::string, ft::Tensor>{
{"output_ids",
ft::Tensor{ft::MEMORY_CPU,
ft::TYPE_UINT32,
std::vector<size_t>{request_batch_size, beam_width, (size_t)instance_->session_len},
d_output_ids_}},
{"sequence_length",
ft::Tensor{ft::MEMORY_CPU,
ft::TYPE_UINT32,
std::vector<size_t>{request_batch_size, beam_width},
d_sequence_lengths_}}};
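If the engine exposed such a statistic through the output tensors, the serving layer could then fold it into the usage payload along these lines. This is a hypothetical Python sketch, not lmdeploy's actual serving code; the "prefix_cached_tokens" output name and the build_usage helper are illustrative assumptions.

import torch  # not required; plain ints/dicts are enough for the sketch

def build_usage(outputs: dict, prompt_tokens: int, completion_tokens: int) -> dict:
    """Assemble the OpenAI-style usage dict from generation results."""
    usage = {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    cached = outputs.get("prefix_cached_tokens")  # assumed engine-side statistic
    if cached is not None:
        usage["prefix_cached_tokens"] = int(cached)
    return usage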
@lvhan028 @josephrocca prefix_cached_tokens has been added to the usage (see the example below). Please help check and review.
server:
lmdeploy serve api_server /workdir/llama2_13b_chat --server-name 0.0.0.0 --server-port 23333 --tp 1 --enable-prefix-caching
request:
curl -X POST http://0.0.0.0:23333/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/workdir/llama2_13b_chat",
"prompt": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you do not know the answer to a question, please do not share false information. Please introduce yourself.",
"max_tokens": 50,
"temperature": 1
}'
response:
{"id":"4","object":"text_completion","created":1721867570,"model":"/workdir/llama2_13b_chat","choices":[{"index":0,"text":"\n\n Hello! I'm just an AI, I'm here to help answer any questions you may have. My name is LLaMA, and I'm here to assist you in a helpful, respectful, and honest manner.","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":117,"total_tokens":168,"completion_tokens":51,"prefix_cached_tokens":64}}
That's great @ispobock! Thank you! I use stream:true for my use case - is the same usage data added to the final event stream JSON line too, like with OpenAI's API?
@josephrocca Added usage to the final stream response, please check again.
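For the streaming case, a client could pick the field up from whichever chunk carries usage, along these lines. This is a sketch assuming the usual OpenAI-style SSE framing ("data: {...}" lines terminated by "data: [DONE]"); per the change above, usage arrives with the final chunk.

import json
import requests

with requests.post(
    "http://0.0.0.0:23333/v1/completions",
    json={
        "model": "/workdir/llama2_13b_chat",
        "prompt": "Please introduce yourself.",  # prompt shortened for brevity
        "max_tokens": 50,
        "stream": True,
    },
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue
        payload = line[len(b"data:"):].strip()
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            print(chunk["usage"].get("prefix_cached_tokens"))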
Hi all, given that we are at a breaking point with #2090, I am going to postpone this PR until #2090 is merged. I am sorry for the inconvenience.
Is there a plan to release this?
Hi @lvhan028, sorry for the delay. This PR is ready for review.