sglang
Returning a per-request metric for the number of cached_tokens read
Motivation
Right now sglang uses an advanced radix cache system, but there is no way to know, for a given request, how many of its tokens were computed and how many were read from the cache. This pull request offers an easy way to return this information.
Modifications
The most important edit is to the `UsageInfo` structure, adding `cached_tokens` in `prompt_tokens_details`, the same way OpenAI does it. To limit the impact on current users, this is disabled by default and needs to be activated using the `--activate-cache-report` server parameter.
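As a rough client-side illustration, the sketch below reads the new field from a `usage` object shaped like the one described above. The helper name `cached_prompt_tokens` and the sample payloads are hypothetical, and the assumption that `prompt_tokens_details` is missing or null when the report is disabled is mine, not something confirmed by this PR.

```python
from typing import Any, Mapping


def cached_prompt_tokens(usage: Mapping[str, Any]) -> int:
    """Return usage["prompt_tokens_details"]["cached_tokens"], or 0 if absent.

    Hypothetical helper: it tolerates a missing or null
    `prompt_tokens_details`, which is assumed to be the case when the
    cache report is not activated on the server.
    """
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


# Illustrative payloads only; the numbers are made up.
usage_with_report = {
    "prompt_tokens": 120,
    "completion_tokens": 16,
    "total_tokens": 136,
    "prompt_tokens_details": {"cached_tokens": 96},
}
usage_without_report = {
    "prompt_tokens": 120,
    "completion_tokens": 16,
    "total_tokens": 136,
    "prompt_tokens_details": None,
}

print(cached_prompt_tokens(usage_with_report))
print(cached_prompt_tokens(usage_without_report))
```

Falling back to 0 keeps existing clients working whether or not the server was started with the report activated.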
For sglang `/generate` requests, the number of cached tokens is always returned in the `meta_info` dictionary.
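One possible use of this value is computing a per-request cache hit ratio. The sketch below assumes the `meta_info` dictionary exposes the counts under the keys `cached_tokens` and `prompt_tokens`; those key names, the helper `cache_hit_ratio`, and the sample values are assumptions for illustration and should be checked against the actual response.

```python
from typing import Any, Mapping


def cache_hit_ratio(meta_info: Mapping[str, Any]) -> float:
    """Fraction of prompt tokens served from the cache.

    Assumes (not confirmed by the PR text) that `meta_info` carries
    integer counts under "prompt_tokens" and "cached_tokens".
    """
    prompt_tokens = int(meta_info.get("prompt_tokens", 0))
    cached_tokens = int(meta_info.get("cached_tokens", 0))
    if prompt_tokens <= 0:
        # Avoid dividing by zero when the prompt length is unknown.
        return 0.0
    return cached_tokens / prompt_tokens


# Illustrative meta_info; the numbers are made up.
sample_meta_info = {"prompt_tokens": 200, "cached_tokens": 150}
print(cache_hit_ratio(sample_meta_info))
```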
I'm not really sure about the extra server parameter, which is not implemented very elegantly. I'm open to suggestions to improve it. It could also be removed entirely, with the report activated by default, to avoid adding yet another server argument.
Checklist
- [x] Format your code according to the Contributor Guide.
- [x] Add unit tests as outlined in the Contributor Guide.
- [x] Update documentation as needed, including docstrings or example tutorials.