Returning a per-request metric for the number of cached_tokens read

Open havetc opened this issue 4 months ago • 0 comments

Motivation

Right now sglang uses an advanced radix cache system, but there is no way to know, for a given request, how many of its tokens were computed and how many were read from the cache. This pull request offers a way to easily return this information.

Modifications

The most important edit is to the UsageInfo structure: it adds cached_tokens inside prompt_tokens_details, the same way OpenAI does it. To limit the impact on current users, this is disabled by default and needs to be activated with the --activate-cache-report server parameter. For sglang /generate requests, the number of cached tokens is always returned in the meta_info dictionary.
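To make the response shape concrete, here is a minimal sketch of the usage payload described above. It is not the code from this PR: the dataclasses stand in for the real UsageInfo model, and `build_usage` and its `cache_report` argument are hypothetical names used only for illustration. The nesting of `cached_tokens` under `prompt_tokens_details` follows the OpenAI usage format.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PromptTokensDetails:
    # Number of prompt tokens served from the cache instead of being recomputed.
    cached_tokens: int


@dataclass
class UsageInfo:
    # Illustrative stand-in for the real UsageInfo structure, not the PR's code.
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    # Stays None unless cache reporting is switched on.
    prompt_tokens_details: Optional[PromptTokensDetails] = None


def build_usage(prompt_tokens: int, completion_tokens: int,
                cached_tokens: int, cache_report: bool = False) -> UsageInfo:
    """Hypothetical helper: attach cached_tokens only when reporting is enabled."""
    details = PromptTokensDetails(cached_tokens) if cache_report else None
    return UsageInfo(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
        prompt_tokens_details=details,
    )


if __name__ == "__main__":
    # Default behaviour: no cache details in the usage block.
    print(asdict(build_usage(100, 20, 64)))
    # With reporting enabled, cached_tokens appears under prompt_tokens_details.
    print(asdict(build_usage(100, 20, 64, cache_report=True)))
```

Leaving `prompt_tokens_details` unset by default mirrors the opt-in behaviour described above: existing clients see the same usage block unless the server is started with cache reporting enabled.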

I'm not really sure about the extra server parameter, which is not implemented very elegantly, and I'm open to suggestions for improving it. It could also be removed entirely, with the report activated by default (to avoid adding yet another server argument).

Checklist

  • [x] Format your code according to the Contributor Guide.
  • [x] Add unit tests as outlined in the Contributor Guide.
  • [x] Update documentation as needed, including docstrings or example tutorials.

havetc, Oct 07 '24 10:10