
[RFC]: More KVCache metrics in both master/client side

Open Liziqi-77 opened this issue 2 months ago • 2 comments

Changes proposed

Motivation

As mentioned in "How did SGLang HiCache with Mooncake Backend calculate cache hit ratio", Mooncake already has metrics that record some performance information, but it lacks the data needed to calculate the cache hit rate. We would like to offer some preliminary ideas.

  • Before discussing the solution, I'd like to confirm a few points to make sure I understand correctly. Feel free to point out any errors.
  1. The cache hit rate in the graph in the paper refers to the entire system. For example, with GPU+CPU+Mooncake, the number of tokens hit across these three components is divided by the total number of tokens in the prompt; it is not the hit rate in Mooncake alone.

  2. For the inference/serving system, the prefix cache lookup first searches L1 (GPU). If no match is found, it searches L2 (CPU), then L3 (Mooncake). Therefore, the framework can calculate the overall cache hit rate using data from these three components. The formula is as follows:

    T = sum(hit rate of each request) / total number of requests
    

Where T represents the system's cache hit rate.
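
To make the formula concrete, here is a minimal sketch of how an upper-level framework might compute T, assuming it records per-request token hit counts for each tier. The class and field names (`RequestHits`, `l1_gpu_hits`, and so on) are hypothetical and are not part of Mooncake or SGLang.

```python
from dataclasses import dataclass


@dataclass
class RequestHits:
    """Per-request token counts. All field names are hypothetical."""
    prompt_tokens: int     # total tokens in the prompt
    l1_gpu_hits: int       # tokens matched in the GPU prefix cache (L1)
    l2_cpu_hits: int       # tokens matched in the CPU cache (L2)
    l3_mooncake_hits: int  # tokens matched in Mooncake (L3)

    def hit_rate(self) -> float:
        """Tokens hit across all three tiers divided by prompt tokens."""
        if self.prompt_tokens == 0:
            return 0.0
        hits = self.l1_gpu_hits + self.l2_cpu_hits + self.l3_mooncake_hits
        return hits / self.prompt_tokens


def system_hit_rate(requests: list[RequestHits]) -> float:
    """T = sum(hit rate of each request) / total number of requests."""
    if not requests:
        return 0.0
    return sum(r.hit_rate() for r in requests) / len(requests)


requests = [
    RequestHits(prompt_tokens=100, l1_gpu_hits=40, l2_cpu_hits=20, l3_mooncake_hits=20),
    RequestHits(prompt_tokens=200, l1_gpu_hits=0, l2_cpu_hits=0, l3_mooncake_hits=100),
]
# Per-request rates are 80/100 and 100/200, so T is their average.
print(system_hit_rate(requests))
```

Note that this averages per-request rates, as in the formula above, so a short prompt weighs as much as a long one. Dividing total hit tokens by total prompt tokens would be a different, token-weighted metric.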

Proposed Change

We designed a unified interface for the upper layer that adds two values to the metrics: the number of tokens hit in Mooncake and the total number of tokens in the KV pool. The solution is as follows:

  • Event-driven: Every operation on Mooncake triggers an update to the metrics. The implementation details are omitted here.

  • The upper-level framework uses the information exposed by the interface to calculate the cache hit rate.
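
Since the implementation is not spelled out, the following is only an illustrative sketch of what event-driven counters could look like. Everything here (`KVCacheMetrics`, `on_get`, `on_put`, `on_evict`, `snapshot`) is an assumed name for discussion and does not reflect Mooncake's actual API.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class KVCacheMetricsSnapshot:
    tokens_hit: int      # tokens served from Mooncake so far
    tokens_in_pool: int  # tokens currently stored in the KV pool


class KVCacheMetrics:
    """Hypothetical event-driven counters; not Mooncake's real interface."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tokens_hit = 0
        self._tokens_in_pool = 0

    def on_get(self, tokens_found: int) -> None:
        # Called after a lookup; tokens_found is 0 on a miss.
        with self._lock:
            self._tokens_hit += tokens_found

    def on_put(self, tokens_stored: int) -> None:
        # Called after new KV cache entries are written to the pool.
        with self._lock:
            self._tokens_in_pool += tokens_stored

    def on_evict(self, tokens_removed: int) -> None:
        # Called after entries are evicted or removed from the pool.
        with self._lock:
            self._tokens_in_pool = max(0, self._tokens_in_pool - tokens_removed)

    def snapshot(self) -> KVCacheMetricsSnapshot:
        # The unified read interface the upper layer would poll.
        with self._lock:
            return KVCacheMetricsSnapshot(self._tokens_hit, self._tokens_in_pool)


metrics = KVCacheMetrics()
metrics.on_put(512)
metrics.on_get(256)
metrics.on_evict(128)
snap = metrics.snapshot()
print(snap.tokens_hit, snap.tokens_in_pool)
```

The upper-level framework would combine `tokens_hit` from such a snapshot with its own L1/L2 hit counts and prompt lengths to compute the system-wide rate.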

Welcome to discuss!

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues and read the documentation

Liziqi-77, Oct 27 '25 12:10

  1. The cache hit rate in the graph in the paper refers to the entire system. For example, with GPU+CPU+Mooncake, the number of tokens hit across these three components is divided by the total number of tokens in the prompt; it is not the hit rate in Mooncake alone.

Could you confirm the definition of this metric in the Mooncake paper? @chestnut-Q

  2. For the inference/serving system, the prefix cache lookup first searches L1 (GPU). If no match is found, it searches L2 (CPU), then L3 (Mooncake). Therefore, the framework can calculate the overall cache hit rate using data from these three components. The formula is as follows:

    T = sum(hit rate of each request) / total number of requests

Where T represents the system's cache hit rate.

That's correct. An inference instance checks the KVCache in L1-to-L3 order.

Proposed Change

We designed a unified interface for the upper layer that adds two values to the metrics: the number of tokens hit in Mooncake and the total number of tokens in the KV pool. The solution is as follows:

  • Event-driven: Every operation on Mooncake triggers an update to the metrics. The implementation details are omitted here.
  • The upper-level framework uses the information exposed by the interface to calculate the cache hit rate.

Mooncake already supports some metrics, so you can add the cache hit ratio as a new API. It can then be integrated into SGLang.


stmatengss, Oct 28 '25 07:10

The cache hit rate in the graph in the paper refers to the entire system. For example, with GPU+CPU+Mooncake, the number of tokens hit across these three components is divided by the total number of tokens in the prompt; it is not the hit rate in Mooncake alone.

Yes, that's right.

chestnut-Q, Oct 28 '25 07:10