
[NEW] Introduce slot-level memory metrics

kyle-yh-kim opened this issue on Jul 31 '24 · 4 comments

This is a continuation of https://github.com/valkey-io/valkey/issues/20, for memory metrics.

The problem/use-case that the feature addresses

Through per-slot memory metrics, Valkey cluster users will be able to re-shard their cluster based on balanced memory usage. Specifically, when scaling in, per-slot memory metrics can be consulted to ensure that the outgoing slots would "fit" on the target shard.

Description of the feature

Tracks per-slot memory metrics.


Alternatives you've considered

The per-slot memory problem can be broken down into two high-level design candidates: 1) a cron job, and 2) amortized calculation per mutation, each with its low-level design candidates listed below.

1. Cron job

Upon user’s request (ex: through a new Valkey command SCAN-MEMORY), initialize a cron-job that scans over Valkey’s per-slot dictionary to calculate per-slot memory consumption.

Pros

  • [For Options 1A and 1C] Less invasive. The scan infrastructure and objectComputeSize() are already implemented; we would just have to build on top of them.

Cons

Contextually, a cron job does not fit well with the use cases for per-slot memory statistics. Consider a user who needs to scale in and out immediately. They would be required to first request a cron job, then wait for it to finish (with no upper bound), and the result would be partially stale, and thus no longer accurate, due to the scan's iterative nature. In summary:

  • The user needs to wait for the scan to complete (no upper bound).
  • The data (especially for those slots iterated earlier) will be stale and less accurate.
  • [For Option 1B] Invasive changes due to multi-threading considerations in the existing scan infrastructure.

The following are the cron job's low-level candidates:

1A. [Cron-job] Main thread.

Similar to how it is done with serverCron() and many others. Since serverCron() runs only 10 times per second by default (hz = 10), fast completion cannot be expected on the main thread.
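For illustration, a minimal sketch of what a budgeted main-thread scan tick could look like; slot_dict_scan(), the scan_state layout, and the budget constant are hypothetical stand-ins for a dictScan()/objectComputeSize()-style iterator, not actual Valkey APIs:

```c
/* Hypothetical sketch of option 1A: each serverCron() tick processes a
 * bounded batch of keys so the main thread is never blocked for long.
 * slot_dict_scan() is a stand-in for a dictScan()-style iterator that
 * reports each visited key's size (as objectComputeSize() would). */
#include <stddef.h>
#include <stdint.h>

#define NUM_SLOTS 16384
#define KEYS_PER_TICK 128 /* scan budget per cron invocation */

typedef struct scan_state {
    int slot;                /* slot currently being scanned */
    uint64_t cursor;         /* resume cursor within that slot's dict */
    size_t usage[NUM_SLOTS]; /* partially computed per-slot results */
    int done;
} scan_state;

/* Advances the cursor within a slot, writing the visited key's size to
 * *key_bytes; returns 0 once the slot's dict is exhausted. (Stub.) */
extern uint64_t slot_dict_scan(int slot, uint64_t cursor, size_t *key_bytes);

/* Called once per cron tick; returns 1 when the full scan has completed. */
int slot_memory_scan_tick(scan_state *st) {
    for (int budget = KEYS_PER_TICK; budget > 0 && !st->done; budget--) {
        size_t key_bytes = 0;
        st->cursor = slot_dict_scan(st->slot, st->cursor, &key_bytes);
        st->usage[st->slot] += key_bytes;
        if (st->cursor == 0 && ++st->slot == NUM_SLOTS)
            st->done = 1;
    }
    return st->done;
}
```

At the default hz of 10, a budget like this caps progress at roughly a thousand keys per second, which illustrates why the main-thread variant is slow for large keyspaces.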

1B. [Cron-job] Background thread.

Spawns a background thread instead. This will require multi-threaded support in the existing scan infrastructure, which comes at high implementation complexity and performance overhead due to locking.

1C. [Cron-job] Fork process.

Simply fork the process, which is the fastest and easiest implementation. However, it will incur at worst a 2x copy-on-write memory overhead, which is not ideal for users seeking to scale in / out immediately.

2. Amortized

Calculates the difference in memory usage on every memory mutation, thereby amortizing the cost. CLUSTER SLOT-STATS will simply return values from the array in which the calculation has already been performed.

Pros

  • Immediate response. Simply return the already computed values.
  • Freshness / liveness and accuracy of the data.

Cons

  • More invasive. Requires intricate changes to the memory-sparse data structures (ex: dict.c, quicklist.c, etc.; see Table A1).
  • [Only for Option 2A] Incurs an 8-byte memory overhead per key from the new size_t field used to track the key-space size.

The following are the amortized approach's low-level candidates:

2A. [Amortized] Track each key-space size per-mutation.

At every key-space mutation, track the key's size under each memory-sparse data structure (ex: dict, quicklist) via a newly introduced size_t field. The difference is then aggregated at the per-slot level through hooks such as lookupKey() and afterCommand(). The size_t field is not necessary for memory-contiguous data structures (ex: sds, listpack), as their size is already captured in the header.
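To make this concrete, here is a minimal sketch of option 2A, assuming a hypothetical alloc_bytes field and illustrative helper names; in the real code base, the delta would be published from hooks like lookupKey() and afterCommand():

```c
/* Hypothetical sketch of option 2A: each memory-sparse structure keeps
 * its own byte count in a new size_t field, and the command path
 * publishes the delta to a per-slot counter. All names are illustrative. */
#include <stddef.h>

#define NUM_SLOTS 16384

static size_t slot_memory_bytes[NUM_SLOTS]; /* global per-slot accumulator */

typedef struct dict {
    size_t alloc_bytes; /* the new size_t field proposed by option 2A */
    /* ... existing dict fields ... */
} dict;

/* The structure's internal alloc/free paths keep alloc_bytes current. */
static inline void dict_track_alloc(dict *d, size_t n) { d->alloc_bytes += n; }
static inline void dict_track_free(dict *d, size_t n)  { d->alloc_bytes -= n; }

/* Conceptually what the lookupKey()/afterCommand() hooks would do:
 * snapshot the size before the mutation, then publish the delta after.
 * Unsigned wraparound keeps the sum correct even when the mutation
 * frees memory (the delta wraps, and the addition wraps it back). */
void slot_account_mutation(int slot, dict *d, void (*mutate)(dict *)) {
    size_t before = d->alloc_bytes;
    mutate(d);
    slot_memory_bytes[slot] += d->alloc_bytes - before;
}
```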

2B. [Amortized] Use zmalloc intention and windows.

Separate all existing zmalloc calls into three intentions: 1) transient (temporary buffers), 2) internal (internal book-keeping of the Valkey server), and 3) user-data (where the user's data is actually stored). Then, for every zmalloc call, specify its intention among the three options. Only user-data allocations are accumulated into the global per-slot counter.
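A minimal sketch of what the tagged allocator could look like; the enum values, the zmalloc_tagged wrapper, and the current_slot variable are all illustrative assumptions, not Valkey's actual zmalloc API:

```c
/* Hypothetical sketch of option 2B: every allocation call site declares
 * an intention, and only USER_DATA allocations feed the per-slot counter. */
#include <stdlib.h>
#include <stddef.h>

typedef enum {
    ALLOC_TRANSIENT, /* temporary buffers: never counted */
    ALLOC_INTERNAL,  /* server book-keeping: never counted */
    ALLOC_USER_DATA  /* user key/value payloads: counted per slot */
} alloc_intention;

#define NUM_SLOTS 16384
static size_t slot_memory_bytes[NUM_SLOTS];
static int current_slot = -1; /* set by the command path before a mutation */

void *zmalloc_tagged(size_t size, alloc_intention intent) {
    void *ptr = malloc(size); /* real code would go through zmalloc */
    if (ptr && intent == ALLOC_USER_DATA && current_slot >= 0)
        slot_memory_bytes[current_slot] += size;
    return ptr;
}
```

The main cost of this option is churn: every existing zmalloc call site has to be audited and tagged with one of the three intentions.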

2C. [Amortized] Tagging user-defined data-structure.

An improved version of Option 2B. For data structures, we really only need to differentiate the user-data intention from the rest, by holding a new bit, is_user_data, in the structure's header (ex: quicklist, sdshdr). If the bit is set, all of the structure's zmalloc calls will increment / decrement a global per-slot counter.
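Sketched below under the same assumptions as the option 2B example, with quicklist as the tagged structure; the bit and function names are illustrative:

```c
/* Hypothetical sketch of option 2C: one header bit marks a structure as
 * user data, so only tagged structures need per-slot accounting. */
#include <stdlib.h>
#include <stddef.h>

#define NUM_SLOTS 16384
static size_t slot_memory_bytes[NUM_SLOTS];
static int current_slot = -1; /* set by the command path before a mutation */

typedef struct quicklist {
    unsigned int is_user_data : 1; /* new header bit proposed by option 2C */
    /* ... existing quicklist fields ... */
} quicklist;

/* All of the structure's allocations funnel through a helper like this. */
void *quicklist_alloc(quicklist *ql, size_t size) {
    void *ptr = malloc(size); /* real code would go through zmalloc */
    if (ptr && ql->is_user_data && current_slot >= 0)
        slot_memory_bytes[current_slot] += size; /* counted only when tagged */
    return ptr;
}
```

Compared to 2B, only the allocation paths of the tagged data structures need changing, rather than every zmalloc call site in the server.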

Appendix

Table A1. Memory sparse and memory contiguous data-structures in Valkey.

Memory-sparse data structures    Memory-contiguous data structures
dict                             listpack
rax                              ziplist
quicklist                        intset
zskiplist                        sds

kyle-yh-kim · Jul 31 '24 17:07

@zvi-code I believe you wanted to comment on this.

madolson · Oct 07 '24 23:10

Hi @kyle-yh-kim ,

My name is Fred, and I was responsible for managing Valkey scalability work aimed at balancing load across shards. During this process, slot statistics play a crucial role in making optimal decisions, and, as you pointed out, memory metrics are among the most critical factors to consider. I would like to share some of my thoughts on these metrics and would appreciate hearing your perspective on the matter.

Overall, while I agree that a 0.75% reduction in TPS is relatively minor, given that this is already an opt-in feature, I would like to propose an alternative approach. This approach could provide a rough estimation of slot size without adversely affecting performance.

TL;DR: I propose calculating slot memory usage by sampling only large, sparse object types (skiplist and dict, for example), while excluding compact data types. This approach would help avoid performance penalties.

Motivation

The current proposal, which involves instrumenting all key modifications, introduces a performance penalty. The proposed approach seeks to minimize the performance overhead while addressing specific use cases effectively.

  • Key Use Cases for Slot Statistics

    • Cluster Observability

      Administrators can opt in to monitor hot shards, slots, and keys, accepting minor performance trade-offs to gain valuable insights into cluster behavior. For this use case, the aforementioned accurate statistics are necessary, and it is acceptable to incur the performance penalty, given that this is an opt-in feature.

    • Slot Migration Reliability

      Before initiating a slot migration, administrators need to verify the size of the slot and the largest key within it to ensure adequate capacity for a successful migration.

For the second use case, sampling should be plenty to estimate the slot memory size, because:

  • Most compact data structures have similar sizes, with sparse data structures contributing the most to slot memory usage variability.
  • Slot-level metrics remain relatively stable across multiple calls, offering opportunities for performance optimization.

By using a sampling method to calculate memory size, the impact on TPS will be negligible, as the engine will only sample large, sparse objects. Additionally, we will have gathered sufficient statistics to make optimal migration decisions.
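As a rough illustration of this sampling idea, a sketch under assumed helpers for enumerating a slot's objects (none of these are real Valkey APIs), with compact types approximated by a constant average size:

```c
/* Hypothetical sketch of the sampling estimate: size the large, sparse
 * objects exactly (as objectComputeSize() would) and approximate compact
 * types with a constant average. All helper names and the constant are
 * illustrative assumptions. */
#include <stddef.h>

#define COMPACT_AVG_BYTES 96 /* assumed average footprint of a compact value */

/* Stand-ins for slot introspection helpers. (Stubs.) */
extern size_t slot_sparse_object_count(int slot);
extern size_t slot_compact_object_count(int slot);
extern size_t sparse_object_size(int slot, size_t idx); /* exact size */

size_t estimate_slot_memory(int slot) {
    size_t total = 0;
    /* Sparse objects dominate variability, so size each one exactly. */
    size_t nsparse = slot_sparse_object_count(slot);
    for (size_t i = 0; i < nsparse; i++)
        total += sparse_object_size(slot, i);
    /* Compact objects are assumed to cluster around a common size. */
    total += slot_compact_object_count(slot) * COMPACT_AVG_BYTES;
    return total;
}
```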

Please let me know your feedback; I could share some details as a follow-up.

intelliu · Jan 23 '25 03:01

Redis has now added slot-level memory metrics in their PR #14451. They have already matched our CLUSTER SLOT-STATS command, with fields matching our slot-stats: key-count, cpu-usec, network-bytes-in, network-bytes-out. The new CLUSTER SLOT-STATS field they are adding is memory-bytes. It builds on the Redis PR for per-key memory accounting, #14363.

zuiderkwast · Oct 28 '25 13:10