valkey icon indicating copy to clipboard operation
valkey copied to clipboard

[NEW] Introduce slot level metrics to Redis cluster

Open PingXie opened this issue 1 year ago • 12 comments

I’m revisiting the feature proposal we discussed in https://github.com/redis/redis/issues/10472, which aims at providing metrics at the slot level. Despite the substantial effort and detailed discussions back then, we didn’t land this feature. I believe it’s worth reconsidering, given the potential benefits and previous interest.

@kyle-yh-kim @zuiderkwast @madolson

PingXie avatar Mar 25 '24 03:03 PingXie

I fully agree!

madolson avatar Mar 25 '24 03:03 madolson

I've also re-added my favorite creature comfort. @placeholderkv/core-team thoughts?

madolson avatar Mar 25 '24 03:03 madolson

Sure, metrics seem fine. I don't have strong opinions about it, only that I think fixing the cluster consistency problems is more important than metrics.

zuiderkwast avatar Mar 25 '24 10:03 zuiderkwast

https://github.com/redis/redis/pull/11432

zuiderkwast avatar Mar 25 '24 14:03 zuiderkwast

I think our big initial play should be cluster overhaul. I think a lot of us want it, and it makes the most compelling sense as the big "next major features".

madolson avatar Mar 25 '24 18:03 madolson

Good to hear back on this thread, hope you all have been doing well.

Where we left-off

In total, there were 3 proposed metrics under CLUSTER SLOT-STATS command group;

  1. key_count
  2. cpu_usec
  3. memory_bytes

Next steps

memory_bytes is the most complex of all, but this shouldn't stop us from implementing the first two metrics to gain some momentum.

I will open two PRs for key_count and cpu_usec in the coming days. These PRs will be based off of the already existing PRs for key_count and cpu_usec under Redis repository.

As for CLUSTER SLOT-STATS command format, below was the latest development we agreed upon. Lengthy discussion and rationale can be found here and here.

CLUSTER SLOT-STATS
[SLOTSRANGE start-slot end-slot [start-slot end-slot ...]]|
[ORDERBY column [LIMIT limit] [ASC|DESC]]

kyle-yh-kim avatar Mar 26 '24 15:03 kyle-yh-kim

It is great, and I prefer to add this feature in CLUSTER INFO.

hwware avatar Mar 27 '24 21:03 hwware

It is great, and I prefer to add this feature in CLUSTER INFO.

Why cluster info? It's a free form field I guess, it could be a new sub info field I suppose

madolson avatar Mar 27 '24 22:03 madolson

Thanks for chiming in. Personally, I'm opposed to CLUSTER INFO. We could perhaps add aggregated information under CLUSTER INFO, but not for the slot level metrics themselves.

Imagine dumping ~16384 slot level metrics under CLUSTER INFO. This would unnecessarily bloat the info string, when the user may have only wanted to check cluster_state:ok.

A new command dedicated for slot level metrics querying, in this case, CLUSTER SLOT-STATS, is more suitable. For reference, below was the latest command format we agreed on.

CLUSTER SLOT-STATS
[SLOTSRANGE start-slot end-slot [start-slot end-slot ...]]|
[ORDERBY column [LIMIT limit] [ASC|DESC]]

I'll wait for the core team to finalize this decision, before opening the PRs.

kyle-yh-kim avatar Mar 28 '24 00:03 kyle-yh-kim

@kyle-yh-kim Yeah CLUSTER SLOT-STATS. We're a bit overloaded with the forking stuff, new core team, new project, etc. but I think we want this for our next release. There was already a lot of review done and I think it was almost ready to merge. Do you want to bring over your PR?

zuiderkwast avatar Apr 15 '24 20:04 zuiderkwast

Ignore my spam references above, I was reviewing the diff manually over Github UI.

PR has now been opened; https://github.com/valkey-io/valkey/pull/351

This PR is part one of the three upcoming PRs;

  1. CLUSTER SLOT-STATS command introduction, with key-count support --> This PR.
  2. cpu-usec support
  3. memory-bytes support

kyle-yh-kim avatar Apr 23 '24 02:04 kyle-yh-kim

Moving ahead, I would like to resume our conversation on per-slot memory metrics. I'd argue this is the most important per-slot metric of all, as it enables for smoother horizontal scaling given the accurate memory tracking at per-slot granularity.

Last time, we converged on its high-level strategy in "online analysis" (amortizing memory tracking cost per mutative-command, over offline RDB snapshot analysis / forking a process), as well as its performance and memory impact. The following conclusion was drawn, before the issue was then put on halt by the previously open-sourced Redis-core team.

Overall this data seems really good to me. There is the separate project for improving main memory efficiency of the dictionary, so if these two features are released together it might not be noticeable.

Source: https://github.com/redis/redis/issues/10472#issuecomment-1207603354

As for module consideration, I mention in details here to keep this feature as an opt-in service to maintain backwards compatibility. For opt-in modules, they will be required to accurately track its value size, and call a newly introduced hook RM_ModuleUpdateMemorySlotStats() upon its mutation, to signal valkey-server to register the memory size gain / loss from the module’s registered write commands.

If we are still aligned to this strategy, I will start on its implementation, and open incremental PRs following the merger of the above CLUSTER SLOT-STATS command PR https://github.com/valkey-io/valkey/pull/351.

kyle-yh-kim avatar May 07 '24 17:05 kyle-yh-kim