valkey
valkey copied to clipboard
[NEW] Introduce slot level metrics to Redis cluster
I’m revisiting the feature proposal we discussed in https://github.com/redis/redis/issues/10472, which aims at providing metrics at the slot level. Despite the substantial effort and detailed discussions back then, we didn’t land this feature. I believe it’s worth reconsidering, given the potential benefits and previous interest.
@kyle-yh-kim @zuiderkwast @madolson
I fully agree!
I've also re-added my favorite creature comfort. @placeholderkv/core-team thoughts?
Sure, metrics seem fine. I don't have strong opinions about it, only that I think fixing the cluster consistency problems is more important than metrics.
https://github.com/redis/redis/pull/11432
I think our big initial play should be cluster overhaul. I think a lot of us want it, and it makes the most compelling sense as the big "next major features".
Good to hear back on this thread, hope you all have been doing well.
Where we left-off
In total, there were 3 proposed metrics under CLUSTER SLOT-STATS command group;
key_countcpu_usecmemory_bytes
Next steps
memory_bytes is the most complex of all, but this shouldn't stop us from implementing the first two metrics to gain some momentum.
I will open two PRs for key_count and cpu_usec in the coming days. These PRs will be based off of the already existing PRs for key_count and cpu_usec under Redis repository.
As for CLUSTER SLOT-STATS command format, below was the latest development we agreed upon. Lengthy discussion and rationale can be found here and here.
CLUSTER SLOT-STATS [SLOTSRANGE start-slot end-slot [start-slot end-slot ...]]| [ORDERBY column [LIMIT limit] [ASC|DESC]]
It is great, and I prefer to add this feature in CLUSTER INFO.
It is great, and I prefer to add this feature in CLUSTER INFO.
Why cluster info? It's a free form field I guess, it could be a new sub info field I suppose
Thanks for chiming in. Personally, I'm opposed to CLUSTER INFO. We could perhaps add aggregated information under CLUSTER INFO, but not for the slot level metrics themselves.
Imagine dumping ~16384 slot level metrics under CLUSTER INFO. This would unnecessarily bloat the info string, when the user may have only wanted to check cluster_state:ok.
A new command dedicated for slot level metrics querying, in this case, CLUSTER SLOT-STATS, is more suitable. For reference, below was the latest command format we agreed on.
CLUSTER SLOT-STATS
[SLOTSRANGE start-slot end-slot [start-slot end-slot ...]]|
[ORDERBY column [LIMIT limit] [ASC|DESC]]
I'll wait for the core team to finalize this decision, before opening the PRs.
@kyle-yh-kim Yeah CLUSTER SLOT-STATS. We're a bit overloaded with the forking stuff, new core team, new project, etc. but I think we want this for our next release. There was already a lot of review done and I think it was almost ready to merge. Do you want to bring over your PR?
Ignore my spam references above, I was reviewing the diff manually over Github UI.
PR has now been opened; https://github.com/valkey-io/valkey/pull/351
This PR is part one of the three upcoming PRs;
CLUSTER SLOT-STATScommand introduction, withkey-countsupport --> This PR.cpu-usecsupportmemory-bytessupport
Moving ahead, I would like to resume our conversation on per-slot memory metrics. I'd argue this is the most important per-slot metric of all, as it enables for smoother horizontal scaling given the accurate memory tracking at per-slot granularity.
Last time, we converged on its high-level strategy in "online analysis" (amortizing memory tracking cost per mutative-command, over offline RDB snapshot analysis / forking a process), as well as its performance and memory impact. The following conclusion was drawn, before the issue was then put on halt by the previously open-sourced Redis-core team.
Overall this data seems really good to me. There is the separate project for improving main memory efficiency of the dictionary, so if these two features are released together it might not be noticeable.
Source: https://github.com/redis/redis/issues/10472#issuecomment-1207603354
As for module consideration, I mention in details here to keep this feature as an opt-in service to maintain backwards compatibility. For opt-in modules, they will be required to accurately track its value size, and call a newly introduced hook RM_ModuleUpdateMemorySlotStats() upon its mutation, to signal valkey-server to register the memory size gain / loss from the module’s registered write commands.
If we are still aligned to this strategy, I will start on its implementation, and open incremental PRs following the merger of the above CLUSTER SLOT-STATS command PR https://github.com/valkey-io/valkey/pull/351.