Add framework for cross-slot commands, enable for MGET
Introduces the CMD_CROSS_SLOT flag to allow specific commands to operate on keys residing in multiple slots, provided those slots are managed by the same shard.
Key changes:
- Added `CMD_CROSS_SLOT` command flag and corresponding `cross_slot` client flag.
- Modified `getNodeByQuery` in `cluster.c` to:
  - Check for the `CMD_CROSS_SLOT` flag.
  - Allow keys in different slots if they belong to the same node.
  - Introduce `CLUSTER_REDIR_CROSS_SHARD` error if keys span multiple nodes.
  - Prevent cross-slot operations if any involved slot is migrating/importing.
- Introduced `lazy_expire_disabled` client flag to prevent lazy expiration during cross-slot command execution.
- Updated `MGET` command definition to use the new `CMD_CROSS_SLOT` flag.
- Adjusted related logic (client reset, script error handling, key slot checks) to accommodate the new flags and behavior.
This lays the groundwork for easily enabling cross-slot semantics for other commands in the future while enabling it for MGET immediately.
This PR is based on the following feedback provided by the community:
- Cross-slot `MGET` has a significant performance boost (validated);
- Cross-slot `WRITE` is messy and not interesting;
- The value of supporting cross-slot for commands other than `MGET` isn't clear;
- Cross-slot `READ` shouldn't have the `WRITE` side effect, because doing so complicates the replication;
- We can start with `MGET`, but it should be easy to extend it to other commands if needed without significantly increasing the command surface.
Note that this PR will reject an otherwise valid cross-slot MGET command if one of its slots is being migrated, but only for the legacy slot migration. It will work transparently with the planned atomic slot migration (#23).
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 71.04%. Comparing base (`5df95b3`) to head (`15047ce`). Report is 2 commits behind head on unstable.
Additional details and impacted files
```
@@            Coverage Diff            @@
##           unstable   #1986    +/-  ##
============================================
- Coverage     71.04%   71.04%   -0.01%
============================================
  Files           123      123
  Lines         66113    66135      +22
============================================
+ Hits          46972    46987      +15
- Misses        19141    19148       +7
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/cluster.c | 90.66% <100.00%> (+0.64%) | :arrow_up: |
| src/commands.def | 100.00% <ø> (ø) | |
| src/db.c | 89.58% <100.00%> (+<0.01%) | :arrow_up: |
| src/networking.c | 87.30% <100.00%> (+<0.01%) | :arrow_up: |
| src/script.c | 87.69% <ø> (ø) | |
| src/server.c | 87.94% <100.00%> (+0.03%) | :arrow_up: |
| src/server.h | 100.00% <ø> (ø) | |
@valkey-io/core-team PTAL
I just took a short look at this PR; my question is, how does a client enable and disable this feature? Or is it enabled by default?
Explicit requirements
- performance is the goal; atomicity is not required
- `MGET` is the only command of interest for now; however, we should be able to introduce cross-slot support for other "read" commands easily with this work
- "write" is not desired/needed in the foreseeable future, but the work in this PR should allow support for cross-slot "write" commands to be introduced in a compatible experience for developers
- need to support modules
- need to be compatible with existing engine features, including slot-level metrics and memory prefetching
High level decisions for discussion
- expanding `MGET` vs introducing a new command?
- if a new command, what should it be called? how do we create an easily identifiable naming pattern to help reduce the learning curve for developers?
- should we give the client a say about whether it'd like to work with unstable slots or not?
- if we support unstable slots, what should be returned if the requested key is not found in the unstable slot? an error or `{}`? if an error, what data type should be used?
- in the redirection case, should we introduce a new `-ASK`/`-MOVED` error that allows all the slots in question to be returned? otherwise, the client can update one slot's owning shard at a time upon receiving the existing `-ASK`/`-MOVED` error; it would take the client N retries to catch up with the cluster topology updates, where N is the number of slots in the original cross-slot request.
- how do we divide up the resources (cpu/network/memory) consumed by a cross-slot `MGET`?
@valkey-io/core-team
I did some more digging on pipelining. I think the core issue is that only the first command in a pipeline batch benefits from the memory prefetching. The remaining ones are all run with cold cache on the main thread (and get parsed on the main thread as well).
This is the flame chart that illustrates these two points indirectly: 1) the presence of `processInputBuffer` on the main thread; 2) the much wider margin on the IO threads, indicating that the IO threads are merely burning cycles without doing real work.
Here is the mget flamechart for comparison.
I will see if I can get the more definitive (v)PMU counters next.
Thanks Ping for the analysis!
The current IO threads design is indeed not optimized for pipelining. It parses one command from each client and hands it over to the main thread. The main thread executes the parsed command and then, if there is more in the input buffer, the main thread parses it and executes the rest of the pipeline. We can see this in processPendingCommandAndInputBuffer().
In processInputBuffer() there is a while-loop that reads, parses and executes all the available commands in the pipeline. I believe this explains what you see in the flame graph.
If IO threads are disabled, the main thread prioritizes pipelines and doesn't interleave them with commands from other clients, which is good for pipelines and transactions, but probably still not good enough to compete with cross-slot MGET.
There is a follow-up idea for IO-threading to let the IO thread parse more than one command: https://github.com/valkey-io/valkey/issues/761, the bullet "Parsing offload to parse multiple commands". I created a specific issue for it now:
- https://github.com/valkey-io/valkey/issues/2044
Even with IO-threads disabled, we can do "look-ahead" of pipelined commands by peeking into the input buffer. If we detect that there is another GET or MGET command in the pipeline, we could parse it and prefetch the keys while we execute the current command.
There is a feature plan to allow executing readonly commands in parallel in different threads if the commands are operating on different slots: https://github.com/valkey-io/valkey/issues/2022. It takes advantage of the fact that each command operates on a single slot.
The key contributor here is actually the memory prefetching. In my test, with memory prefetching enabled just for the first slot in a cross-slot MGET command, which is what happens in this PR, its performance (in terms of keys per second) is pretty much on par with that of pipelined GETs. That being said, you are right that being able to parse more than one command is the prerequisite for effective prefetching.
BTW, I wonder if we really need to pay for the complexity of "look-ahead" - I feel that all we need is to parse the entire qbuf into a command queue maintained on the client struct and then ensure the memory prefetching logic pop one command off of that queue one at a time.
> BTW, I wonder if we really need to pay for the complexity of "look-ahead" - I feel that all we need is to parse the entire qbuf into a command queue maintained on the client struct and then ensure the memory prefetching logic pop one command off of that queue one at a time.
I also wonder if we need to parse the whole query buffer, or just the next command. If we prefetch the data for the next command while we execute the current one, it might resolve to similar amounts of time.
> BTW, I wonder if we really need to pay for the complexity of "look-ahead" - I feel that all we need is to parse the entire qbuf into a command queue maintained on the client struct and then ensure the memory prefetching logic pop one command off of that queue one at a time.
Yeah, this is the better solution. It may require some more refactoring of the parsing code though.
> I also wonder if we need to parse the whole query buffer, or just the next command. If we prefetch the data for the next command while we execute the current one, it might resolve to similar amounts of time.
To get maximum benefit, I believe we should parse a few commands ahead, so we can prefetch a batch of ~16 keys, whether it's a single MGET command or 16 GET/SET/HSET/SADD commands.
The prefetching is an iterative process of 3-4 steps. After doing the first step on each key lookup (e.g. prefetching the hashtable bucket), we need to wait for the memory to be loaded before we can do the next step (e.g. prefetching the key of a potential match before we check if it's a match). When we prefetch in batches, we pay the cost of the cold memory access once per batch instead of once per key. But we can also gain if we can always be a few steps ahead of the execution: When we execute a command, we do one step of prefetching on keys of the commands that are waiting in the queue. In this way, we can have everything prefetched if we just prefetch 3-4 commands ahead.
If you look at the recent optimization of the hashtable iterator, it uses this technique. It prefetches a few steps ahead of the element returned by hashtableNext(). See #1501.
> BTW, I wonder if we really need to pay for the complexity of "look-ahead" - I feel that all we need is to parse the entire qbuf into a command queue maintained on the client struct and then ensure the memory prefetching logic pop one command off of that queue one at a time.
Maybe we should be thinking of the entire system as more like a pipeline of commands, just like the CPU does. The main command execution thread has a series of commands it needs to execute in various stages of prepared to execute:
Head: SET FOO BAR; FOO has hash bucket and serverObject prefetched. Execute!
Queued Command1: GET FOO2; FOO2 has hash bucket prefetched. Prefetch server Object.
Queued Command2: GET FOO3; FOO3 has nothing prefetched. Prefetch hash bucket.
Queued Command3: GET FOO4; Foo4 has nothing prefetched. Pipeline is full, don't start processing this command yet.
Complex operations like multi-exec might need more prefetching stages. The actual set of commands could still be maintained on the client.
@PingXie Do I get it right: you are abandoning the idea of a cross-slot framework for MGET because you found a way to make single GET work faster?
@dmytro-arkhypenko yes, and/or use one MGET per slot, if they are pipelined. There are now two PRs for the same idea, actually. Do you want to test this one? Preliminary results show it makes pipelined commands twice as fast:
- #2092
- #2044 (the initial idea discussed)
> @dmytro-arkhypenko yes, and/or use one MGET per slot, if they are pipelined. There are now two PRs for the same idea actually. Do you want to test this one?
Well, I might definitely test this, but doesn't a multi-slot command (not just MGET) potentially reduce the number of responses from up to 16k (the number of slots) to however many shards you have? You probably can't beat that with a 2x speedup.
I was hopeful this PR would come as a solution to #507, and I am a bit sad that it is not.