Add framework for cross-slot commands, enable for MGET
Introduces the CMD_CROSS_SLOT flag to allow specific commands to operate on keys residing in multiple slots, provided those slots are managed by the same shard.
Key changes:
- Added `CMD_CROSS_SLOT` command flag and corresponding `cross_slot` client flag.
- Modified `getNodeByQuery` in `cluster.c` to:
  - Check for the `CMD_CROSS_SLOT` flag.
  - Allow keys in different slots if they belong to the same node.
  - Introduce `CLUSTER_REDIR_CROSS_SHARD` error if keys span multiple nodes.
  - Prevent cross-slot operations if any involved slot is migrating/importing.
- Introduced `lazy_expire_disabled` client flag to prevent lazy expiration during cross-slot command execution.
- Updated `MGET` command definition to use the new `CMD_CROSS_SLOT` flag.
- Adjusted related logic (client reset, script error handling, key slot checks) to accommodate the new flags and behavior.
This lays the groundwork for easily enabling cross-slot semantics for other commands in the future while enabling it for MGET immediately.
This PR is based on the following feedback provided by the community:
- Cross-slot `MGET` has a significant performance boost (validated);
- Cross-slot `WRITE` is messy and not interesting;
- The value of supporting cross-slot for commands other than `MGET` isn't clear;
- Cross-slot `READ` shouldn't have the `WRITE` side effect, because doing so complicates the replication;
- We can start with `MGET`, but it should be easy to extend it to other commands if needed without significantly increasing the command surface.
Note that this PR will reject an otherwise valid cross-slot MGET command if one of its slots is being migrated, but only for the legacy slot migration. It will work transparently with the planned atomic slot migration (#23).
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 71.04%. Comparing base (`5df95b3`) to head (`15047ce`). Report is 2 commits behind head on unstable.
Additional details and impacted files
```
@@            Coverage Diff            @@
##           unstable   #1986    +/-  ##
============================================
- Coverage     71.04%   71.04%   -0.01%
============================================
  Files           123      123
  Lines         66113    66135      +22
============================================
+ Hits          46972    46987      +15
- Misses        19141    19148       +7
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/cluster.c | 90.66% <100.00%> (+0.64%) | :arrow_up: |
| src/commands.def | 100.00% <ø> (ø) | |
| src/db.c | 89.58% <100.00%> (+<0.01%) | :arrow_up: |
| src/networking.c | 87.30% <100.00%> (+<0.01%) | :arrow_up: |
| src/script.c | 87.69% <ø> (ø) | |
| src/server.c | 87.94% <100.00%> (+0.03%) | :arrow_up: |
| src/server.h | 100.00% <ø> (ø) | |
@valkey-io/core-team PTAL
I just took a short look at this PR; my question is, how does a client enable and disable this feature? Or is it enabled by default?
Explicit requirements
- performance is the goal; atomicity is not required
- `MGET` is the only command of interest for now; however, we should be able to introduce cross-slot support for other "read" commands easily with this work
- "write" is not desired/needed in the foreseeable future, but the work in this PR should allow support for cross-slot "write" commands to be introduced in a compatible experience for developers
- need to support modules
- need to be compatible with existing engine features, including slot-level metrics and memory prefetching
High level decisions for discussion
- expanding `MGET` vs introducing a new command?
- if a new command, what should it be called? how do we create an easily identifiable naming pattern to help reduce the learning curve for developers?
- should we give the client a say about whether it'd like to work with unstable slots or not?
- if we support unstable slots, what should be returned if the requested key is not found in the unstable slot? an error or `{}`? if an error, what data type should be used?
- in the redirection case, should we introduce a new `-ASK`/`-MOVED` error that allows all the slots in question to be returned? otherwise, the client can update one slot's owning shard at a time upon receiving the existing `-ASK`/`-MOVED` error; it would take the client N retries to catch up with the cluster topology updates, where N is the number of slots in the original cross-slot request.
- how do we divide up the resources (cpu/network/memory) consumed by a cross-slot `MGET`?
@valkey-io/core-team
I did some more digging on pipelining. I think the core issue is that only the first command in a pipeline batch benefits from the memory prefetching. The remaining ones are all run with cold cache on the main thread (and get parsed on the main thread as well).
This is the flame chart that illustrates these two points indirectly: 1) the presence of `processInputBuffer` on the main thread; 2) the much wider margin on the IO threads, indicating that the IO threads are merely burning cycles without doing real work.
Here is the mget flamechart for comparison.
I will see if I can get the more definitive (v)PMU counters next.
Thanks Ping for the analysis!
The current IO threads design is indeed not optimized for pipelining. It parses one command from each client and hands it over to the main thread. The main thread executes the parsed command and then, if there is more in the input buffer, the main thread parses it and executes the rest of the pipeline. We can see this in processPendingCommandAndInputBuffer().
In processInputBuffer() there is a while-loop that reads, parses and executes all the available commands in the pipeline. I believe this explains what you see in the flame graph.
If IO threads are disabled, the main thread prioritizes pipelines and doesn't interleave them with commands from other clients, which is good for pipelines and transactions, but probably still not good enough to compete with cross-slot MGET.
There is a follow-up idea for IO-threading to let the IO thread parse more than one command: https://github.com/valkey-io/valkey/issues/761, the bullet "Parsing offload to parse multiple commands". I created a specific issue for it now:
- https://github.com/valkey-io/valkey/issues/2044
Even with IO-threads disabled, we can do "look-ahead" of pipelined commands by peeking into the input buffer. If we detect that there is another GET or MGET command in the pipeline, we could parse it and prefetch the keys while we execute the current command.
There is a feature plan to allow executing readonly commands in parallel in different threads if the commands are operating on different slots: https://github.com/valkey-io/valkey/issues/2022. It takes advantage of the fact that each command operates on a single slot.
The key contributor here is actually the memory prefetching. In my test, with memory prefetching enabled just for the first slot in a cross-slot MGET command, which is what happens in this PR, its performance (in terms of keys per second) is pretty much on par with that of pipelined GETs. That being said, you are right that being able to parse more than one command is the prerequisite for effective prefetching.
BTW, I wonder if we really need to pay for the complexity of "look-ahead" - I feel that all we need is to parse the entire qbuf into a command queue maintained on the client struct and then ensure the memory prefetching logic pop one command off of that queue one at a time.
> BTW, I wonder if we really need to pay for the complexity of "look-ahead" - I feel that all we need is to parse the entire qbuf into a command queue maintained on the client struct and then ensure the memory prefetching logic pop one command off of that queue one at a time.
I also wonder if we need to parse the whole query buffer, or just the next command. If we prefetch the data for the next command while we execute the current one, it might resolve to similar amounts of time.
> BTW, I wonder if we really need to pay for the complexity of "look-ahead" - I feel that all we need is to parse the entire qbuf into a command queue maintained on the client struct and then ensure the memory prefetching logic pop one command off of that queue one at a time.
Yeah, this is the better solution. It may require some more refactoring of the parsing code though.
> I also wonder if we need to parse the whole query buffer, or just the next command. If we prefetch the data for the next command while we execute the current one, it might resolve to similar amounts of time.
To get maximum benefit, I believe we should parse a few commands ahead, so we can prefetch a batch of ~16 keys, whether it's a single MGET command or 16 GET/SET/HSET/SADD commands.
The prefetching is an iterative process of 3-4 steps. After doing the first step on each key lookup (e.g. prefetching the hashtable bucket), we need to wait for the memory to be loaded before we can do the next step (e.g. prefetching the key of a potential match before we check if it's a match). When we prefetch in batches, we pay the cost of the cold memory access once per batch instead of once per key. But we can also gain if we can always be a few steps ahead of the execution: When we execute a command, we do one step of prefetching on keys of the commands that are waiting in the queue. In this way, we can have everything prefetched if we just prefetch 3-4 commands ahead.
If you look at the recent optimization of the hashtable iterator, it uses this technique. It prefetches a few steps ahead of the element returned by hashtableNext(). See #1501.
> BTW, I wonder if we really need to pay for the complexity of "look-ahead" - I feel that all we need is to parse the entire qbuf into a command queue maintained on the client struct and then ensure the memory prefetching logic pop one command off of that queue one at a time.
Maybe we should be thinking of the entire system as more like a pipeline of commands, just like the CPU does. The main command execution thread has a series of commands it needs to execute in various stages of prepared to execute:
Head: SET FOO BAR; FOO has hash bucket and serverObject prefetched. Execute!
Queued Command1: GET FOO2; FOO2 has hash bucket prefetched. Prefetch server Object.
Queued Command2: GET FOO3; FOO3 has nothing prefetched. Prefetch hash bucket.
Queued Command3: GET FOO4; Foo4 has nothing prefetched. Pipeline is full, don't start processing this command yet.
Complex operations like multi-exec might need more prefetching stages. The actual set of commands could still be maintained on the client.
@PingXie Do I get it right: you are abandoning the idea of a cross-slot framework for MGET because you found a way to make single GET work faster?
@dmytro-arkhypenko yes, and/or use one MGET per slot, if they are pipelined. There are now two PRs for the same idea, actually. Do you want to test this one? Preliminary results show it makes pipelined commands twice as fast:
- #2092
- #2044 (the initial idea discussed)
> @dmytro-arkhypenko yes, and/or use one MGET per slot, if they are pipelined. There are now two PRs for the same idea actually. Do you want to test this one?
Well, I might definitely test this, but doesn't a multi-slot command (not just MGET) potentially reduce the number of responses from up to 16k (the number of slots) to however many shards you have? You probably can't beat that with a 2x speedup.
I was hopeful this PR would come as a solution to #507, and I am a bit sad that it is not.