Parse multiple commands and prefetch keys
Instead of parsing only one command per client before executing it, parse multiple commands from the query buffer and batch-prefetch the keys accessed by the commands in the queue before executing them.
This is an optimization for pipelined commands, both with and without I/O threads. The replication stream is pipelined, so it may also benefit from this.
- When parsing commands from the input buffer, multiple commands are parsed and stored in a command queue per client.
- In single-threaded mode (I/O threads off) keys are batch-prefetched before the commands in the queue are executed. Multi-key commands like MGET, MSET and DEL benefit from this even if pipelining is not used.
- I/O threads offload command lookup and cluster slot lookup for each command in the queue.
- When I/O threads are used, prefetching is done for multiple clients in parallel. This code now takes each client's command queue into account, improving prefetching when pipelining is used.
- When I/O threads are used and the maximum number of keys has already been prefetched (default 16), a client's commands are executed one by one from the queue. If the queue contains more commands whose keys have not been prefetched (say the client sends 16 pipelined MGETs with 16 keys each), the keys for the next few commands in the queue are prefetched before those commands are executed. (This reuses the code path used in single-threaded mode; a rough sketch of the parse-then-prefetch flow is shown below.)
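To make the flow above concrete, here is a minimal, hypothetical sketch. The names (`parsed_cmd`, `process_client_queue`, `PREFETCH_BATCH_MAX_KEYS`) and the use of `__builtin_prefetch` on the argument strings are illustrative assumptions, not the identifiers or the prefetch mechanism actually used in this PR.

```c
#include <stddef.h>

#define PREFETCH_BATCH_MAX_KEYS 16 /* default prefetch batch size (assumed) */

/* One fully parsed command waiting in a per-client queue. */
typedef struct parsed_cmd {
    int argc;
    char **argv;             /* argument vector produced by the parser */
    struct parsed_cmd *next; /* next command parsed from the same query buffer */
} parsed_cmd;

/* Prefetch the keys of as many queued commands as the batch limit allows,
 * then execute the queue in order. If commands remain whose keys were not
 * prefetched, the same loop runs again for the next few commands. */
static void process_client_queue(parsed_cmd *queue) {
    size_t prefetched = 0;
    for (parsed_cmd *c = queue; c && prefetched < PREFETCH_BATCH_MAX_KEYS; c = c->next) {
        for (int i = 1; i < c->argc && prefetched < PREFETCH_BATCH_MAX_KEYS; i++) {
            __builtin_prefetch(c->argv[i]); /* stand-in for the real hashtable prefetch */
            prefetched++;
        }
    }
    /* ... execute commands from the queue here, re-prefetching as needed ... */
}
```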
Code improvements:
- Decoupling of command parser state and command execution state:
- The variables reqtype, multibulklen and bulklen refer to the current position in the query buffer. They are no longer reset in resetClient (which runs after each executed command); instead, they are reset by the parser code after each completely parsed command.
- The command parser code is partially decoupled from the client struct. The query buffer is still one per client, but the resulting argument vector is stored in caller-defined variables (a possible shape of this is sketched below).
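As an illustration of that decoupling, a possible shape could look like the following. The names and the signature are hypothetical, not the PR's actual API.

```c
#include <stddef.h>

/* Parser-internal position in the query buffer; reset by the parser itself
 * after each completely parsed command, not by resetClient. */
typedef struct parser_state {
    int reqtype;       /* inline vs. multibulk request */
    long multibulklen; /* bulks remaining in the command being parsed */
    long bulklen;      /* length of the bulk currently being read */
} parser_state;

/* Returns 1 when a complete command has been parsed; the argument vector is
 * written to caller-provided variables instead of directly into the client. */
int parse_next_command(parser_state *st, const char *buf, size_t len,
                       int *out_argc, char ***out_argv);
```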
Benchmarks with and without I/O threads are included below.
Fixes #2044
Codecov Report
:x: Patch coverage is 86.91099% with 25 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 72.21%. Comparing base (ac51515) to head (a4ad52e).
:warning: Report is 1 commit behind head on unstable.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/memory_prefetch.c | 0.00% | 18 Missing :warning: |
| src/networking.c | 96.73% | 5 Missing :warning: |
| src/server.c | 90.00% | 2 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## unstable #2092 +/- ##
============================================
+ Coverage 72.02% 72.21% +0.19%
============================================
Files 126 126
Lines 70493 70616 +123
============================================
+ Hits 50771 50994 +223
+ Misses 19722 19622 -100
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/kvstore.c | 96.01% <ø> (ø) | |
| src/module.c | 9.52% <ø> (ø) | |
| src/server.h | 100.00% <ø> (ø) | |
| src/server.c | 88.44% <90.00%> (-0.03%) | :arrow_down: |
| src/networking.c | 88.28% <96.73%> (+0.36%) | :arrow_up: |
| src/memory_prefetch.c | 11.26% <0.00%> (-0.77%) | :arrow_down: |
The overall approach makes sense to me. I have some minor thoughts, but I haven't thought too much yet about how this compares with Ping's PR.
Thanks @madolson for taking the time to look at this.
Looking back at the original I/O threading, we batch N keys; maybe we should change it to batch N commands instead, since that will make sure we are fully parsing each command. That way we can probably unify the code more tightly.
I think we should prefetch full commands, but we can still configure the limit as a number of keys. If adding a command to the batch goes over N keys, we can allow prefetching a few more, say up to N*2. That way we prefetch 16 keys by default but allow up to 32 in order to stick to full commands.
We should definitely unify the I/O-threaded and single-threaded code paths for this code and more of the parse-lookup-prefetch-execute flow. I avoided refactoring to keep the diff small, but I think it's necessary to at least rename some functions to keep this comprehensible.
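A minimal sketch of the policy being discussed, assuming a hypothetical helper and a key budget `n` equal to the configured prefetch batch size; this is not code from the PR:

```c
#include <stddef.h>

/* Add a whole command's keys to the prefetch batch: stop adding commands once
 * the budget n is reached, but tolerate up to 2*n keys so that a command is
 * never split across batches. Returns the new batched-key count, or the old
 * one if the command was not accepted. */
static size_t try_add_command_to_batch(size_t batched_keys, size_t cmd_keys, size_t n) {
    if (batched_keys >= n) return batched_keys;               /* budget already used up */
    if (batched_keys + cmd_keys > 2 * n) return batched_keys; /* would exceed the hard cap */
    return batched_keys + cmd_keys;                           /* accept the whole command */
}
```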
> Looking back at the original I/O threading, we batch N keys; maybe we should change it to batch N commands instead, since that will make sure we are fully parsing each command. That way we can probably unify the code more tightly.

> I think we should prefetch full commands, but we can still configure the limit as a number of keys. If adding a command to the batch goes over N keys, we can allow prefetching a few more, say up to N*2. That way we prefetch 16 keys by default but allow up to 32 in order to stick to full commands.
My preference, for the sake of simplicity, is that we continue to count keys. As soon as we use up max_prefetch_size we should just execute the command, irrespective of the number of keys referenced by the command (e.g., MGET) or the number of commands batched.
Ultimately, max_prefetch_size is determined by the L1 cache size. There is only so much we can prefetch; over-prefetching will lead to L1 cache thrashing, nullifying the effect.
The L1 data cache on Intel CPUs is typically 32 KiB or 48 KiB, with a cache line size of 64 bytes, which gives us 512 or 768 cache lines, respectively. I haven't carefully counted the number of memory accesses needed to prefetch a key-value pair of strings in the new hashtable, but I would guess about 5 (including both keys and the hash table). Assuming 5, prefetching 32 keys would consume 160 cache lines, which is already not a trivial number compared to 512.
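For reference, the back-of-the-envelope numbers above can be reproduced as follows; the 5-lines-per-key figure is the rough guess from the comment, not a measured value.

```c
#include <stdio.h>

int main(void) {
    const int line_size = 64;       /* cache line size in bytes */
    const int l1_small = 32 * 1024; /* 32 KiB L1d */
    const int l1_large = 48 * 1024; /* 48 KiB L1d */
    const int lines_per_key = 5;    /* assumed accesses per key-value pair */

    printf("L1d cache lines: %d or %d\n",
           l1_small / line_size, l1_large / line_size); /* 512 or 768 */
    printf("Cache lines touched by prefetching 32 keys: %d\n",
           32 * lines_per_key);                         /* 160 */
    return 0;
}
```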
Hello, it seems that this pull request has some conflicts with the unstable branch. Do you need any assistance working on it? Merging it into 9.0 would be beneficial for many users (a lot of clients use pipelines automatically).
After discussion with @PingXie and others, I wanted to rewrite it a bit and unify the prefetching code paths for io-threaded and single-threaded modes. I probably can't make it to 9.0 though.
How would folks (@PingXie, @madolson) feel about merging this more or less as is for 9.0 and improving it later?
> How would folks (@PingXie, @madolson) feel about merging this more or less as is for 9.0 and improving it later?
Works for me. Will review this PR next.
Hello. I apologize for the inconvenience, but the 9.0 feature freeze is imminent. How can I assist in ensuring that this PR is included in the 9.0 version?
Re-did some benchmarks on currently supported versions. The most important conclusion: without this PR, 9.0 will be the third consecutive version that is significantly slower than 7.2 when pipelining is enabled. I think it is important that this PR gets into 9.0, because valkey-go does pipelining by default. Config:
databases 1
save ""
appendonly no
rdbcompression no
activedefrag no
maxclients 1000
io-threads 9
protected-mode no
maxmemory 2500mb
maxmemory-policy noeviction
hz 1
latency-monitor-threshold 0
- cset shield
- cpu bind
- cache population:
  valkey-benchmark -h <ip> -c 256 --threads 64 -d 256 -t set -n 3000000 -r 3000000
- benchmark:
  valkey-benchmark -h <ip> -c 196 --threads 48 -d 256 -t get -n 30000000 -r 3000000 -P (1 or 16)
Results:
7.2: Pipe 1
Summary:
throughput summary: 204579.86 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.937 0.240 0.887 1.079 1.719 3000.319
Pipe 16
Summary:
throughput summary: 949307.00 requests per second
latency summary (msec):
avg min p50 p95 p99 max
2.984 0.168 3.031 3.359 3.479 4.631
8.0: Pipe 1
Summary:
throughput summary: 590876.88 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.325 0.032 0.335 0.455 0.511 2.687
Pipe 16
Summary:
throughput summary: 633058.31 requests per second
latency summary (msec):
avg min p50 p95 p99 max
4.933 0.064 4.911 5.151 5.375 9.327
8.1: Pipe 1
Summary:
throughput summary: 584487.69 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.327 0.040 0.335 0.463 0.519 2.623
Pipe 16
Summary:
throughput summary: 653879.69 requests per second
latency summary (msec):
avg min p50 p95 p99 max
4.777 0.056 4.743 5.007 6.431 8.223
unstable: Pipe 1
Summary:
throughput summary: 567923.69 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.337 0.032 0.351 0.463 0.519 4.511
Pipe 16
Summary:
throughput summary: 747924.50 requests per second
latency summary (msec):
avg min p50 p95 p99 max
4.170 0.096 4.167 4.287 4.431 8.743
Conclusions: Without pipelining, performance is significantly better in 8.0/8.1 than in 7.2. With pipelining (16), performance is significantly better in 7.2 than in 8.0/8.1.
I rebased this branch onto unstable in JoBeR007/valkey@rebased, but I think I need to take a closer look at the stability of this PR when applied to unstable, as some tests are now failing.
@dvkashapov Thank you! I have rebased it locally and fixed some problems, but I still get failures locally. I can push what I have so we can debug it together. It would indeed be nice to get this in 9.0 (even after RC1 might be acceptable).
@valkey-io/core-team This PR has been updated and I believe this is working well now. Can we review it and get it into 9.0-rc2?
As @JoBeR007 pointed out, 8.x is slower than 7.2 when pipelining is used, and even 9.0 would be unless we take this PR (or something like it).
This is also supposed to compensate for our rejection of cross-slot MGET. With this PR, multiple pipelined small MGETs or GETs can be much faster than they are today.
I was able to reproduce the test locally. It's a bit flaky. It seems like the migration is getting stuck somehow.
We're getting de-sync errors in the test:
24786:S 19 Aug 2025 09:52:35.066 # Protocol error (Master using the inline protocol. Desync?) from client: id=32 addr=127.0.0.1:21116 laddr=127.0.0.1:61112 fd=25 name= age=11 idle=0 flags=M capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=16010 qbuf-free=49520 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=67136 events=r cmd=set user=(superuser) redir=-1 resp=2 lib-name= lib-ver= tot-net-in=12800433 tot-net-out=451 tot-cmds=20579. Query buffer: '6ZJ}:226..*2..$6..UNLINK..$8..{6ZJ}:22..*2..$6..UNLINK..$8..{6ZJ' (... more 14880 bytes ...) '..*2..$6..UNLINK..$9..{6ZJ}:207..*2..$6..UNLINK..$9..{6ZJ}:280..'
Later-in-the-day update: the target eventually stops processing incoming requests, but it is not consistent. There doesn't seem to be anything non-deterministic about it that I can see, apart from the speed of processing. @murphyjacob4 since you're back, any thoughts on the test failures would be welcome as well.
Just to comment, since Viktor messaged me on Slack: apparently there is something getting incorrectly updated for the replication offset. This will have an impact on other deeply pipelined replication streams, but it looks like only the slot migration tests are triggering it.
> Apparently there is something getting incorrectly updated for the replication offset. This will have an impact on other deeply pipelined replication streams, but it looks like only the slot migration tests are triggering it.
Since slot migration uses chained replication, if the replication offset is wrong, what is sent to the replicas of the target will also be wrong:
https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L3677-L3689
So that would explain why we see the desync errors.