[High level design] Allow Read requests on I/O threads
Overview
We propose enhancing Valkey with multi-threaded command execution capabilities to significantly increase throughput, from the current 0.92M GET requests per second (CME, cluster mode enabled) to over 3.1M GET requests per second. Our prototype demonstrates that offloading read commands to the async threads can yield up to a 12.7x performance improvement for CPU-intensive operations.
Background
Valkey 8.0 introduced async threads to handle input/output operations while the main thread focuses on command execution. Despite this advancement, a critical bottleneck remains: all command execution still occurs sequentially in the single main thread, which becomes increasingly problematic as systems scale to more cores and memory. Furthermore, the current architecture's benefits are minimal for workloads dominated by CPU-intensive operations, since offloading only I/O tasks provides limited performance gains when processing is the primary constraint.
Proposed Solution
The new dict-per-slot architecture introduced in Valkey 8.0 provides an opportunity for parallelization. Since each slot has its own dictionary, commands operating on different slots could theoretically execute in parallel without contention.
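To make that starting point concrete, the layout can be pictured as one dictionary per cluster slot, with a key's slot derived from its CRC16 hash as Valkey's keyHashSlot does. A minimal sketch, where `dict` and `slot_dicts` are illustrative stand-ins rather than Valkey's actual definitions:

```c
/* Conceptual view of the dict-per-slot layout. `dict` and `slot_dicts` are
 * illustrative stand-ins; keyHashSlot mirrors the real slot mapping
 * (CRC16(key) & 16383, honoring {hashtags}) from cluster.c. */
#define CLUSTER_SLOTS 16384

typedef struct dict dict;                /* stand-in for Valkey's dict type */
extern dict *slot_dicts[CLUSTER_SLOTS];  /* one independent dictionary per slot */

unsigned int keyHashSlot(char *key, int keylen);  /* real Valkey function */

/* Keys in different slots never touch the same dictionary, which is what
 * makes per-slot parallel reads possible without data-structure contention. */
static dict *dict_for_key(char *key, int keylen) {
    return slot_dicts[keyHashSlot(key, keylen)];
}
```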
We suggest leveraging this feature and propose an architectural update to how the main thread interacts with other threads. In this design, the main thread still maintains its role as a central scheduler, distributing tasks among threads to ensure safe access to clients and data structures. The main thread communicates with the worker threads through lock-free producer-consumer queues.
The key enhancement is expanding the job types beyond I/O operations to include execute-command jobs, allowing worker threads to independently process commands. By maintaining this centralized coordination model, we eliminate the need for complex synchronization mechanisms such as locks or atomic operations. Importantly, this enhancement requires only minimal changes to the existing codebase while significantly improving resource utilization and performance scalability.
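As a rough illustration of the coordination primitive this relies on, here is a minimal single-producer/single-consumer ring buffer of the kind the design implies: the main thread is the only producer for a given worker's queue, and that worker is the only consumer. All names are illustrative, not Valkey's actual types; note that the queue itself still needs C11 atomics for ordering, even though command execution avoids shared-state synchronization:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SIZE 1024  /* power of two, so index = counter & (size - 1) */

enum { JOB_EPOLL_WAIT, JOB_READ, JOB_EXECUTE, JOB_WRITE };

typedef struct {
    int   type;     /* one of the JOB_* values above */
    void *payload;  /* e.g. the client/connection the job refers to */
} job_t;

typedef struct {
    job_t items[QUEUE_SIZE];
    _Atomic size_t head;  /* advanced only by the consumer (worker thread) */
    _Atomic size_t tail;  /* advanced only by the producer (main thread) */
} job_queue;

/* Producer side: only the main thread pushes to a given worker's queue. */
static bool job_queue_push(job_queue *q, job_t job) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_SIZE) return false;       /* queue full */
    q->items[tail & (QUEUE_SIZE - 1)] = job;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: only the owning worker pops from its queue. */
static bool job_queue_pop(job_queue *q, job_t *out) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) return false;                    /* queue empty */
    *out = q->items[head & (QUEUE_SIZE - 1)];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```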
Read command execution
The picture below depicts the execution of a GET command in the proposed architecture:
In this example, thread #1 was assigned the epoll_wait job. Upon receiving the epoll job completion event, the main thread checks the result of the epoll_wait call and finds a read event on connection 18. It then determines which thread is in charge of reading from connection 18 (thread #2) and produces a read job on that thread's job queue. Thread #2 reads the data from the socket, parses the GET command, determines the slot owning the key, and sends a read-completed event to the main thread. On receiving the read-completed event, the main thread validates the legitimacy of the command, increments the number of pending requests on that slot, and posts an execute job on the queue of the thread in charge of executing commands for that key slot (thread #3). Thread #3 executes the command, writes the response to the connection socket, and sends an execute-completed event to the main thread, which in turn decrements the number of pending requests on the slot and finalizes the command's accounting.
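A condensed sketch of the main thread's side of this flow, building on the queue sketch above. The types and helpers here (client, validate_command, worker_queues, and so on) are invented stand-ins for the corresponding Valkey machinery, not its actual API:

```c
#include <stdbool.h>

#define NUM_SLOTS   16384
#define NUM_WORKERS 18

typedef struct client { int slot; /* parsed command, connection, ... */ } client;

bool validate_command(client *c);       /* arity, ACL, redirects, ... (invented) */
void reply_error(client *c);            /* invented */
void finalize_command_stats(client *c); /* invented */
extern job_queue worker_queues[NUM_WORKERS];

/* Pending offloaded reads per slot. Only the main thread touches this,
 * which is why no locks or atomics are needed. */
static int slot_pending[NUM_SLOTS];

/* Static slot->thread mapping, for the sketch only. */
static int thread_for_slot(int slot) { return slot % NUM_WORKERS; }

/* Main thread: a worker finished reading and parsing a command. */
void on_read_complete(client *c) {
    if (!validate_command(c)) { reply_error(c); return; }
    slot_pending[c->slot]++;
    job_t job = {.type = JOB_EXECUTE, .payload = c};
    job_queue_push(&worker_queues[thread_for_slot(c->slot)], job);
}

/* Main thread: a worker executed the command and wrote the reply. */
void on_execute_complete(client *c) {
    slot_pending[c->slot]--;
    finalize_command_stats(c);
}
```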
Write command execution
The picture below depicts the execution of a SET command in the proposed architecture:
Here, after validating the command, the main thread determines whether it can execute the SET command immediately or whether it has to wait until all active read commands on the slot have completed. In the latter case, it puts the client in a blocked state. When the last read-completed event for that slot is received, the main thread unblocks the client, executes the command, and sends a write job to thread #2.
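Continuing the same sketch, the write path can be expressed as a gate on the per-slot pending counter. At most one parked writer per slot is shown for brevity, and every name remains illustrative:

```c
void execute_write_on_main_thread(client *c); /* mutation, replication, ... (invented) */
int  thread_for_connection(client *c);        /* worker owning the socket (invented) */

static client *blocked_writer[NUM_SLOTS];

void on_write_command(client *c) {
    if (slot_pending[c->slot] > 0) {
        blocked_writer[c->slot] = c;  /* block until in-flight reads drain */
        return;
    }
    execute_write_on_main_thread(c);
    /* Only the reply I/O is offloaded; the mutation ran on the main thread. */
    job_t job = {.type = JOB_WRITE, .payload = c};
    job_queue_push(&worker_queues[thread_for_connection(c)], job);
}

/* Variant of on_execute_complete that releases a parked writer when the
 * last pending read on the slot finishes. */
void on_execute_complete_w(client *c) {
    if (--slot_pending[c->slot] == 0 && blocked_writer[c->slot]) {
        client *w = blocked_writer[c->slot];
        blocked_writer[c->slot] = NULL;
        on_write_command(w);          /* now safe to execute */
    }
    finalize_command_stats(c);
}
```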
Performance Evaluation
We implemented the proposed update in a prototype to evaluate the performance gain in read-only and read/write scenarios.
Test Environment
- Server: c7gn.16xlarge instance
- Clients: 3 x r7g.16xlarge instances
- Configuration: All instances in same placement group (IAD region)
- Valkey 8.1: 8 io-threads (adding additional threads doesn't improve performance)
- Prototype: 20 threads (1 main thread, 18 worker threads, 1 epoll thread)
- Mode: All tests in cluster mode
Dataset
- Strings: 3M keys with 512-byte values - GET and SET as read and write command respectively
- Hashes: 1M hashes, each with 50 fields (70 bytes/field) - HGETALL and HSET as read and write command respectively
- Sorted Sets: 1M sorted sets, each with 50 members (70 bytes/member) - ZRANK and ZADD as read and write command respectively
- Lists: 1M lists, each with ~50 elements (70 bytes/element) - LINDEX and LSET as read and write command respectively
Benchmark Scenarios
- 100% read operations
- 80% read / 20% write operations
Benchmark Results (all figures in requests per second)
String Operations:
| Workload | Valkey 8.1 | PoC 16xlarge | PoC 4xlarge |
|---|---|---|---|
| 100% Read | 924K | 3,100K | 2,005K |
| 80% R, 20% W | 900K | 1,860K | 1,855K |
Hash Operations:
| Workload | Valkey 8.1 | PoC 16xlarge | PoC 16xl 30 threads | PoC 4xlarge |
|---|---|---|---|---|
| 100% Read | 233K | 2,138K | 2,960K | 1,193K |
| 80% R, 20% W | 266K | 1,597K | NA | 1,325K |
Sorted Set Operations:
| Workload | Valkey 8.1 | PoC 16xlarge | PoC 16xl 9 threads | PoC 4xlarge |
|---|---|---|---|---|
| 100% Read | 465K | 2,837K | NA | 1,612K |
| 80% R, 20% W | 418K | 962K | 1,100K | 1,003K |
List Operations:
| Workload | Valkey 8.1 | PoC 16xlarge | PoC 16xl 11 threads | PoC 4xlarge |
|---|---|---|---|---|
| 100% Read | 817K | 3,090K | NA | 1,940K |
| 80% R, 20% W | 807K | 1,591K | 1,901K | 1,816K |
Key Findings
- String GET: 3.1 million GET requests per second, a 3.3x improvement
- HGETALL: 9.1x acceleration, up to 12.7x with 30 threads
- ZRANK: 6.1x acceleration
- Mixed workloads: >2x performance even with non-parallelized writes
- Medium instances (16 cores): Significant improvements in all tests
Conclusion
This demonstrates that parallel read command execution can dramatically improve Valkey's performance, reaching over 3M requests per second on a single instance. The approach requires limited code changes while providing significant throughput improvements across various data structures, even when write commands are also invoked.
Proposed roadmap
We propose delivering the proposed update in Valkey 9.0 with the following limitations:
- Offloading major commands only
- Cluster Mode Only
- Modules - Modules' logic can have side effects that create contention conflicts with other async threads. To address this issue, module functions will be executed by the main thread in "exclusive mode", meaning no async threads will be executing commands at that time. New APIs will be introduced to allow declaring certain module components (commands, keyspace event callbacks, cron jobs) as "slot safe", which will enable module parallelization as well (see the sketch below).
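The module API for this has not been designed yet. Purely as an illustration of what a "slot safe" declaration could look like, the following sketch invents a `slot-safe` command flag; everything else mirrors the existing Valkey module on-load pattern:

```c
/* Hypothetical sketch only: the "slot-safe" flag below does not exist today. */
#include "valkeymodule.h"

int MyGet_Command(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc) {
    VALKEYMODULE_NOT_USED(argv);
    VALKEYMODULE_NOT_USED(argc);
    return ValkeyModule_ReplyWithNull(ctx);  /* placeholder body */
}

int ValkeyModule_OnLoad(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc) {
    VALKEYMODULE_NOT_USED(argv);
    VALKEYMODULE_NOT_USED(argc);
    if (ValkeyModule_Init(ctx, "mymod", 1, VALKEYMODULE_APIVER_1) == VALKEYMODULE_ERR)
        return VALKEYMODULE_ERR;
    /* "slot-safe" (invented) would promise that the command touches a single
     * slot and no shared module state, making it eligible for worker-thread
     * offload instead of main-thread "exclusive mode". */
    if (ValkeyModule_CreateCommand(ctx, "mymod.get", MyGet_Command,
                                   "readonly slot-safe", 1, 1, 1) == VALKEYMODULE_ERR)
        return VALKEYMODULE_ERR;
    return VALKEYMODULE_OK;
}
```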
Worth noting that there are proposals to allow some cross-slot command execution (e.g. https://github.com/valkey-io/valkey/pull/1986). Not necessarily a blocker, but might affect assumptions in this design.
If slot-level write exclusivity is enforced by the main thread, couldn’t SET/WRITE commands also be offloaded?
> If slot-level write exclusivity is enforced by the main thread, couldn’t SET/WRITE commands also be offloaded?
@xbasel Write commands have many side effects, such as replication, unblocking clients, client tracking, etc., which are more complicated to handle safely from the threads.
Amazing!
- I see read and execute can be done in different threads. Wouldn't it be better for L1 cache if these two operations would be done by the same thread?
- What about commands like SCAN and KEYS? They span multiple slots. Do they need a global slot lock?
@zuiderkwast
> I see read and execute can be done in different threads. Wouldn't it be better for L1 cache if these two operations would be done by the same thread?

Yes, we can optimize it if the slot isn't already being handled by a different thread.

> What about commands like SCAN and KEYS? They span multiple slots. Do they need a global slot lock?

In such cases, exclusivity is required for all slots. We will stop offloading commands until those commands are processed, essentially falling back to the current behavior.
Nice, this is a great development.
> Cluster Mode Only
This feels like a major limitation, given that valkey cluster can already horizontally scale. Do we think operating on the slot is the right primitive and can it be modified to support standalone distributions as well?
> In such cases, exclusivity is required for all slots. We will stop offloading commands until those commands are processed, essentially falling back to the current behavior.
The difference is that while the system is running at 3M RPS, it will suddenly drop to 900K RPS when a scan operation comes in (which is likely some sort of debugging tool). This could cause significant latency spikes.
> Offloading major commands only
What is the proposed list of major commands?
> @xbasel Write commands have many side effects, such as replication, unblocking clients, client tracking, etc., which are more complicated to handle safely from the threads.
We still have to handle complex operations like keyspace-miss notifications for reads, and evictions.
@madolson
> Cluster Mode Only
>
> This feels like a major limitation, given that valkey cluster can already horizontally scale. Do we think operating on the slot is the right primitive and can it be modified to support standalone distributions as well?
In cluster mode, this change can be implemented with minimal code modifications while maintaining the existing core architecture. Based on community feedback, we can later extend support to non-cluster mode by implementing multiple dictionaries that allow parallel access from multiple threads. This approach will require careful handling of multi-slot commands. While this extension would require more substantial code changes and effort, it is feasible.
> In such cases, exclusivity is required for all slots. We will stop offloading commands until those commands are processed, essentially falling back to the current behavior.
>
> The difference is that while the system is running at 3M RPS, it will suddenly drop to 900K RPS when a scan operation comes in (which is likely some sort of debugging tool). This could cause significant latency spikes.
We can modify the SCAN command to take exclusivity one slot at a time.
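A minimal sketch of that idea, with invented helper names: instead of demanding exclusivity over the whole keyspace, each cursor step drains and locks out offloaded jobs for a single slot at a time:

```c
/* Illustrative only: SCAN taking exclusivity one slot at a time.
 * All helper names here are invented. */
typedef struct client client;

void mark_slot_exclusive(int slot);      /* stop offloading new jobs for this slot */
void wait_for_slot_quiescent(int slot);  /* drain in-flight reads on this slot */
void clear_slot_exclusive(int slot);
void scan_slot_dict(client *c, int slot);

void scan_cursor_step(client *c, int slot) {
    mark_slot_exclusive(slot);
    wait_for_slot_quiescent(slot);
    scan_slot_dict(c, slot);     /* main thread scans only this slot's dict */
    clear_slot_exclusive(slot);  /* the other 16383 slots keep running in parallel */
}
```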
> Offloading major commands only
>
> What is the proposed list of major commands?
Theoretically, all read commands that require minimal code changes and access a single slot can be offloaded. We will need to examine each command individually to verify it is thread-safe, ensuring that the commands don't access any global variables or other shared resources.
> @xbasel Write commands have many side effects, such as replication, unblocking clients, client tracking, etc., which are more complicated to handle safely from the threads.
>
> We still have to handle complex operations like keyspace-miss notifications for reads, and evictions.
Evictions are postponed and executed in the main thread after the command-processing thread completes (see the sketch below). As for keyspace events, the only keyspace event that can be triggered by read-only operations without eviction is the miss event. Initially, we may not offload the command if a keyspace-miss-event listener is registered; we can later implement this in a thread-safe mode.
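A sketch of the postponed-eviction idea under these assumptions; performEvictions echoes the real eviction entry point in evict.c, but the simplified signature and the other names are illustrative:

```c
#include <stdbool.h>

bool memory_over_limit(void);  /* invented maxmemory check */
void performEvictions(void);   /* must run on the main thread only */

typedef struct { bool needs_eviction; /* reply buffer, ... */ } exec_result;

/* Worker thread: run the read-only command, but only *observe* memory
 * pressure; never evict from a worker. */
void worker_execute_read(exec_result *r) {
    /* ... execute the command and write the reply ... */
    r->needs_eviction = memory_over_limit();
}

/* Main thread, on the execute-completed event. */
void main_on_execute_complete(exec_result *r) {
    if (r->needs_eviction) performEvictions();
}
```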
> In cluster mode, this change can be implemented with minimal code modifications while maintaining the existing core architecture. Based on community feedback, we can later extend support to non-cluster mode by implementing multiple dictionaries that allow parallel access from multiple threads. This approach will require careful handling of multi-slot commands. While this extension would require more substantial code changes and effort, it is feasible.
I guess I just see the 3M RPS as a little bit artificial, since we can achieve that today anyways with cluster mode, which can scale read and write operations already. With IO threading, it improved functionally all workloads, whereas this comes with a higher number of caveats.
The title of the issue is not descriptive. "Execute read commands in io threads" is a better title. Also, it should be cross-linked with related issues such as #761. I'm just trying to get an overview and help others do the same. It gets hard to track all the issues...
First, thank you to @touitou-dan and @uriyage for putting this together and for another massive contribution to Valkey. Really appreciate it.
Overall, this is a huge and exciting proposal. However, given its scale and the number of subsystems it touches, I don't think this is something we can realistically land in time for Valkey 9.0, which is scheduled for this September. We're just a few weeks away from the feature merge deadline, and a change this significant needs more time to bake.
I've read through the design and the comments from @madolson, @zuiderkwast, and others. Here are some of my thoughts:
1. Madelyn raised a good point about the potential for QPS to drop when a write command appears for a slot being actively read by multiple io-threads. I think for this to be a major issue, a couple of conditions need to be met: 1.1 you need a non-trivial number of io-threads running; 1.2 multiple clients need to be hitting the same slot, with read commands, at roughly the same time, causing the concurrency to drop from N (io-threads) to 1. A single slow command (write or not) is already a concern today since it blocks everything. The second condition is the key variable here IMO. It's hard to quantify without data, but my gut feeling is that it's not a common case. Of course, we'd need to benchmark this to be sure.
2. I'm more concerned about writer starvation. The current design doesn't seem to mention a mechanism to prevent a writer from getting starved out. If readers are constantly piling up on a slot, a waiting writer might never get its turn. We need some kind of "draining" phase that blocks new readers from queuing up on a slot once a writer is waiting for it.
3. The "cluster mode only" aspect is also a big sticking point. I'd side with Madelyn here. I think solving this for standalone mode is more urgent, relatively speaking. It's not at all straightforward for me to see how this design extends to standalone, where commands can access any key across the keyspace. Before we can move forward, I would like to have a much better understanding of the proposed mechanism for handling cross-partition access in standalone mode. I'd consider that a prerequisite.
4. Finally, how does this feature interact with the rest of the system? A few things come to mind: 4.1 active defrag: how does this parallel execution model affect it? 4.2 prefetching: I noticed in the associated PR (#2208) that commands sent to io-threads no longer perform prefetching. Was that intentional? 4.3 as mentioned by others in this thread, pipelining, modules, RDB snapshots, replication: these are other critical areas that need to be considered. BTW, I believe key-miss events are delivered to modules unconditionally.
Next Steps
This is a fantastic direction, but let's not rush it. Here's what I propose:
a) Let's target this for Valkey 10.0. This gives us the time to get the design right, build consensus, and test it thoroughly. A change of this magnitude deserves a major version release.
b) Let's create a formal RFC. An RFC would be the perfect place to capture the full design, including all the interaction points with the rest of the system (defrag, modules, etc.) and the proposed solution for standalone mode. It will allow the entire community to discuss and refine the proposal. You can use the template here: https://github.com/valkey-io/valkey-rfc/blob/main/TEMPLATE.md
> a) Let's target this for Valkey 10.0. This gives us the time to get the design right, build consensus, and test it thoroughly. A change of this magnitude deserves a major version release.
@PingXie I totally understand the concerns and risks in introducing large core changes very close to the release date. However, I do feel that 10.0 is too far away to point to as a target release. This is a non-breaking change (AFAIK), so there should be no risk in introducing it in 9.1. I also think it is valid, and we have done so already, to introduce incremental improvements as long as we do not introduce degradations.

I fully support an automatic, "self-tuning" system, but I think that even if we had such a system, in this case it would not fit ALL cases. The main problem is that some users test their workload and do NOT notice the "promised" improvement. That is indeed a problem, but IMO it should not stop us from releasing improvement features; maybe we should improve the wording and adjust users' expectations accordingly. We have also experimented with introducing "experimental" feature releases (e.g. second-port replication) as a safe way for customers to opt in to a new internal feature.

I would also like to add that this feature was tested under a new fuzzer implementation, which @uriyage can share, that (IMO) greatly raises my confidence in the feature's stability. Not only was it used to locate bugs in this implementation, it also helped locate many issues in earlier versions.
Regarding some other concerns raised:
> I'm more concerned about writer starvation
I think there is no new risk of "writer starvation", as the commands are scheduled by the same central point, so the order of processing is not affected by the number of io-threads or the decision to offload processing.
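One way to picture that claim, as a sketch with invented names: if the main thread dispatches each slot's commands strictly in arrival order, a read that arrives after a waiting write is never offloaded ahead of it:

```c
/* Illustrative only: per-slot FIFO dispatch on the main thread. A write at
 * the head of the queue holds back later arrivals, so readers cannot pile
 * up indefinitely in front of a waiting writer. */
typedef struct client client;
typedef struct cmd_queue cmd_queue;   /* FIFO of parsed commands (invented) */

client *queue_peek(cmd_queue *q);
client *queue_pop(cmd_queue *q);
int     is_write(client *c);
void    execute_write_on_main_thread(client *c);
void    offload_read(client *c);
extern int slot_pending[16384];       /* in-flight offloaded reads per slot */

void dispatch_slot(int slot, cmd_queue *q) {
    client *head = queue_peek(q);
    if (!head) return;
    if (is_write(head)) {
        /* Wait for earlier reads to drain; reads queued *behind* the write
         * are not dispatched, so the writer is never starved. */
        if (slot_pending[slot] == 0)
            execute_write_on_main_thread(queue_pop(q));
    } else {
        offload_read(queue_pop(q));   /* reads ahead of any write flow freely */
    }
}
```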
The "cluster mode only" aspect is also a big sticking point.
I totally get this point, but I wonder how many customers expect the same improvements from both systems? IMO it again falls under the potential "confusion" that users might have regarding the different behavior of workloads. Not that I take this point lightly; I just think that even today cluster mode offers a different scale of throughput and suffers from issues like hot slots and topology-refresh latency impact, which are only related to cluster mode. I think this is a valid point, but maybe we should think about how big of an issue it might cause users? (maybe still a big one? IDK)
- @valkey-io/core-team Should this be in 9.0 or should we defer it to 9.1?
- Defer to 9.1.
- @touitou-dan @uriyage Please document how we are going to evolve this implementation for standalone.
- @soloestoy please review and give input about how this compares to Alibaba's multi-worker solution. In the long term, do we want to move towards a fully concurrent solution?
Guys, it is SO big and SO game changing! I beg you to release it in 9.1, please please please :)
I bet Redis follows Valkey and will try to release it sooner :)
@yahorsi Is there a specific reason you need it?
@madolson Just general perf improvements that will help to utilize the hardware better, nothing specific
@yahorsi You can run cluster mode today to fully utilize the underlying hardware. This feature is more for workloads that might be CPU bound on large read requests today (SUNIONSTORE for example)
> @soloestoy please review and give input about how this compares to Alibaba's multi-worker solution. In the long term, do we want to move towards a fully concurrent solution?
I know you are on paternity leave at the moment; when you are back, can you follow up and post the multi-worker design that Alibaba uses.