
[High level design] Allow Read requests on I/O threads

touitou-dan opened this issue 8 months ago

Overview We propose enhancing Valkey with multi-threaded command execution to significantly increase throughput, from the current 0.92M GET requests per second (cluster mode enabled, CME) to over 3.1M GET requests per second. Our prototype demonstrates that offloading read commands to the async threads can yield up to a 12.7x performance improvement for CPU-intensive operations.

Background Valkey 8.0 introduced async threads to handle input/output operations while the main thread focuses on command execution. Despite this advancement, a critical bottleneck remains: all command execution still occurs sequentially in the single main thread, which becomes increasingly problematic as systems scale to more cores and memory. Furthermore, the current architecture's benefits are minimal for workloads dominated by CPU-intensive operations, since offloading only I/O tasks provides limited performance gains when processing is the primary constraint.

Proposed Solution The new dict-per-slot architecture introduced in Valkey 8.0 provides an opportunity for parallelization. Since each slot has its own dictionary, commands operating on different slots could theoretically execute in parallel without contention. We suggest leveraging this feature and propose an architectural update to how the main thread interacts with other threads. In this design, the main thread maintains its role as a central scheduler, distributing tasks among threads to ensure safe access to clients and data structures. The main thread communicates with the worker threads through lock-free producer-consumer queues. [architecture diagram] The key enhancement is expanding the job types beyond I/O operations to include execute-command jobs, allowing worker threads to independently process commands. By maintaining this centralized coordination model, we eliminate the need for complex synchronization mechanisms such as locks or atomic operations. Importantly, this enhancement requires only minimal changes to the existing codebase while significantly improving resource utilization and performance scalability.
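
For illustration, here is a minimal sketch of the kind of lock-free producer-consumer queue this design relies on: a single-producer/single-consumer ring buffer. All names (`job_t`, `spsc_queue`, `queue_push`, ...) are illustrative, not Valkey's actual internals, and it assumes C11 atomics for the head/tail indices; the job payloads themselves need no further synchronization because exactly one thread produces and one thread consumes each queue.

```c
/* Illustrative SPSC job queue sketch - not Valkey's actual internals. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 1024 /* power of two, so we can mask instead of mod */

typedef enum { JOB_EPOLL_WAIT, JOB_READ, JOB_EXEC_COMMAND, JOB_WRITE } job_type;

typedef struct {
    job_type type;
    void *payload; /* e.g. a client or connection pointer */
} job_t;

typedef struct {
    job_t ring[QUEUE_CAP];
    _Atomic size_t head; /* next slot to consume, advanced by the worker */
    _Atomic size_t tail; /* next slot to produce, advanced by the main thread */
} spsc_queue;

/* Called by the main thread only. Returns false if the queue is full. */
static bool queue_push(spsc_queue *q, job_t job) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_CAP) return false; /* full */
    q->ring[tail & (QUEUE_CAP - 1)] = job;
    /* Release: the worker must see the job contents before the new tail. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Called by the owning worker thread only. Returns false if empty. */
static bool queue_pop(spsc_queue *q, job_t *out) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) return false; /* empty */
    *out = q->ring[head & (QUEUE_CAP - 1)];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

With exactly one producer (the main thread) and one consumer per queue, no job ever needs a lock; contention is confined to the two index counters.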

Read command execution The picture below depicts the execution of a GET command in the proposed architecture: [sequence diagram] In this example, thread #1 was assigned the epoll_wait job. Upon receiving the epoll job's completion, the main thread checks the result of the epoll_wait call and finds a read event on connection 18. It then determines which thread is in charge of reading from connection 18, thread #2, and produces a read job on that thread's job queue. Thread #2 reads data from the socket, parses the GET command, determines the slot owning the key, and sends a read-complete event to the main thread. On receiving the read-complete event, the main thread validates the legitimacy of the command, increments the number of pending requests on that slot, and posts an execute request on the queue of the thread in charge of executing commands for the key's slot, thread #3. Thread #3 executes the command, writes the response to the connection socket, and sends an execute-completed event to the main thread, which in turn decrements the number of pending requests on the slot and finalizes the command's accounting.
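
As a rough sketch of the main thread's part in this flow, reusing the queue sketch above (every name here is illustrative, with assumed helpers standing in for validation, error replies, and stats):

```c
/* Sketch of the main thread's routing in the read path above. */
#include <stdbool.h>

#define NUM_SLOTS   16384
#define NUM_WORKERS 18

typedef struct client { int cmd_slot; } client;

/* Assumed helpers - stand-ins, not real Valkey functions. */
bool command_is_valid(client *c);
void reply_error(client *c);
void finalize_command_accounting(client *c);

static int pending_reads[NUM_SLOTS];          /* touched by the main thread only */
static spsc_queue worker_queue[NUM_WORKERS];  /* one queue per worker thread */

/* Stable mapping from a slot to the worker owning its dictionary. */
static int slot_owner(int slot) { return slot % NUM_WORKERS; }

/* Main thread: a worker finished reading and parsing ("read complete").
 * Validate the command, then hand execution to the slot's owner. */
static void on_read_complete(client *c) {
    if (!command_is_valid(c)) { reply_error(c); return; }
    pending_reads[c->cmd_slot]++;
    job_t job = { .type = JOB_EXEC_COMMAND, .payload = c };
    queue_push(&worker_queue[slot_owner(c->cmd_slot)], job);
}

/* Main thread: the worker executed the command and already wrote the
 * reply to the socket ("execute completed"). */
static void on_exec_complete(client *c) {
    pending_reads[c->cmd_slot]--;
    finalize_command_accounting(c);
}
```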

Write command execution The picture below depicts the execution of a SET command in the proposed architecture: [sequence diagram] Here, after validating the command, the main thread determines whether it can execute the SET command immediately or must wait until all active read commands on the slot have completed. In the latter case, it puts the client in a blocked state. When the last read-completed event for that slot has been received, the main thread unblocks the client, executes the command, and sends a write job to thread #2.
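
Continuing the same hypothetical sketch, the write-path gating could look like the following. The parking here stands in for Valkey's real client-blocking machinery; for brevity it keeps one parked writer per slot, where a real implementation would keep a FIFO list of blocked clients:

```c
/* Sketch of gating a SET behind the slot's in-flight offloaded reads. */
void execute_on_main_thread(client *c);  /* assumed helper */
void post_write_job(client *c);          /* reply then sent by the I/O thread */

static client *parked_writer[NUM_SLOTS];

/* Main thread: a validated write command is ready for its slot. */
static void handle_write(client *c) {
    int s = c->cmd_slot;
    if (pending_reads[s] > 0) {
        parked_writer[s] = c;            /* block until the reads drain */
        return;
    }
    execute_on_main_thread(c);
    post_write_job(c);
}

/* Main thread: called when a read's execute-completed event brings a
 * slot's pending count to zero - the moment a blocked writer wakes. */
static void on_reads_drained(int s) {
    client *c = parked_writer[s];
    if (c == NULL) return;
    parked_writer[s] = NULL;
    execute_on_main_thread(c);
    post_write_job(c);
}
```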

Performance Evaluation We implemented the proposed update in a prototype to evaluate the performance gains in read and read/write scenarios.

Test Environment

  • Server: c7gn.16xlarge instance
  • Clients: 3 x r7g.16xlarge instances
  • Configuration: All instances in the same placement group (IAD region)
  • Valkey 8.1: 8 io-threads (adding additional threads doesn’t improve performance)
  • Prototype: 20 threads (1 main thread, 18 worker threads, 1 epoll thread)
  • Mode: All tests in cluster mode

Dataset

  • Strings: 3M keys with 512-byte values - GET and SET as the read and write commands, respectively
  • Hashes: 1M hashes, each with 50 fields (70 bytes/field) - HGETALL and HSET as the read and write commands, respectively
  • Sorted Sets: 1M sorted sets, each with 50 members (70 bytes/member) - ZRANK and ZADD as the read and write commands, respectively
  • Lists: 1M lists, each with ~50 elements (70 bytes/element) - LINDEX and LSET as the read and write commands, respectively

Benchmark Scenarios

  1. 100% read operations
  2. 80% read / 20% write operations

Benchmark Results (throughput in requests per second): String Operations:

| Workload | Valkey 8.1 | PoC 16xlarge | PoC 4xlarge |
|---|---|---|---|
| 100% Read | 924K | 3,100K | 2,005K |
| 80% R, 20% W | 900K | 1,860K | 1,855K |

Hash Operations:

| Workload | Valkey 8.1 | PoC 16xlarge | PoC 16xl, 30 threads | PoC 4xlarge |
|---|---|---|---|---|
| 100% Read | 233K | 2,138K | 2,960K | 1,193K |
| 80% R, 20% W | 266K | 1,597K | NA | 1,325K |

Sorted Set Operations:

| Workload | Valkey 8.1 | PoC 16xlarge | PoC 16xl, 9 threads | PoC 4xlarge |
|---|---|---|---|---|
| 100% Read | 465K | 2,837K | NA | 1,612K |
| 80% R, 20% W | 418K | 962K | 1,100K | 1,003K |

List Operations:

| Workload | Valkey 8.1 | PoC 16xlarge | PoC 16xl, 11 threads | PoC 4xlarge |
|---|---|---|---|---|
| 100% Read | 817K | 3,090K | NA | 1,940K |
| 80% R, 20% W | 807K | 1,591K | 1,901K | 1,816K |

Key Findings

  • String GET: 3.1 million GET requests per second, a 3.3x improvement
  • HGETALL: 9.1x acceleration, up to 12.7x with 30 threads
  • ZRANK: 6.1x acceleration
  • Mixed workloads: >2x performance even with non-parallelized writes
  • Medium instances (16 cores): Significant improvements in all tests

Conclusion This demonstrates that parallel read command execution can dramatically improve Valkey's performance, reaching over 3M transactions per second on a single instance. This approach requires limited code changes while providing significant throughput improvements across various data structures, even when write commands are also invoked.

Proposed roadmap We propose delivering the proposed update in Valkey 9.0, with the following limitations:

  • Offloading major commands only
  • Cluster Mode Only
  • Modules - Modules’ logic can have side effects that create contention conflicts with other async threads. To address this issue, module functions will be executed by the main thread in "exclusive mode", meaning no async threads will be executing commands at that time. New APIs will be introduced to allow declaring certain module components (commands, keyspace event callbacks, cron jobs) as "slot safe," which will enable module parallelization as well (see the sketch below).
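
The issue only promises that such APIs "will be introduced", so the following is a purely hypothetical sketch of what a slot-safe declaration could look like; the "slot-safe" command flag does not exist in today's module API:

```c
/* Hypothetical sketch - "slot-safe" is NOT a real module command flag. */
#include "valkeymodule.h"

int MyGetCommand(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc);

int ValkeyModule_OnLoad(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc) {
    if (ValkeyModule_Init(ctx, "mymod", 1, VALKEYMODULE_APIVER_1) == VALKEYMODULE_ERR)
        return VALKEYMODULE_ERR;
    /* Hypothetical flag: tells the core this command touches only its
     * keys' slot and no module-global state, so it may be offloaded.
     * Without it, the command runs on the main thread in exclusive mode. */
    if (ValkeyModule_CreateCommand(ctx, "mymod.get", MyGetCommand,
                                   "readonly slot-safe", 1, 1, 1) == VALKEYMODULE_ERR)
        return VALKEYMODULE_ERR;
    return VALKEYMODULE_OK;
}
```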

touitou-dan avatar Apr 29 '25 08:04 touitou-dan

Worth noting that there are proposals to allow some cross-slot command execution (e.g. https://github.com/valkey-io/valkey/pull/1986). Not necessarily a blocker, but might affect assumptions in this design.

xbasel avatar Apr 29 '25 12:04 xbasel

If slot-level write exclusivity is enforced by the main thread, couldn’t SET/WRITE commands also be offloaded?

xbasel avatar Apr 29 '25 12:04 xbasel

If slot-level write exclusivity is enforced by the main thread, couldn’t SET/WRITE commands also be offloaded?

@xbasel Write commands have many side effects, such as replication, unblocking clients, client tracking, etc., which are more complicated to handle safely from the threads.

uriyage avatar Apr 29 '25 12:04 uriyage

Amazing!

  1. I see read and execute can be done in different threads. Wouldn't it be better for the L1 cache if these two operations were done by the same thread?
  2. What about commands like SCAN and KEYS? They span multiple slots. Do they need a global slot lock?

zuiderkwast avatar Apr 29 '25 12:04 zuiderkwast

@zuiderkwast

  1. I see read and execute can be done in different threads. Wouldn't it be better for the L1 cache if these two operations were done by the same thread?
  2. What about commands like SCAN and KEYS? They span multiple slots. Do they need a global slot lock?
  1. Yes, we can optimize it if the slot isn't already being handled by a different thread.

  2. In such cases, exclusivity is required for all slots. We will stop offloading commands until those commands are processed, essentially falling back to the current behavior.
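
A rough sketch of what that fallback could look like (hypothetical names throughout): before running a cross-slot command such as KEYS, the main thread stops producing offload jobs, drains everything in flight, and is then the only thread touching the keyspace, i.e. today's single-threaded behavior.

```c
/* Illustrative global-exclusivity fallback for cross-slot commands. */
#include <stdbool.h>

typedef struct client client;
void consume_one_completion_event(void); /* assumed: pops one worker event */
void execute_on_main_thread(client *c);

static int  jobs_in_flight;     /* maintained by the main thread */
static bool offloading_paused;  /* checked before producing offload jobs */

static void run_cross_slot_command(client *c) {
    offloading_paused = true;            /* new commands now run inline */
    while (jobs_in_flight > 0)
        consume_one_completion_event();  /* drain outstanding worker jobs */
    execute_on_main_thread(c);           /* e.g. KEYS or SCAN */
    offloading_paused = false;
}
```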

uriyage avatar Apr 29 '25 14:04 uriyage

Nice, this is a great development.

Cluster Mode Only

This feels like a major limitation, given that valkey cluster can already horizontally scale. Do we think operating on the slot is the right primitive and can it be modified to support standalone distributions as well?

In such cases, exclusivity is required for all slots. We will stop offloading commands until those commands are processed, essentially falling back to the current behavior.

The difference is that while the system is running at 3M RPS, it will suddenly drop to 900K RPS when a scan operation comes in (likely from some sort of debugging tool). This could cause significant latency spikes.

Offloading major commands only

What is the proposed list of major commands?

@xbasel Write commands have many side effects, such as replication, unblocking clients, client tracking, etc., which are more complicated to handle safely from the threads.

We still have to handle complex operations like keyspace-miss notifications for reads, and evictions.

madolson avatar May 05 '25 23:05 madolson

@madolson

Cluster Mode Only

This feels like a major limitation, given that valkey cluster can already horizontally scale. Do we think operating on the slot is the right primitive and can it be modified to support standalone distributions as well?

In cluster mode, this change can be implemented with minimal code modifications while maintaining the existing core architecture. Based on community feedback, we can later extend support to non-cluster mode by implementing multiple dictionaries that allow parallel access from multiple threads. This approach will require careful handling of multi-slot commands. While this extension would require more substantial code changes and effort, it is feasible.

In such cases, exclusivity is required for all slots. We will stop offloading commands until those commands are processed, essentially falling back to the current behavior.

The difference is that while the system is running at 3M RPS, it will suddenly drop to 900K RPS when a scan operation comes in (likely from some sort of debugging tool). This could cause significant latency spikes.

We can modify the SCAN command to take exclusivity one slot at a time.

Offloading major commands only

What is the proposed list of major commands?

Theoretically, all read commands that require minimal code changes and access a single slot can be offloaded. We will need to examine each command individually to verify it is thread-safe, ensuring that the commands don't access any global variables or other shared resources.

@xbasel Write commands have many side effects, such as replication, unblocking clients, client tracking, etc., which are more complicated to handle safely from the threads.

We still have to handle complex operations like keyspace-miss notifications for reads, and evictions.

Evictions are postponed and executed in the main thread after the command-processing thread completes. As for keyspace events, the only keyspace event that can be triggered by read-only operations without eviction is the miss event. Initially, we may not offload a command if a keyspace-miss event handler is registered; we can later implement this in a thread-safe mode.
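
A hedged sketch of the deferred eviction (hypothetical names throughout): the worker never frees keys itself, it only reports that the memory limit was crossed, and the main thread evicts once no worker is touching the keyspace for that command anymore.

```c
/* Illustrative deferred-eviction handoff between worker and main thread. */
#include <stdbool.h>
#include <stddef.h>

typedef struct { bool needs_eviction; } exec_result;

size_t used_memory(void);                     /* assumed helpers */
size_t maxmemory_limit(void);
void   send_exec_complete_event(exec_result *r);
void   perform_evictions_on_main_thread(void);

/* Worker thread, right after executing the offloaded command. */
static void worker_after_command(exec_result *r) {
    r->needs_eviction = used_memory() > maxmemory_limit();
    send_exec_complete_event(r);              /* hand control back */
}

/* Main thread, on the execute-completed event. */
static void main_on_exec_complete(exec_result *r) {
    if (r->needs_eviction) perform_evictions_on_main_thread();
}
```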

uriyage avatar May 07 '25 10:05 uriyage

In cluster mode, this change can be implemented with minimal code modifications while maintaining the existing core architecture. Based on community feedback, we can later extend support to non-cluster mode by implementing multiple dictionaries that allow parallel access from multiple threads. This approach will require careful handling of multi-slot commands. While this extension would require more substantial code changes and effort, it is feasible

I guess I just see the 3M RPS as a little bit artificial, since we can achieve that today anyway with cluster mode, which can already scale read and write operations. With IO threading, functionally all workloads improved, whereas this comes with a higher number of caveats.

madolson avatar May 12 '25 17:05 madolson

The title of the issue is not descriptive. "Execute read commands in io threads" would be a better title. It should also be cross-linked with related issues such as #761. I'm just trying to get an overview and help others do the same. It gets hard to track all the issues...

zuiderkwast avatar May 13 '25 08:05 zuiderkwast

First, thank you to @touitou-dan and @uriyage for putting this together and for another massive contribution to Valkey. Really appreciate it.

Overall, this is a huge and exciting proposal. However, given its scale and the number of subsystems it touches, I don't think this is something we can realistically land in time for Valkey 9.0, which is scheduled for this September. We're just a few weeks away from the feature merge deadline, and a change this significant needs more time to bake.

I've read through the design and the comments from @madolson, @zuiderkwast, and others. Here are some of my thoughts:

  1. Madelyn raised a good point about the potential for QPS to drop when a write command appears for a slot that is being actively read by multiple io-threads. I think for this to be a major issue, a couple of conditions need to be met:
    1.1 You need a non-trivial number of io-threads running.
    1.2 Multiple clients need to be hitting the same slot, with READ commands, at roughly the same time, causing the concurrency to drop from N (io-threads) to 1.
    A single slow command (write or not) is already a concern today since it blocks everything. The second condition is the key variable here, IMO. It's hard to quantify without data, but my gut feeling is that it's not a common case. Of course, we'd need to benchmark this to be sure.

  2. I'm more concerned about writer starvation. The current design doesn't seem to mention a mechanism to prevent a writer from getting starved. If readers are constantly piling up on a slot, a waiting writer might never get its turn. We need some kind of "draining" phase that blocks new readers from queuing up on a slot once a writer is waiting for it (see the sketch after this list).

  3. The "cluster mode only" aspect is also a big sticking point. I'd side with Madelyn here. I think solving this for standalone mode is more urgent, relatively speaking. It's not at all straightforward for me to see how this design extends to standalone, where commands can access any key across the keyspace. before we can move forward, I would like to have a much better understanding of the proposed mechanism for handling cross-partition access in standalone mode. I'd consider that a prerequisite.

  4. Finally, how does this feature interact with the rest of the system? A few things come to mind:
    4.1 Active defrag: how does this parallel execution model affect it?
    4.2 Prefetching: I noticed in the associated PR (#2208) that commands sent to io-threads no longer perform prefetching. Was that intentional?
    4.3 As mentioned by others in this thread: pipelining, modules, RDB snapshots, and replication are other critical areas that need to be considered. BTW, I believe key-miss events are delivered to modules unconditionally.
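
Coming back to point 2, one possible shape for the draining gate, as a minimal sketch with hypothetical names: once a writer parks on a slot, new reads on that slot stop being offloaded, so the pending-read count can only fall and a steady stream of readers can no longer starve the writer.

```c
/* Illustrative per-slot draining gate against writer starvation. */
#include <stdbool.h>

typedef struct {
    int  pending_reads;    /* offloaded reads in flight on this slot */
    bool writer_waiting;   /* a write command is parked on this slot */
} slot_gate;

static slot_gate gate[16384];

/* Main thread: may this validated read be offloaded right now? */
static bool may_offload_read(int slot) {
    if (gate[slot].writer_waiting) return false; /* draining: run after writer */
    gate[slot].pending_reads++;
    return true;
}

static void writer_parked(int slot)   { gate[slot].writer_waiting = true; }
static void writer_executed(int slot) { gate[slot].writer_waiting = false; }
```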

Next Steps

This is a fantastic direction, but let's not rush it. Here's what I propose:

a) Let's target this for Valkey 10.0. This gives us the time to get the design right, build consensus, and test it thoroughly. A change of this magnitude deserves a major version release.

b) Let's create a formal RFC. An RFC would be the perfect place to capture the full design, including all the interaction points with the rest of the system (defrag, modules, etc.) and the proposed solution for standalone mode. It will allow the entire community to discuss and refine the proposal. You can use the template here: https://github.com/valkey-io/valkey-rfc/blob/main/TEMPLATE.md

PingXie avatar Jul 05 '25 21:07 PingXie

a) Let's target this for Valkey 10.0. This gives us the time to get the design right, build consensus, and test it thoroughly. A change of this magnitude deserves a major version release.

@PingXie I totally understand the concerns and risks in introducing large core changes very close to the release date. However, I feel that 10.0 is too far out as a target release. This is a non-breaking change (AFAIK), so there should be no risk in introducing it in 9.1. I also think it is valid, and we have done so already, to introduce incremental improvements as long as we do not introduce degradations.

I fully support an automatic, "self-tuning" system, but I think that even if we had such a system, in this case it would not fit ALL cases. The main problem is that some users test their workload and do NOT notice the "promised" improvement. That is indeed a problem, but IMO it should not stop us from releasing improvement features; maybe we should improve the wording and adjust users' expectations accordingly. We have also experimented with releasing "experimental" features (e.g. second port replication) as a safe way for customers to opt in to a new internal feature.

I would also like to add that this feature was tested under a new fuzzer implementation, which @uriyage can share, that (IMO) greatly raises my confidence in the feature's stability. Not only was it used to locate bugs in this implementation, it also helped locate many issues in earlier versions.

Regarding some other concerns raised:

I'm more concerned about writer starvation

I think there is no new risk of "writer starvation", as the commands are scheduled by the same central point, so the order of processing is not affected by the number of io-threads or by the decision to offload processing.

The "cluster mode only" aspect is also a big sticking point.

I totally get this point, but I wonder how many customers expect the same improvements from both systems? IMO it, again, falls under the potential "confusion" users might have regarding the different behavior of workloads. Not that I take this point lightly; I just think that even today cluster mode offers a different scale of throughput and suffers from issues like hot slots and topology-refresh latency impact, which are only related to cluster mode. I think this is a valid point, but maybe we should think about how big of an issue it might cause for users? (Maybe still a big one? IDK.)

ranshid avatar Jul 06 '25 13:07 ranshid

  1. @valkey-io/core-team Should this be in 9.0 or should we defer it to 9.1?
    1. Defer to 9.1.
  2. @touitou-dan @uriyage Please document how we are going to evolve this implementation for standalone.
  3. @soloestoy please review and give input about how this compares to Alibaba's multi-worker solution. In the long term, do we want to move towards a fully concurrent solution?

madolson avatar Jul 07 '25 14:07 madolson

Guys, it is SO big and SO game-changing! I beg you to release it in 9.1, please please please :)

yahorsi avatar Aug 12 '25 20:08 yahorsi

I bet Redis follows Valkey and will try to release it sooner :)

yahorsi avatar Aug 12 '25 20:08 yahorsi

@yahorsi Is there a specific reason you need it?

madolson avatar Aug 13 '25 04:08 madolson

@madolson Just general perf improvements that will help to utilize the hardware better, nothing specific

yahorsi avatar Aug 13 '25 08:08 yahorsi

@yahorsi You can run cluster mode today to fully utilize the underlying hardware. This feature is more for workloads that might be CPU bound on large read requests today (SUNIONSTORE for example)

madolson avatar Aug 13 '25 16:08 madolson

@soloestoy please review and give input about how this compares to Alibaba's multi-worker solution. In the long term, do we want to move towards a fully concurrent solution?

I know you are on paternity leave at the moment; when you are back, can you follow up and post the multi-worker design that Alibaba uses?

madolson avatar Aug 18 '25 15:08 madolson