
Multithreaded BGSave

Open madolson opened this issue 11 months ago • 4 comments

One of the bottlenecks that we often observe in AWS is the time it takes to do a full sync, either when replacing a node or bootstrapping a new node into a cluster. We can partially solve this bottleneck by supporting the use of additional threads for a multi-threaded save and restore. The save can be trivially parallelized by having each thread own either a subset of the slots (for cluster mode) or a subset of the hash table buckets (for standalone). The restore is a bit less trivial, but by properly setting the initial state of the dictionary, we should be able to safely load data into it from multiple threads.

Each of these threads can use its own TCP connection for transmitting the full-sync data. The primary and replica will need to negotiate the number of RDB full sync connections to use.
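To make the partitioning concrete, here is a minimal sketch (illustrative only; `save_bucket_range()` is a hypothetical placeholder, not a Valkey function) in which each worker thread owns a disjoint range of buckets and serializes it independently:

```c
/*
 * Illustrative sketch only, not Valkey's implementation. Each worker thread
 * owns a disjoint range of hash-table buckets (or cluster slots) and
 * serializes that range independently, e.g. to its own connection or buffer.
 */
#include <pthread.h>
#include <stdio.h>

#define NUM_BUCKETS 16384 /* e.g. one unit per cluster slot */
#define NUM_THREADS 4

typedef struct {
    int start; /* first bucket owned by this thread (inclusive) */
    int end;   /* last bucket owned by this thread (exclusive) */
} save_job;

static void save_bucket_range(int start, int end) {
    /* Placeholder: serialize every key in buckets [start, end). */
    printf("saving buckets [%d, %d)\n", start, end);
}

static void *save_worker(void *arg) {
    save_job *job = arg;
    save_bucket_range(job->start, job->end);
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];
    save_job jobs[NUM_THREADS];
    int per_thread = NUM_BUCKETS / NUM_THREADS;

    for (int i = 0; i < NUM_THREADS; i++) {
        jobs[i].start = i * per_thread;
        jobs[i].end = (i == NUM_THREADS - 1) ? NUM_BUCKETS : (i + 1) * per_thread;
        pthread_create(&tid[i], NULL, save_worker, &jobs[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(tid[i], NULL);
    return 0;
}
```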

madolson avatar Dec 24 '24 05:12 madolson

Each of these threads can use its own TCP connection for transmitting the full-sync data. The primary and replica will need to negotiate the number of RDB full sync connections to use.

Does your suggestion here imply that the feature will only work for an rdb-channel-enabled primary/replica?

I think we can also look at ways to use a single multiplexing connection to pass all traffic on a single stream. For example, QUIC can help logically separate the data sent by each thread, OR we can logically multiplex the slot/bucket data using dedicated headers that precede each data chunk (similar to how the CRC is handled).
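As a rough illustration of the header-per-chunk idea (this is not an existing Valkey wire format, just a sketch), each chunk written to the shared connection could be prefixed with the id of the logical stream that produced it plus the payload length, so the receiver can demultiplex:

```c
/*
 * Illustrative framing only -- not an existing Valkey protocol. Every chunk
 * written to the shared connection is prefixed with the id of the logical
 * stream (save thread / slot range) and the payload length, so the replica
 * can demultiplex the interleaved chunks on a single TCP stream.
 */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Serialize a 6-byte header (2-byte stream id + 4-byte length, little-endian),
 * then the payload. Returns the total number of bytes written to `out`. */
static size_t write_chunk(unsigned char *out, uint16_t stream_id,
                          const unsigned char *payload, uint32_t len) {
    out[0] = stream_id & 0xff;
    out[1] = (stream_id >> 8) & 0xff;
    out[2] = len & 0xff;
    out[3] = (len >> 8) & 0xff;
    out[4] = (len >> 16) & 0xff;
    out[5] = (len >> 24) & 0xff;
    memcpy(out + 6, payload, len);
    return 6 + (size_t)len;
}

int main(void) {
    unsigned char buf[64];
    size_t n = write_chunk(buf, 3, (const unsigned char *)"chunk-data", 10);
    printf("framed %zu bytes for logical stream 3\n", n);
    return 0;
}
```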

ranshid avatar Dec 24 '24 05:12 ranshid

I think we can also look at ways to use a single multiplexing connection to pass all traffic on a single stream. For example, QUIC can help logically separate the data sent by each thread, OR we can logically multiplex the slot/bucket data using dedicated headers that precede each data chunk (similar to how the CRC is handled).

I suppose this is related, https://github.com/valkey-io/valkey/issues/1300.

madolson avatar Dec 24 '24 06:12 madolson

I like this idea!

By the way, what you described in this issue is the diskless full sync. I think there's also value in speeding up the RDB snapshot (to disk). We'll likely need to use a "scatter-gather" solution that writes different parts of the keyspace in parallel to separate files and then concatenates them to form the final RDB file.
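A sketch of just the "gather" step, assuming each save thread has already written its part of the keyspace to a hypothetical `rdb.part-<i>` file (this ignores the RDB header, trailer, and checksum, so it is illustrative only):

```c
/*
 * Sketch of the "gather" step of a scatter-gather save: concatenate the
 * per-thread part files into one final file. File names and counts are
 * hypothetical; a real RDB would also need its header, trailer, and checksum.
 */
#include <stdio.h>

#define NUM_PARTS 4

int main(void) {
    FILE *out = fopen("dump.rdb", "wb");
    if (!out) return 1;

    char name[64], buf[8192];
    size_t n;
    for (int i = 0; i < NUM_PARTS; i++) {
        snprintf(name, sizeof(name), "rdb.part-%d", i);
        FILE *in = fopen(name, "rb");
        if (!in) { fclose(out); return 1; }
        while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
            fwrite(buf, 1, n, out); /* append this part to the final RDB */
        fclose(in);
    }
    fclose(out);
    return 0;
}
```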

PingXie avatar Jan 08 '25 06:01 PingXie

Currently, BGSAVE is not the bottleneck. The limiting factor is on the replica side during RDB processing. There's ongoing work to improve replica-side RDB handling (I'll open an issue and update here). Until then, there's little benefit in optimizing BGSAVE further. (There's still a benefit when saving to a file, though.)

xbasel avatar May 18 '25 07:05 xbasel

Currently, BGSAVE is not the bottleneck. The limiting factor is on the replica side during RDB processing. There's ongoing work to improve replica-side RDB handling (I'll open an issue and update here). Until then, there's little benefit in optimizing BGSAVE further. (There's still a benefit when saving to a file, though.)

My intention with this issue is that it covers both the primary and replica side. I assumed we would send it across multiple links, so the replica would also need multiple threads handling and parsing the input.

madolson avatar May 29 '25 18:05 madolson

Hi! I am interested in working on this issue. @PingXie could you assign this to me please? Thanks!

Nicky-2000 avatar Jun 05 '25 22:06 Nicky-2000

Hi everyone!

I have made PR #2470, which introduces multi-threaded RDB Save/BGSave. The performance testing results were really positive, with multi-threading giving us a 3x+ speedup on many workloads! Please see the comment I left on the PR for the full performance results and design overview.

Right now, for full sync, we do not open multiple connections to the replica. Instead, we parallelize the computation and stream all data to a single "target" rio. That said, this design makes transitioning to multiple connections between the primary and replica feasible, so we can push toward that in the future.
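To illustrate the general shape of that approach (a simplified sketch, not the code from the PR): workers serialize their share into private buffers, and only the append to the shared target is protected by a lock.

```c
/*
 * Rough sketch of "parallel serialization, single output stream". This is an
 * illustration of the idea, not the implementation in PR #2470: each worker
 * serializes its share of the keyspace into a private buffer, and a mutex
 * serializes only the appends to the shared target.
 */
#include <pthread.h>
#include <stdio.h>

static FILE *target;                 /* stands in for the single "target" rio */
static pthread_mutex_t target_lock = PTHREAD_MUTEX_INITIALIZER;

static void *save_worker(void *arg) {
    int id = *(int *)arg;
    char buf[64];
    /* Serialize this worker's share into a private buffer (placeholder). */
    int len = snprintf(buf, sizeof(buf), "serialized payload from worker %d\n", id);

    pthread_mutex_lock(&target_lock); /* only the append is serialized */
    fwrite(buf, 1, (size_t)len, target);
    pthread_mutex_unlock(&target_lock);
    return NULL;
}

int main(void) {
    target = fopen("dump.rdb.part", "wb");
    if (!target) return 1;

    pthread_t tid[4];
    int ids[4] = {0, 1, 2, 3};
    for (int i = 0; i < 4; i++) pthread_create(&tid[i], NULL, save_worker, &ids[i]);
    for (int i = 0; i < 4; i++) pthread_join(tid[i], NULL);
    fclose(target);
    return 0;
}
```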

One thing I want to emphasize is that with a large number of threads, multi-threading essentially offsets the computational cost of compression, making it negligible. So it could be worth turning compression on, or using a heavier compression algorithm, when multi-threaded RDB save is on. Doing this could result in RDB save running faster with compression enabled than with it disabled, due to the smaller file size.
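For reference, compression of string values in the RDB is controlled by the existing `rdbcompression` directive, so pairing it with multi-threaded save just means keeping it enabled:

```conf
# valkey.conf: enable compression of string values when writing the RDB (existing option)
rdbcompression yes
```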

If anyone has some time to review the code I would really appreciate it!

Also, just for visibility, I am currently working on multi-threading the RDB Load functionality to give us an end-to-end multi-threaded full sync.

Nicky-2000 avatar Aug 12 '25 20:08 Nicky-2000

I have made PR https://github.com/valkey-io/valkey/pull/2470, which introduces multi-threaded RDB Save/BGSave. The performance testing results were really positive, with multi-threading giving us a 3x+ speedup on many workloads! Please see the comment I left on the PR for the full performance results and design overview.

Have you done any perf testing to see where the bottleneck is? When we tested it at AWS, we were able to get about a 60x speedup with multiple connections, so maybe it's worth doing now?

One thing I want to emphasize is that with a large number of threads, multi-threading essentially offsets the computational cost of compression, making it negligible. So it could be worth turning compression on, or using a heavier compression algorithm, when multi-threaded RDB save is on. Doing this could result in RDB save running faster with compression enabled than with it disabled, due to the smaller file size.

@sarthakaggarwal97 Would you take a look? I know you were looking at updating the compression.

madolson avatar Aug 13 '25 04:08 madolson

@sarthakaggarwal97 Would you take a look? I know you were looking at updating the compression.

Yes, I am planning to take a look. I took an initial glance, and the tests have been conducted with valkey-benchmark, where none of the values are compressed using LZF or LZ4, irrespective of value size. I would also be interested in scenarios where the workload is more real-world (with legible words, maybe a JSON structure?) and compressible. Also, I am curious at how many threads we see this behaviour.

sarthakaggarwal97 avatar Aug 13 '25 05:08 sarthakaggarwal97

Thank you for the feedback!

I would also be interested in scenarios where the workload is more real-world (with legible words, maybe a JSON structure?) and compressible. Also, I am curious at how many threads we see this behaviour.

I'll run some tests with more complex/real-world workloads. I expect we would see a higher speedup with these workloads due to the additional CPU cycles required for encoding. Do you have recommendations on the types of workloads you would like to see (different JSON sizes, maybe)?

Have you done any perf testing to see where the bottleneck is? When we tested it at AWS, we were able to get about a 60x speedup with multiple connections, so maybe it's worth doing now?

I've done perf testing on the SAVE/BGSave operations but not full sync yet. My assumption was that they would have the same bottlenecks, but I should confirm that.

What I found from perf testing SAVE:

  • Before the changes, we were CPU bound (for small keys, or if compression is on).
  • After the changes (with enough threads), we hit an IO bottleneck (we max out the ~720 MB/s disk IO throughput on the test VM).

That being said, this testing was done with key/values generated with valkey-benchmark. I believe data that has a more complex encoding process (JSON) would result in us being CPU bound for a larger range of key/value sizes.

My action items:

  • I will confirm the bottlenecks for Full Sync.
  • I will test with some more complex workloads to see how we are doing.
  • I'll post an update here, and add relevant figures to the PR comment.

Nicky-2000 avatar Aug 13 '25 20:08 Nicky-2000

Following up on our discussion. Here are the bottlenecks I found and the implementation plan I think makes the most sense.

Bottlenecks for Full Sync

  1. On the SAVE side, we are CPU bottlenecked for small keys, complex data types (JSON), and whenever compression is ON.
  2. For SAVE, we become disk-IO bottlenecked with more threads (there is no value in saving to multiple files on disk unless the machine has multiple disks and a way to distribute the storage locations).
  3. For full sync, we are initially CPU bound (same workloads as SAVE); adding more threads lets us hit the throughput limit of a single TCP socket, so there IS value in adding more connections. But I believe we are bottlenecked by how fast the replica can load data before this becomes an issue.
  4. On the load side, we are CPU bound for the small-keys, complex-data, and compression-ON workloads as well.

Testing on more complex workloads (JSON)

  • The more complex the encoding process is, the more speedup we see with the current version of RDB threads.
  • We also see that running with compression ON is faster than running with compression OFF.
    • Cool, and expected. The data shows: min save time with compression OFF (with X threads) / min save time with compression ON (with Y threads) ≈ compression-OFF RDB file size / compression-ON RDB file size. Basically, this means the speedup of compression ON versus compression OFF is proportional to the compression ratio! (See the illustrative numbers below.)
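To put illustrative (made-up) numbers on that relationship: if the compression-OFF RDB is 20 GB and the compression-ON RDB is 10 GB (a 2:1 compression ratio), and with enough threads both runs end up bound by the same ~700 MB/s of disk throughput, the saves take roughly 29 s and 14 s respectively, i.e. the 2x speedup predicted by the file-size ratio.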

Implementation Plan:

Based on these findings and my time constraints, here is the plan that I think makes the most sense:

  1. Review and merge https://github.com/valkey-io/valkey/pull/2470
    • Gives us RDB threads and solves the CPU bottleneck on the save side of full sync (and snapshotting).
    • Provides a natural foundation to move to multiple connections during full sync.
  2. Solve the CPU bottleneck on the load side for the single-connection version (code in progress; aiming for a draft PR by the end of next week).
  3. Implement multiple connections in September. (Note: I am planning on contributing in September so I can work on this; however, I will have less time because I will also be in school.)
    • It makes sense to do this last because 1 and 2 need to be done first to ensure we can hit the maximum throughput of a single TCP connection.
    • Also, the additional complexity of doing this right now might slow me down significantly, so I would prefer to break the work into smaller pieces.

Nicky-2000 avatar Aug 15 '25 19:08 Nicky-2000

@sarthakaggarwal97 @murphyjacob4 Could I get some reviews on the multi-threaded RDB Save PR? I would love to get some momentum on it.

Thanks so much!!

Nicky-2000 avatar Sep 10 '25 21:09 Nicky-2000