[NEW] Valkey forkless save, sync, slot-migration
(This issue was split off from #1754, and represents the current direction of core team.)
Problem statement
Valkey uses the operating system’s “fork” capability for:
- save - creates a point-in-time snapshot, stored to an RDB file
- full sync - attaches a replica to a primary, performing a full copy of data to the replica
- slot migration - transfers data contents and ownership of a slot from one shard to another
Fork works by creating a second process which SHARES the memory pages of the original Valkey process. All of the pages are marked read-only. When one process alters the data, the OS clones the page to create a writable page (the other process sees the original, unmodified page). This is brilliant in its simplicity; the original Valkey process can continue running while the forked process views a point-in-time copy of the data. However, there are several problems:
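For readers less familiar with the mechanism, the following standalone program (an illustration only, not Valkey code) demonstrates the copy-on-write behavior that forked saves rely on: the child inherits a frozen, point-in-time view of memory, while the parent keeps mutating its own (lazily cloned) pages.

```c
/* Standalone illustration (not Valkey code) of the fork/copy-on-write
 * snapshot model: the child inherits a point-in-time view of memory,
 * while the parent continues to mutate its own (lazily cloned) pages. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static long dataset = 1;          /* stands in for the in-memory database */

int main(void) {
    pid_t pid = fork();           /* duplicates page tables; pages become
                                     copy-on-write, which is what makes the
                                     call expensive on large heaps         */
    if (pid < 0) {
        perror("fork");
        return EXIT_FAILURE;
    }
    if (pid == 0) {
        /* Child: "serializes" the snapshot it inherited, then exits. */
        sleep(1);
        printf("child (snapshot) sees dataset=%ld\n", dataset);   /* prints 1 */
        _exit(EXIT_SUCCESS);
    }
    /* Parent: keeps serving writes; the kernel clones each dirtied page. */
    dataset = 2;
    printf("parent continues with dataset=%ld\n", dataset);       /* prints 2 */
    waitpid(pid, NULL, 0);
    return EXIT_SUCCESS;
}
```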
- Execution of the “fork” call can be expensive. The new process is created, and the entire memory map must be traversed and marked as read-only. Times exceeding 100ms can often be seen. Valkey is unresponsive during this time.
- Memory usage is difficult to predict. On a busy system, ALL of the pages might need to be cloned. This represents a 2x memory requirement.
- Performance of the main Valkey process is degraded, as it bears the cost of copy-on-write page cloning whenever it modifies a shared page.
Proposal
Amazon has developed a forkless mechanism which has been used extensively over the past several years. We would like to contribute this functionality to open-source Valkey in several stages:
- save - forkless bgsave (snapshot) is the easiest unit of code to deliver, and will allow delivery of the base components which will support save/full-sync/slot-migration.
- full-sync - uses much of the infrastructure from save, while also introducing in-band replication data.
- slot-migration - uses the capabilities introduced by full sync, but must also be integrated into the new atomic slot-migration framework.
Note that the forkless capability will be delivered as a configurable option. There are advantages and disadvantages to the forkless approach (discussed below), and it may not benefit all users equally.
High-level design for forkless snapshot creation
For the SAVE operation (saving a snapshot to RDB), a point-in-time snapshot must be performed that is consistent at the instant that the BGSAVE command executes.
This is performed as follows:
- Header information is saved to the RDB (from the main thread).
- A utility, “bgIteration”, is used to iterate over all of the key/value pairs in the database, sending them to a background thread for serialization and output. bgIteration ensures that each key is presented unmodified.
- For each key, “Threadsave” (a client of bgIteration) serializes the key/value and writes it to the RDB.
- Once iteration completes, bgIteration invokes a final Threadsave callback (on the main thread). This allows the end-marker to be written. Threadsave then schedules the file closure to be handled in the background (using the BIO utility).
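A rough sketch of how these steps might fit together is shown below. All identifiers (bgIterationCallbacks, threadsaveOnKey, and so on) are illustrative placeholders for the structure described above, not the actual API being proposed.

```c
/* Illustrative skeleton of the Threadsave flow; all identifiers here are
 * hypothetical placeholders, not the proposed Valkey API. */
#include <stdio.h>

typedef struct {
    /* Invoked on the background thread for every key/value event. */
    void (*on_key)(void *ctx, const char *key, const void *value);
    /* Invoked on the main thread after the last key has been delivered. */
    void (*on_done)(void *ctx);
} bgIterationCallbacks;

typedef struct {
    FILE *rdb;                       /* RDB file being produced */
} threadsaveCtx;

static void threadsaveOnKey(void *ctx, const char *key, const void *value) {
    /* Step 3: serialize this key/value in RDB format and append it. */
    (void)ctx; (void)key; (void)value;
}

static void threadsaveOnDone(void *ctx) {
    /* Step 4: write the RDB end marker, then hand the close/fsync off to a
     * background I/O (BIO) job so the main thread is not blocked on disk. */
    (void)ctx;
}

void threadsaveStart(threadsaveCtx *ts) {
    /* Step 1: main thread writes the RDB header. */
    (void)ts;

    /* Step 2: start a point-in-time (CONSISTENT) bgIteration.  It delivers
     * key/value events to threadsaveOnKey on a background thread, and once
     * every key has been handled it invokes threadsaveOnDone back on the
     * main thread, which schedules the file close via BIO. */
    bgIterationCallbacks cb = { threadsaveOnKey, threadsaveOnDone };
    (void)cb;                        /* wiring to bgIteration omitted here */
}
```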
bgIteration
The bgIteration utility forms the heart of the logic. This functionality is reused in full-sync and slot-migration. It is a decoupled utility, independent of the actions of save/sync/migration.
For a given task (e.g. save), a bgIterator object is created. bgIteration is designed such that multiple iterations can occur in parallel, each with a separate bgIterator object. (Think slot-migration with several simultaneous targets.)
Once active, the bgIterator sends a series of events (mostly key/value pairs) to the iteration client (on a background thread). The iteration client is responsible for sequentially processing each event. bgIteration is responsible for sequencing events, and all of the interactions with the running Valkey instance.
While a key is in use by bgIteration, it is guaranteed that the key is not modified by Valkey activity on the main thread. bgIteration intercepts each command: for “write” commands that would update a currently in-use key, the Valkey client is blocked until bgIteration has finished with the key. Most “read” commands are unaffected.
In CONSISTENT mode (save), bgIteration provides a point-in-time view.
- Each key in the database is marked with an “epoch”. The epoch is updated at key creation time and at each modification. If the epoch is older than the consistency point, the key will need to be passed to the iteration client when iteration reaches the key. If the epoch is newer than the consistency point, it’s either a new key (which didn’t exist at the consistency point) OR it’s a key that has been modified since the consistency point (and already processed).
- If a write command attempts to modify a key which has not yet been iterated, the Valkey client is blocked, and the key is expedited for immediate (out-of-order) processing. For simple/small keys, this is optimized by cloning the key/value and allowing the Valkey client to proceed immediately, without blocking.
In NON-CONSISTENT mode (full-sync and slot-migration), bgIteration allows Valkey clients to modify keys that haven’t yet been iterated. A bgIteration client can achieve eventual consistency (by the end of the iteration) by processing a curated replication stream (in addition to the key/value data).
- If a Valkey client performs a write to a key which has already been iterated, the bgIterator will pass the associated replication data to the iteration client.
- If a Valkey client performs a write to a key which has not yet been iterated, the operation is allowed to proceed unhindered (replication is NOT sent). Eventually, the updated key will be processed by iteration.
- It’s only when a key is actively in use by bgIteration that the Valkey client needs to be blocked.
bgIteration supports both consistent and non-consistent iterations, each with the option for replication data. This provides flexibility for future use cases. As an example, a non-consistent iteration, without replication, might be suitable for a statistical analysis scan operation.
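The per-key decision on an intercepted write, in both modes, might look roughly like the sketch below. The names and types are a hypothetical rendering of the rules above, not the real implementation.

```c
/* Hypothetical rendering of the write-interception rules above; names and
 * types are illustrative only. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool consistent;          /* CONSISTENT (save) vs NON-CONSISTENT mode */
    uint64_t cut_epoch;       /* epoch recorded at the consistency point  */
} iterState;

typedef enum {
    ALLOW,                    /* let the write proceed immediately        */
    ALLOW_AND_REPLICATE,      /* proceed, but forward replication data    */
    EXPEDITE_KEY,             /* block client, process key out of order   */
    BLOCK_CLIENT              /* key actively in use by the iterator      */
} writeDecision;

writeDecision onWriteToKey(const iterState *it, uint64_t key_epoch,
                           bool key_in_use, bool key_already_iterated) {
    /* A key actively being serialized must never change under the
     * background thread, in either mode. */
    if (key_in_use) return BLOCK_CLIENT;

    if (it->consistent) {
        /* Epoch newer than the cut: the key was created or modified (and
         * therefore already handled) after the consistency point. */
        if (key_epoch > it->cut_epoch) return ALLOW;
        /* Key still reflects the consistency point: capture it before the
         * write lands (small keys may instead be cloned so the client can
         * proceed immediately). */
        return EXPEDITE_KEY;
    }

    /* Non-consistent mode (full-sync / slot-migration). */
    if (key_already_iterated) return ALLOW_AND_REPLICATE;
    return ALLOW;             /* iteration will pick up the new value later */
}
```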
Special considerations
- For Threadsave (the bgIteration client), saving of module aux-data (not key/value data) is a challenge. Prior to search capabilities, such data consisted of minimal metadata. With search, we now see extensive indexes (possibly GBs in size). Most modules are not equipped to generate a point-in-time view of this data in an iterative manner. Saving this data non-iteratively could introduce minutes of latency/impact. It is proposed that a fork-based mechanism might be needed to collect the module aux-data. This would incur the cost of the initial fork, but would reduce the amount of copy-on-write (CoW) memory used, since the forked process would be short-lived (compared to the lifespan of a complete database save). In the short term, use of the search module may preclude the use of forkless save.
- Commands which address multiple keys must be handled specially in non-consistent mode. If only some of the keys (for the command) have been iterated, the command must be blocked while the remaining keys are expedited. This is necessary to ensure that replication can be processed correctly. (A rough sketch of this check appears after this list.)
- SWAPDB is handled/tracked by bgIteration. In consistent mode the database continues to be presented to the iteration client as it was before the swap operation. In non-consistent mode, the iteration client is notified of the swap. bgIteration will still ensure that each key/value is iterated only once; however, the new DBID is used in events.
- MULTI/EXEC blocks are pre-parsed, identifying all keys (and tracking DB swaps) so that the command can be blocked (or not) as appropriate. There is no possibility of a MULTI block causing blocking during execution.
- Lua scripts can be problematic. Without declared keys, it’s impossible for bgIteration to block the client before execution if there is an in-use key. Even if the keys are declared, there can be an issue if a different DB is selected from within the script! If a Lua script attempts a write to an in-use key which hasn’t been pre-declared (or to an unprocessed key in consistent mode), a synchronous wait is performed, blocking the main thread, until bgIteration has released the key.
- Active defrag is disabled during bgIteration operations. The main thread can’t reallocate items while they are being serialized in a background thread.
- Rehashing is disabled on the current kvstore through the creation of a “safe” iterator.
- Rehashing is disabled for complex items (hash, set, sorted set) while in use by bgIteration.
- FLUSHALL terminates iteration.
- FLUSHDB of a “small” DB results in a notification to the bgIteration client, but does not terminate the iteration. This supports the use case of saving a large database while one or more trivial databases are continually flushed.
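As referenced in the multi-key bullet above, here is a rough sketch of that check. The helper functions are hypothetical placeholders, not existing Valkey code.

```c
/* Hypothetical sketch of the multi-key check referenced above; the helper
 * functions are illustrative placeholders, not existing Valkey functions. */
#include <stdbool.h>
#include <stddef.h>

typedef struct client client;

/* Assumed helpers (declarations only, for illustration): */
size_t commandKeyCount(const client *c);
const char *commandKey(const client *c, size_t i);
bool keyAlreadyIterated(const char *key);
void expediteKey(const char *key);           /* out-of-order processing   */
void blockClientUntilExpedited(client *c);   /* resume once keys are done */

/* In non-consistent mode, a command touching a mix of iterated and
 * not-yet-iterated keys cannot simply proceed: its replication effect
 * would apply to some keys the target already has and some it does not.
 * Block the client and expedite the stragglers instead. */
bool admitMultiKeyWrite(client *c) {
    size_t n = commandKeyCount(c);
    size_t iterated = 0;

    for (size_t i = 0; i < n; i++)
        if (keyAlreadyIterated(commandKey(c, i))) iterated++;

    if (iterated == 0 || iterated == n)
        return true;                       /* no mix: no special handling  */

    for (size_t i = 0; i < n; i++)
        if (!keyAlreadyIterated(commandKey(c, i)))
            expediteKey(commandKey(c, i));
    blockClientUntilExpedited(c);
    return false;                          /* command deferred             */
}
```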
Comparison with standard (forked) operations
Benefits of bgIteration/Threadsave:
- Far better memory footprint. Very little additional memory is required during the save operations. During full sync and slot-migration, replication is sent inline, eliminating client-output-buffer (COB) growth. Forked operations can incur >2x memory impacts - up to 2x for CoW memory, plus additional memory to maintain replication COB. This is the primary benefit of forkless save/sync/migration.
- Elimination of latency due to the forking operation. “fork” can take ~100-300ms in some cases.
- Background/forkless operations avoid the use of the COB. This results in one less memory copy for serialized data. In rare cases, Threadsave can be faster than BGSAVE because of this (especially with larger strings).
Benefits of forked save/sync/slot-migration:
- After the initial forking operation, the forked process has little impact on main thread operations. The main thread will only incur additional overhead for page copies. Assuming the forked process is running on a separate processor (and ideally with a separate cache), impact to the Valkey main thread is minimal. By contrast, bgIteration is directly observable on main thread CPU. There is a timer job that regularly feeds the background thread, and this can cause small impacts to latency.
- Forked operations are usually faster end-to-end. Once forked, there is a dedicated process performing the operation. It’s typical for Threadsave to be 20-25% slower than a forked save.
- Forked operations are much simpler. This is better for code maintenance.
- Forked operations are better for dealing with large keys. For example, a 10GB sorted set will cause significant client blocking in the forkless approach as the key will be locked to write operations for an extended period of time.
- Forked operations are better for dealing with large module aux-data.
Recommendation: If there is no memory concern, forked bgsave is usually the better choice. However, if memory is constrained, a forkless option is usually superior. If there are large/complex keys and the application is sensitive to write latencies, forkless may not be a good option. Modules with large aux-data (e.g. Search) may preclude use of forkless operations.
Proposed user experience
- A configuration will be provided to enable the use of forkless operations. This will add 4 bytes (average) to each key/value to maintain the epoch value. This must be configured at startup, and is not modifiable.
- A configuration will be provided to default the bgsave operation to use the forkless save. The standard (synchronous) SAVE operation will not be affected.
- The BGSAVE command will be altered to allow an optional parameter indicating the type of save operation. The client will be able to explicitly request a forked or non-forked save. By default, the type of save will be chosen based on the configuration parameters.
- The metrics supported by bgsave will also be updated by forkless save. In addition, forkless save will provide additional metrics including current/last save type and others as requested/required.
- The output (RDB) will be fully compatible with the existing RDB files. However, internally, the keys may exist in a different order. A single DB is not guaranteed to be contiguous.
Proposed software components to be delivered
- FIFO - a time/space efficient FIFO queue, used by bgIteration but suitable for general use
- mutexqueue - a mutex wrapper around FIFO, used by bgIteration, but suitable for general use
- objectGetMetadata - a function and related code to support per-key metadata. This will support the 4-byte epoch on each key (but can be extended for other uses).
- bgIteration - the core logic for performing forkless iterations
- Threadsave - a bgIteration client which performs SAVE (to disk) and full-sync (to socket)
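To give a flavor of the first two components, below is a minimal, self-contained producer/consumer queue in the spirit of FIFO plus mutexqueue, with a background thread standing in for the serialization thread. This is only an illustration; the delivered components are separate, more capable implementations.

```c
/* Minimal producer/consumer queue in the spirit of the FIFO + mutexqueue
 * components described above (illustration only, not the delivered code). */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define QCAP 64

typedef struct {
    void *items[QCAP];
    size_t head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
} mutexQueue;

static void mqInit(mutexQueue *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

static void mqPush(mutexQueue *q, void *item) {
    pthread_mutex_lock(&q->lock);
    q->items[q->tail] = item;               /* bounded queue, for brevity */
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static void *mqPop(mutexQueue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *item = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return item;
}

/* Consumer: stands in for the background serialization thread. */
static void *consumer(void *arg) {
    mutexQueue *q = arg;
    for (;;) {
        char *msg = mqPop(q);
        if (msg == NULL) break;             /* NULL = end-of-iteration */
        printf("serialized: %s\n", msg);
    }
    return NULL;
}

int main(void) {
    mutexQueue q;
    mqInit(&q);

    pthread_t bg;
    pthread_create(&bg, NULL, consumer, &q);

    /* Producer: stands in for the main thread feeding key/value events. */
    char *keys[] = { "key:1", "key:2", "key:3" };
    for (size_t i = 0; i < 3; i++) mqPush(&q, keys[i]);
    mqPush(&q, NULL);                       /* signal completion */

    pthread_join(bg, NULL);
    return EXIT_SUCCESS;
}
```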
FIFO, mutexqueue, & bgIteration have unit tests. The unit tests for bgIteration are extensive (and necessary). These tests are currently written using gtest/gmock. I recommend that we incorporate gtest/gmock as an option for unit testing (https://github.com/valkey-io/valkey/issues/2492); alternatively, these tests could be redesigned to more closely match existing Valkey unit tests. Without gmock, this would require custom design, in bgIteration, to create a mocking framework to support the tests.
Threadsave requires integration testing. These tests are currently written in Python, and would have to be redesigned in Tcl to match the existing Valkey pattern.
@JimB123 from the description of the CONSISTENT mode:
> If a write command attempts to modify a key which has not yet been iterated, the Valkey client is blocked, and the key is expedited for immediate (out-of-order) processing.
How can we tell that a key has already been iterated?
This looks like yet another use case that would benefit from implementing multi-versioning in Valkey.