[NEW] Valkey fork-less replication
The problem/use-case that the feature addresses
Background: The current implementation of full sync and the proposed implementation of slot migration both rely on the POSIX fork mechanism. This approach comes with a couple of drawbacks:
- For memory-bound workloads, it's either hard to predict the required memory headroom, or we need 2x memory to be safe.
- The initial fork call takes time, reducing system availability.
- It's hard to run multiple syncs / migrations in parallel (see (1)).
- Increased Linux core usage to perform copy-on-write and schedule the new process.
- If we decide to mitigate (4) and (1) by pausing writes, it will increase write latency and delay eventual consistency.
Goals: Implement full sync and slot migration without downtime or latency spikes, and with predictable memory headroom requirements.
Non-goals: This proposal is focused on fork-less slot replication algorithms and doesn't cover operator logic.
Description of the feature
Basic approach: Valkey stores data per slot, which is useful in our case. When the main thread decides to copy a slot:
- Freeze slot data
- Notify IO thread that it should copy frozen slot (note: can spread that job across multiple IO threads as well)
- Create a separate empty hashmap for updates, make the number of buckets the same as the number of buckets in the underlying frozen hashmap. Set all bucket pointers to nullptr.
- On reads: check new hashmap, if bucket is nullptr go to the frozen hashmap, otherwise use new hashmap.
- On writes/updates (a rough sketch of this read/write path follows the list):
a. If the bucket is nullptr, copy the bucket structure: build the same chain of keys, but keep the value pointers pointing to the frozen versions.
b. Find the key (if it's missing, append a node to the chain); if the pointer still points to the frozen value, create a new value object, switch the pointer, copy the value, and apply the write/update.
- Stream write updates in parallel (dual channel replication); updates are stored on the replica and applied once the snapshot is processed.
- Once relocation is finished, collapse the new hashmap into the frozen copy (and then delete the new hashmap and unfreeze). On each main thread iteration, collapse only a limited subset of keys to keep latency predictable. Don't copy values, just change pointers, and don't forget to add new keys.
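To make the read/write path above concrete, here's a rough sketch in plain C. All helper names (copyBucketChainSharingValues, findOrAppend, duplicateValue) are hypothetical; this only illustrates the nullptr-fallback read and the bucket copy-on-write, not proposed code:

```c
#include <stdint.h>
#include <string.h>

/* Frozen (read-only) table plus an overlay table with the same bucket count;
 * a NULL overlay bucket means "fall through to the frozen table". */
typedef struct entry {
    char *key;
    void *val;            /* may alias the frozen value until the first write */
    int owns_val;         /* set once this overlay entry holds a private copy */
    struct entry *next;
} entry;

typedef struct table { entry **buckets; size_t nbuckets; } table;

/* Hypothetical helpers assumed to exist elsewhere. */
void copyBucketChainSharingValues(table *overlay, table *frozen, size_t b);
entry *findOrAppend(table *overlay, size_t b, const char *key);
void *duplicateValue(const void *val);

/* Read path: NULL overlay bucket => look in the frozen table instead. */
void *overlayLookup(table *overlay, table *frozen, const char *key, uint64_t hash) {
    size_t b = hash & (overlay->nbuckets - 1);
    table *t = overlay->buckets[b] ? overlay : frozen;
    for (entry *e = t->buckets[b]; e; e = e->next)
        if (strcmp(e->key, key) == 0) return e->val;
    return NULL;
}

/* Write path: step (a) clones the bucket chain (values still shared with the
 * frozen table), step (b) gives the entry a private value copy before mutation. */
entry *overlayPrepareWrite(table *overlay, table *frozen, const char *key, uint64_t hash) {
    size_t b = hash & (overlay->nbuckets - 1);
    if (!overlay->buckets[b])
        copyBucketChainSharingValues(overlay, frozen, b);   /* (a) */
    entry *e = findOrAppend(overlay, b, key);                /* (b) */
    if (!e->owns_val) {
        e->val = duplicateValue(e->val);
        e->owns_val = 1;
    }
    return e;
}
```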
Slot copy flow and throttling: To make memory usage and performance more predictable, let's limit the number of slots that can be moved in parallel (this should be a config value; let's call it N).
- At any point, the main thread keeps lists of slots currently being copied and slots waiting to be copied.
- When the main thread receives a command to copy a slot, it puts it on the waiting list if capacity has already been reached.
- Once a slot is fully copied (including the hashmap collapse), the main thread picks the next one to move.
- Prioritize slot-move operations over full syncs.
- Among full syncs, prioritize the replica with fewer slots left to copy (so it can come fully online earlier).
- If multiple replicas need the same slot, process them together.
- Don’t do defrag or rehashing during the slot copy (for that slot), put those operations into the queue at lower priority.
- Scan cursors: restart the bucket scan if we detect that the bucket state changed between the replicated (frozen) version and the normal one.
Alternatives you've considered
Instead of streaming commands (which are accumulated on the replica and processed later, after the snapshot is received), let's stream the resulting key-value pairs (including tombstones) together with a sequence number. This allows us to apply the snapshot and the increment in parallel.
- Maintain an external (to the regular data structure) index of key -> sequence number.
- No matter what the source, apply a new value only if its sequence number is higher.
- Delete the external key -> sequence number index once replication is done.
Note: in this scheme we don't actually need a frozen snapshot (i.e., a single consistent sequence number); rather, we have to ensure that every key is streamed at least once. Caveat: some values can be big and composite (like HSET), so streaming the whole value could increase cost. Sub-key versioning would mitigate this, but it adds a lot of complexity, making the approach less feasible.
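A minimal sketch of the merge rule, assuming the external per-key sequence-number index described above (seqIndexGet/seqIndexPut and the key setters are hypothetical placeholders):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers for the external key -> sequence-number index. */
int seqIndexGet(const char *key, uint64_t *seq_out);   /* returns 1 if the key is tracked */
void seqIndexPut(const char *key, uint64_t seq);
void setKey(const char *key, const void *value);
void deleteKey(const char *key);

/* An update may arrive from either the snapshot stream or the incremental
 * stream; it is applied only if its sequence number is newer. */
int applyIfNewer(const char *key, const void *value, uint64_t seq) {
    uint64_t cur;
    if (seqIndexGet(key, &cur) && cur >= seq)
        return 0;                         /* stale: the other stream already won */
    if (value == NULL) deleteKey(key);    /* NULL value acts as the tombstone */
    else setKey(key, value);
    seqIndexPut(key, seq);                /* kept only until replication finishes */
    return 1;
}
```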
Thanks @ebarskiy for starting this discussion. Tagging a few folks to get their initial thoughts @madolson @PingXie @zuiderkwast
Yes, this is very interesting but also quite complex.
I agree with the problems of fork, especially "we need 2x memory to be safe". We are wasting a lot of memory to be safe. We absolutely don't want the OOM-killer to kill us. Dumping one slot at a time would save a lot of memory.
A manual copy-on-write for slots sounds like a very powerful primitive. It's not unlike MVCC per slot. We could use it for other cool stuff, maybe. Transactions and scripts with read-committed / read-uncommitted semantics. Rollback of scripts; no more unkillable scripts. Distributed transactions.......
Some random reflections:
- Freeze slot data
- Notify IO thread that it should copy frozen slot (note: can spread that job across multiple IO threads as well)
It's great that this can be done by IO threads, but we should not forget single-threaded mode here. You get the most out of the CPU if you scale horizontally with many single-threaded cluster nodes. If we get rid of the forking, a node can run on a VM or pod with only a single core. Currently, such a node needs to be deployed with two cores, just to have an extra core ready for the few times we have a forked child.
- Create a separate empty hashmap for updates, make the number of buckets the same as the number of buckets in the underlying frozen hashmap.
- On reads: check new hashmap, if bucket is nullptr go to the frozen hashmap, otherwise use new hashmap.
- On writes/updates: a. if the bucket is nullptr, copy the bucket structure, making the same chain of keys, but with the value pointers still pointing to the frozen versions. b. Find the key (if it's missing, append a node to the chain); if the pointer points to the frozen value, make a new value object, change the pointer, copy the value and apply the write/update.
I imagine we use robj's reference counter for this manual copy-on-write. Is this your idea?
For large non-string keys (hashes, sets, lists, sorted-sets, streams) copying can be slow and cause a latency spike. Maybe we should consider a nested freeze + copy-on-write for these, as you mentioned. Another option to consider is to block the client until it's unfrozen, to avoid duplicating the huge key.
Another caveat is module-defined keys. Duplicating a value is possible for all types except module-defined datatypes. For those it is only possible if the module provides a copy callback, which is not mandatory. This is used by the COPY command, which can return -ERR not supported for this module key in that case. This is a problem that we need to solve. Maybe we can serialize it immediately instead of doing the copy-on-write. Or block the client until the slot is unfrozen to avoid duplicating the key.
Currently, we don't expose the internal details like bucket structure in the hashtable API. Are you proposing that we shall expose these internals to the caller (kvstore?) or should we do this internally in the hashtable implementation and expose a freeze or some kind of copy-on-write API?
Don’t do defrag or rehashing during the slot copy (for that slot), put those operations into the queue at lower priority.
It's interesting that you mention incremental rehashing here. The use of two tables for the freeze + copy-on-write thing seems very similar to how incremental rehashing is implemented with two tables internally: table[0] and table[1]. Rehashing works with two tables where keys are incrementally moved from one to the other. Reads check both tables. Writes (except deletes) go to the new table. If we don't need to support incremental rehashing and copy-on-write at the same time, then maybe we can use the same two tables internally? We may get some things like scan for free, because it can work the same as it does during rehashing.
On the other hand, if we implement freeze + copy-on-write outside the hashtable using two separate hashtable instances, then I don't think we shall require them to be of the same size and have the same bucket structure. We can simply do two separate lookups in two separate tables. When we delete a key in the non-frozen copy, we could use a special tombstone object. It allows us to distinguish this case from the case when we fallback to lookup the frozen table. When we merge this tombstone into the frozen table, we delete the key. I don't think it's that hard to implement scan either. We can scan both tables in parallel using the same cursor.
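For illustration, the lookup with a tombstone could be as simple as this (a minimal sketch; tableFind and the sentinel are placeholders, not the existing hashtable API):

```c
#include <stddef.h>

/* Two independent tables: the writable "newer" table and the frozen one.
 * tableFind() is a hypothetical lookup that returns 1 on a hit. */
typedef struct table table;
int tableFind(table *t, const char *key, void **value_out);

static int tombstone_sentinel;
#define TOMBSTONE ((void *)&tombstone_sentinel)

void *lookupWithFrozen(table *newer, table *frozen, const char *key) {
    void *v;
    if (tableFind(newer, key, &v))
        return v == TOMBSTONE ? NULL : v;   /* tombstone hides any frozen value */
    if (tableFind(frozen, key, &v))
        return v;                           /* fall back to the frozen copy */
    return NULL;
}
```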
It's great that this can be done by IO threads, but we should not forget single-threaded mode here. You get the most out of the CPU if you scale horizontally with many single-threaded cluster nodes. If we get rid of the forking, a node can run on a VM or pod with only a single core. Currently, such a node needs to be deployed with two cores, just to have an extra core ready for the few times we have a forked child.
Good point. If IO threads are disabled, we should copy the snapshot from the main thread, in chunks, as part of the main loop.
I imagine we use robj's reference counter for this manual copy-on-write. Is this your idea?
On write, we need an indicator inside the new hashmap of whether it owns the object or points to a copy; checking the reference counter is a good option.
For large non-string keys (hashes, sets, lists, sorted-sets, streams) copying it can be slow and cause a latency spike. Maybe we should consider a nested freeze + copy-on-write for these, as you mentioned. Another option to consider is to block the client until its unfrozen to avoid duplicating the huge key.
Totally agree. I'm a bit biased against blocking anything, but maybe making it a config option is fine. Nested copy-on-write is awesome and straightforward for simple cases like hash maps, but it gets complicated for lists and the various concatenation/intersection operations on collections. While everything is possible, I feel the complexity required to solve all those cases makes it harder to justify, at least for the first version.
Another caveat is module-defined keys. Duplicate a value is possible for all types except module-defined datatypes. It is possible if the module provides a copy callback for it, but it's not mandatory. This is used by the COPY command. It can return -ERR not supported for this module key in that case. This is a problem that we need to solve. Maybe we can serialize it immediately instead of doing the copy-on-write. Or block the client until the slot is unfrozen to avoid duplicating the key.
Really interesting, I should learn more about it. Looking at the documentation https://valkey.io/topics/modules-native-types/ I don't see a way to define copy. Could you point me in the right direction? The worst-case scenario is to copy via serialization, but that feels unnecessarily expensive.
Currently, we don't expose the internal details like bucket structure in the hashtable API. Are you proposing that we shall expose these internals to the caller (kvstore?) or should we do this internally in the hashtable implementation and expose a freeze or some kind of copy-on-write API?
Actually, I'd prefer to hide this functionality inside the hashtable itself and expand its API with more operations like fork, snapshot scan, and collapse.
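Something along these lines, purely to illustrate the surface I have in mind (made-up names, not the current hashtable.h API):

```c
#include <stddef.h>

/* Hypothetical additions to the hashtable API; the copy-on-write machinery
 * stays hidden inside the implementation. */
typedef struct hashtable hashtable;                 /* existing opaque type */
typedef struct hashtableSnapshot hashtableSnapshot; /* new: frozen view handle */

/* Freeze the current contents; subsequent writes go to an internal overlay. */
hashtableSnapshot *hashtableFork(hashtable *ht);

/* Iterate the frozen view (e.g. from an IO thread); returns 0 when exhausted. */
int hashtableSnapshotScan(hashtableSnapshot *snap,
                          void (*fn)(void *privdata, const void *key, const void *value),
                          void *privdata, size_t max_items);

/* Fold up to max_items overlay entries back into the main table; returns the
 * number of remaining entries so the caller can spread the work across cycles. */
size_t hashtableCollapse(hashtable *ht, hashtableSnapshot *snap, size_t max_items);
```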
It's interesting that you mention incremental rehashing here. The use of two tables for the freeze + copy-on-write thing seems very similar to how incremental rehashing is implemented with two tables internally: table[0] and table[1]. Rehashing works with two tables where keys are incrementally moved from one to the other. Reads check both tables. Writes (except deletes) go to the new table. If we don't need to support incremental rehashing and copy-on-write at the same time, then maybe we can use the same two tables internally? We may get some things like scan for free, because it can work the same as it does during rehashing.
Looks like I should check the code and see if it's reusable.
On the other hand, if we implement freeze + copy-on-write outside the hashtable using two separate hashtable instances, then I don't think we shall require them to be of the same size and have the same bucket structure. We can simply do two separate lookups in two separate tables. When we delete a key in the non-frozen copy, we could use a special tombstone object. It allows us to distinguish this case from the case when we fallback to lookup the frozen table. When we merge this tombstone into the frozen table, we delete the key. I don't think it's that hard to implement scan either. We can scan both tables in parallel using the same cursor.
Correct, doing two isolated hashtables would be faster to implement, and maybe it's indeed the best way to build a first version. What I actually tried to achieve while imagining a data structure:
- Low overhead on regular reads/writes, compared to two separate hashtables.
- Low overhead on collapse.
- Spread all additional fork and copy-on-write cost across commands to mitigate latency spikes.
As for the tombstones, we need them either way, and I believe we can just use nullptr instead of a special object?
Perhaps a synchronization method based on multiple files could be considered for this area. A comparison:
- Fork scheme: requires double the reserved space, and the fork system call itself blocks the main thread.
- Forkless scheme: requires designing complex multi-version objects, and efficiency may not be high because there are a large number of small memory copies and the main thread gets blocked.
There is another store that already holds the entire dataset: the AOF file.
- Implement a new interaction method to transmit the AOF to the target end through zero-copy.
- Incremental requests are synchronized to the remote end using the dual-channel method.
- Support data recovery from the AOF.
- Utilize disk space to reduce memory consumption.
- It also avoids blocking the main thread.
The forkless approach at AWS does something slightly different but basically achieves the same effect.
Instead of creating a new empty table, we maintain a special type of iterator on the slot we are iterating. The iterator goes through each key in the slot, locks it from writes, sends it to a background thread which serializes it, then unlocks the key. The iterator keeps track of where it is in the slot, so we can determine whether an incoming write is in front of or behind the iterator. If a write is on a key that hasn't yet been processed (ahead of the iterator), then we process it normally. If it's behind the iterator, we have already processed the key, so we replicate the write command after it's been executed.
If the key is currently being processed, we pause the incoming writes until it's been serialized (or sometimes do a copy-on-write; there is a lot of optimization room here). The key is logically locked and shared with the IO thread serializing the data. Concurrent reads are allowed on the object while it's being serialized. What we observed is that such blocked keys are often very large (and usually read heavy). One other thing worth noting is that we don't stall the command execution pathway if a command is accessing a blocked key; instead we just execute commands from other clients. So although access to some keys might see a latency spike, the overall server is still able to process commands.
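Roughly, the per-write decision looks like this (a simplified sketch with made-up names, not the actual implementation):

```c
/* Sketch only: classifying an incoming write relative to the slot iterator.
 * slotIterator, keyStateForIterator() and the helpers below are hypothetical. */
typedef struct slotIterator slotIterator;
typedef enum { KEY_AHEAD_OF_ITER, KEY_BEHIND_ITER, KEY_BEING_SERIALIZED } keyState;

keyState keyStateForIterator(slotIterator *it, const void *key);
void executeWrite(void *client);
void replicateToTarget(slotIterator *it, void *client);
void blockClientOnKey(void *client, const void *key);

void onWriteCommand(slotIterator *it, void *client, const void *key) {
    switch (keyStateForIterator(it, key)) {
    case KEY_AHEAD_OF_ITER:
        /* Not serialized yet: execute normally; iteration will pick up the new value. */
        executeWrite(client);
        break;
    case KEY_BEHIND_ITER:
        /* Already serialized: execute, then forward the write command to the target. */
        executeWrite(client);
        replicateToTarget(it, client);
        break;
    case KEY_BEING_SERIALIZED:
        /* In flight on the background thread: pause this client (or fall back to CoW). */
        blockClientOnKey(client, key);
        break;
    }
}
```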
There are a bunch of optimizations here that we have discussed, but haven't done much to implement them since we haven't seen much need for them at our scale.
One of the benefits of this approach is that there is no special secondary table. We had discussed having two hash tables in the past, but then it requires a lot of complexity with the special second table and how it interacts with a bunch of commands.
As was mentioned, forks are very expensive operations. We really only use this approach in low-memory scenarios. The default behavior in AWS is the fork-based approach, since it provides a more uniform performance experience.
Looking at the documentation https://valkey.io/topics/modules-native-types/ I don't see a way to define copy. Could you point me in the right direction? The worst-case scenario is to copy via serialization, but that feels unnecessarily expensive.
@ebarskiy The documentation seems to be outdated. The structs are defined in valkeymodule.h. In the ValkeyModuleTypeMethods struct, there's a copy callback and a copy2 callback. See here: https://github.com/valkey-io/valkey/blob/8.1.0-rc1/src/valkeymodule.h#L1118
To see what they do, we need to read source code I guess. The call chain in Valkey for the COPY command is copyCommand -> moduleTypeDupOrReply -> mt->copy2 (where mt is a ValkeyModuleType).
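For a module author it looks roughly like this (a sketch from memory; the exact field names and the callback signature should be double-checked against the valkeymodule.h linked above, and all the MyType*/myTypeDup helpers are hypothetical):

```c
/* Sketch only -- check valkeymodule.h for the authoritative signatures. */
static void *MyTypeCopy(ValkeyModuleString *fromkey, ValkeyModuleString *tokey,
                        const void *value) {
    (void)fromkey; (void)tokey;
    return myTypeDup(value);   /* hypothetical deep copy of the module's own struct */
}

ValkeyModuleTypeMethods tm = {
    .version = VALKEYMODULE_TYPE_METHOD_VERSION,
    .rdb_load = MyTypeRdbLoad,
    .rdb_save = MyTypeRdbSave,
    .free = MyTypeFree,
    .copy = MyTypeCopy,        /* without copy/copy2, COPY (and value duplication) fails for this type */
};
/* Registered with ValkeyModule_CreateDataType(ctx, "mytype-00", 0, &tm). */
```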
The forkless approach at AWS does something slightly different but basically achieves the same effect.
Instead of creating a new empty table, we maintain a special type of iterator on the slot we are iterating.
@madolson This sounds like a simpler idea.
I'm sad to hear that sometimes fork is still better. I would hope we could just replace fork with something strictly better.
@artikell AOF-based full sync using zero-copy sounds like an interesting idea too. It requires disk storage though (obviously). Currently we use fork for AOF-rewrite; I think that can be disabled, but without rewrite the AOF can become very big over time. If we eventually want to get rid of all forking, then we should solve forkless AOF-rewrite too.
I'm sad to hear that sometimes fork is still better. I would hope we could just replace fork with something strictly better.
If it weren't for large collection data types, it would be. The simplicity of copy-on-write for memory pages really works well. There is also an interesting world where we implement multi-threaded BGSave in the background thread, which is fairly straightforward to do and would dramatically reduce CoW.
Instead of creating a new empty table, we maintain a special type of iterator on the slot we are iterating. The iterator goes through each key in the slot, locks it from writes, sends it to a background thread which serializes it, then unlocks the key. The iterator keeps track of where it is in the slot, so we can determine whether an incoming write is in front of or behind the iterator. If a write is on a key that hasn't yet been processed (ahead of the iterator), then we process it normally. If it's behind the iterator, we have already processed the key, so we replicate the write command after it's been executed.
@madolson I thought in that direction, but I'm not sure there is an efficient way to process various edge commands that touch keys located both before and after the iterator, like LMOVE or SUNIONSTORE: basically any transaction that reads from a key that is not processed yet and writes to a key that's already been copied. It feels like the only option is to serialize the 'read' value together with the command before sending it to the replica, and that value could potentially be big.
As for performance, what's the impact of per-key locking on throughput?
I thought in that direction, but I'm not sure there is an efficient way to process various edge commands that touch keys located both before and after the iterator, like LMOVE or SUNIONSTORE: basically any transaction that reads from a key that is not processed yet and writes to a key that's already been copied. It feels like the only option is to serialize the 'read' value together with the command before sending it to the replica, and that value could potentially be big.
You are pretty much spot on. It's more complexity on the command processing side, but in our testing it was more performant and leaves the core data structures more intact. The per-key locking overhead is rather negligible, since it's mostly L1-L2 cache accesses as long as the number of locked items is rather small. I also think the idea of command locking has some further uses, such as locking large keys to do SUNIONSTORE, for example.
Is it difficult to pre-lock the keys, especially for scenarios such as Lua and modules? They are transactional operations, but they may not know which keys need to be locked before execution. This either breaks the transactional nature or forces blocking the execution of all commands.
Especially for scenarios such as Lua and modules.
Lua scripts are supposed to pre-declare keys. I agree about modules though. We might have to fallback to CoW on the module case.
Especially for scenarios such as Lua and modules.
Lua scripts are supposed to pre-declare keys. I agree about modules though. We might have to fallback to CoW on the module case.
A Lua script may construct a key name and then execute a command on it. Reference:
EVAL "local warehouse = KEYS[1]; local product = ARGV[1]; local key = "{" .. warehouse .. "}" .. product; local increment = tonumber(ARGV[2]); local newCount = redis.call('INCRBY', key, increment); return newCount" 1 "warehouse1" "productA" 5
In this case, pre-locking can't be achieved, if I'm not mistaken.
Lua scripts are supposed to pre-declare keys
Yeah, but in many cases they don't. It's a workaround that people use deliberately to get cross-slot transactions. Also things like what @artikell showed, but even just using ARGV instead of KEYS. I don't think we can afford breaking this.
Yeah, but in many cases they don't. It's a workaround that people use deliberately to get cross-slot transactions. Also things like what @artikell showed, but even just using ARGV instead of KEYS. I don't think we can afford breaking this.
We don't have to break it, but the performance will be bad: we need to either wait until the key is unblocked (which depends on the background serialization, and meanwhile the world is frozen) or create a copy of the key (same as in the current suggestion) and unlink the current one from the dictionary, which is also expensive. Basically, I'm OK with an opt-in feature given sub-optimal performance for edge cases.
Btw, generally speaking the fork-less approach is very easy to do as long as objects are small.
We've been running with this design for ~8 years in AWS. We have made some tuning here, and don't see complaints about the performance. There are some ways to iterate on the design to alleviate this, like implementing field-level blocking (for hash, set, sorted set) as well as range-based blocking (for streams and lists), which we never saw enough reason to build. This approach wouldn't solve large modules, but we need a larger conversation about modules still. For example, the VSS search implementation is significantly more performant in the fork case, since it has a consistent snapshot.
I hope we can design something that can replace the fork-based snapshots in all scenarios so we can drop the fork-based method completely. Supporting multiple different ways means high maintenance cost and complexity.
For the AWS service, it makes sense to use opt-in for a new method but to keep supporting the fork-based method by default because it's aligned with Valkey, the open-source reference implementation if you will. We don't need this strategy for Valkey though.
The simplicity of copy-on-write for memory pages really works well. There is also an interesting world where we implement multi-threaded BGSave in the background thread, which is fairly straightforward to do and would dramatically reduce CoW.
I'm curious what you meant by this @madolson - Do you just mean that the forked process would be multi-threaded to achieve higher throughput and reduce CoW simply by being faster? Or would it be a fork-less implementation where we add CoW support to hashtable and avoid copying memory that isn't user data?
Do you just mean that the forked process would be multi-threaded to achieve higher throughput and reduce CoW simply by being faster?
Yes. We can use multiple threads in the forked process.
Problem statement
Valkey uses the operating system’s “fork” capability for:
- save - creates a point-in-time snapshot, stored to an RDB file
- full sync - attaches a replica to a primary, performing a full copy of data to the replica
- slot migration - transfers data contents and ownership of a slot from one shard to another
Fork works by creating a second process which SHARES the memory pages of the original Valkey process. All of the pages are marked read-only. When one process alters the data, the OS clones the page to create a writable page (the other process sees the original, unmodified page). This is brilliant in its simplicity; the original Valkey process can continue running while the forked process views a point-in-time copy of the data. However, there are several problems:
- Execution of the “fork” call can be expensive. The new process is created, and the entire memory map must be traversed and marked as read-only. Times exceeding 100ms can often be seen. Valkey is unresponsive during this time.
- Memory usage is difficult to predict. On a busy system, ALL of the pages might need to be cloned. This represents a 2x memory requirement.
- Performance on the main Valkey process is degraded as it bears the responsibility for memory page cloning.
Proposal
Amazon has developed a forkless mechanism which has been used extensively over the past several years. We would like to contribute this functionality to open-source Valkey in several stages:
- save - forkless bgsave (snapshot) is the easiest unit of code to deliver, and will allow delivery of the base components which will support save/full-sync/slot-migration.
- full-sync - uses much of the infrastructure from save, while also introducing in-band replication data.
- slot-migration - uses the capabilities introduced by full sync, but must also be integrated into the new atomic slot-migration framework.
Note that the forkless capability will be delivered as a configurable option. There are advantages/disadvantages of the forkless approach (discussed below) - and this approach may not benefit all users equally.
High-level design for forkless snapshot creation
For the SAVE operation (saving a snapshot to RDB), a point-in-time snapshot must be performed that is consistent at the instant that the BGSAVE command executes.
This is performed as follows:
- Header information is saved to the RDB (from the main thread).
- A utility, “bgIteration”, is used to iterate over all of the key/value pairs in the database, sending them to a background thread for serialization and output. bgIteration ensures that the keys are presented, unmodified.
- For each key, “Threadsave” (a client of bgIteration) serializes the key/value and writes it to the RDB.
- Once iteration completes, bgIteration invokes a final Threadsave callback (on the main thread). This allows the end-marker to be written. Threadsave then schedules the file closure to be handled in the background (using the BIO utility).
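To illustrate the split between bgIteration and its clients, the client interface might look roughly like this (hypothetical names only; the actual interface will be defined by the PR):

```c
#include <stddef.h>

/* Hypothetical shape of a bgIteration client. Events are delivered
 * sequentially, on a background thread unless noted otherwise. */
typedef struct bgIterationClient {
    /* One event per key/value pair of the frozen view. */
    void (*onKeyValue)(void *priv, int dbid, const void *key, const void *value);
    /* Curated replication data for already-iterated keys (non-consistent mode). */
    void (*onReplication)(void *priv, const char *buf, size_t len);
    /* Final callback, invoked on the main thread after the last key; Threadsave
     * writes the RDB end marker here and hands the file to BIO for closing. */
    void (*onDone)(void *priv);
    void *priv;
} bgIterationClient;
```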
bgIteration
The bgIteration utility forms the heart of the logic. This functionality is reused in full-sync and slot-migration. It is a decoupled utility, independent from the actions of save/sync/migration.
For a given task (e.g. save), a bgIterator object is created. bgIteration is designed such that multiple iterations can occur in parallel, each with a separate bgIterator object. (Think slot-migration with several simultaneous targets.)
Once active, the bgIterator sends a series of events (mostly key/value pairs) to the iteration client (on a background thread). The iteration client is responsible for sequentially processing each event. bgIteration is responsible for sequencing events, and all of the interactions with the running Valkey instance.
While a key is being used by bgIteration, it is ensured that the key is not modified by Valkey activity on the main thread. bgIteration intercepts each command. For “write” commands, which will update a currently in-use key, the Valkey client is blocked until bgIteration has completed working on the key. Most “read” commands are unaffected.
In CONSISTENT mode (save), bgIteration provides a point-in-time view.
- Each key in the database is marked with an “epoch”. The epoch is updated at key creation time and at each modification. If the epoch is older than the consistency point, the key will need to be passed to the iteration client when iteration reaches the key. If the epoch is newer than the consistency point, it’s either a new key (which didn’t exist at the consistency point) OR it’s a key that has been modified since the consistency point (and already processed).
- If a write command attempts to modify a key which has not yet been iterated, the Valkey client is blocked, and the key is expedited for immediate (out-of-order) processing. For simple/small keys, this is optimized by cloning the key/value and allowing the Valkey client to proceed immediately, without blocking.
In NON-CONSISTENT mode (full-sync and slot-migration), bgIteration allows Valkey clients to modify keys that haven’t yet been iterated. A bgIteration client can achieve eventual consistency (by the end of the iteration) by processing of curated replication (in addition to the key/value data).
- If a Valkey client performs a write to a key which has already been iterated, the bgIterator will pass the associated replication data to the iteration client.
- If a Valkey client performs a write to a key which has not yet been iterated, the operation is allowed to proceed unhindered (replication is NOT sent). Eventually, the updated key will be processed by iteration.
- It’s only when a key is actively in use by bgIteration that the Valkey client needs to be blocked.
bgIteration supports both consistent and non-consistent iterations, each with the option for replication data. This provides flexibility for future use cases. As an example, a non-consistent iteration, without replication, might be suitable for a statistical analysis scan operation.
Special considerations
- For Threadsave (the bgIteration client), saving of module aux-data (not key/value data) is a challenge. Prior to search capabilities, such data consisted of minimal meta-data. With search, we now see extensive indexes (possibly GBs in size). Most modules are not equipped to generate a point-in-time view of this data in an iterative manner. Saving of this data in a non-iterative manner would create potentially minutes of latency/impact. It is proposed that a fork-based mechanism might need to be used to support collection of the module aux-data. This would incur the cost of the initial fork, but would reduce the amount of copy-on-write (CoW) memory utilized as the forked process would be short lived (as compared to the lifespan of a complete database save). In the short-term, use of the search module may preclude the use of forkless save.
- Commands which address multiple keys must be handled specially in non-consistent mode. If only some of the keys (for the command) have been iterated, the command must be blocked while the remaining keys are expedited. This is necessary to ensure that replication may be properly processed.
- SWAPDB is handled/tracked by bgIteration. In consistent mode the database continues to be presented to the iteration client as it was before the swap operation. In non-consistent mode, the iteration client is notified of the swap. bgIteration will still ensure that each key/value is iterated only once, however, the new DBID is used in events.
- MULTI/EXEC blocks are pre-parsed, identifying all keys (and tracking DB swaps) so that the command can be blocked (or not) as appropriate. There is no possibility of a MULTI block causing blocking during execution.
- LUA scripts can be problematic. Without declared keys it’s impossible for bgIteration to block the client before execution if there is an in-use key. Even if the keys are declared, there can be an issue if a different DB is selected from within the script! In the case that a LUA script attempts a write to an in-use key which hasn’t been pre-declared (or unprocessed key in consistent mode), a synchronous wait is performed, blocking the main thread, until bgIteration has released the key.
- Active defrag is disabled during bgIteration operations. The main thread can’t reallocate items while they are being serialized in a background thread.
- Rehashing is disabled on the current kvstore through the creation of a “safe” iterator.
- Rehashing is disabled for complex items (hash, set, sorted set) while in use by bgIteration.
- FLUSHALL terminates iteration.
- FLUSHDB of a “small” DB results in a notification to the bgIteration client, but does not terminate the iteration. This supports the use case of saving a large database while one or more trivial databases are continually flushed.
Comparison with standard (forked) operations
Benefits of bgIteration/Threadsave:
- Far better memory footprint. Very little additional memory is required during the save operations. During full sync and slot-migration, replication is sent inline, eliminating client-output-buffer (COB) growth. Forked operations can incur >2x memory impacts - up to 2x for CoW memory, plus additional memory to maintain replication COB. This is the primary benefit of forkless save/sync/migration.
- Elimination of latency due to the forking operation. “fork” can take ~100-300ms in some cases.
- Background/forkless operations avoid the use of the COB. This results in one less memory copy for serialized data. In rare cases, Threadsave can be faster than BGSAVE because of this (especially with larger strings).
Benefits of forked save/sync/slot-migration:
- After the initial forking operation, the forked process has little impact on main thread operations. The main thread will only incur additional overhead for page copies. Assuming the forked process is running on a separate processor (and ideally with a separate cache), impact to the Valkey main thread is minimal. By contrast, bgIteration is directly observable on main thread CPU. There is a timer job that regularly feeds the background thread, and this can cause small impacts to latency.
- Forked operations are usually faster end-to-end. Once forked, there is a dedicated process performing the operation. It’s typical for Threadsave to be 20-25% slower than a forked save.
- Forked operations are much simpler. This is better for code maintenance.
- Forked operations are better for dealing with large keys. For example, a 10GB sorted set will cause significant client blocking in the forkless approach as the key will be locked to write operations for an extended period of time.
- Forked operations are better for dealing with large module aux-data.
Recommendation: If there is no memory concern, forked bgsave is usually the better choice. However, if memory is constrained, a forkless option is usually superior. If there are large/complex keys and the application is sensitive to write latencies, forkless may not be a good option. Modules with large aux-data (e.g. Search) may preclude use of forkless operations.
Proposed user experience
- A configuration will be provided to enable the use of forkless operations. This will add 4 bytes (average) to each key/value to maintain the epoch value. This must be configured at startup, and is not modifiable.
- A configuration will be provided to default the bgsave operation to use the forkless save. The standard (synchronous) SAVE operation will not be affected.
- The BGSAVE command will be altered to allow an optional parameter indicating the type of save operation. The client will be able to explicitly request a forked or non-forked save. By default, the type of save will be chosen based on the configuration parameters.
- The metrics supported by bgsave will also be updated by forkless save. In addition, forkless save will provide additional metrics including current/last save type and others as requested/required.
- The output (RDB) will be fully compatible with the existing RDB files. However, internally, the keys may exist in a different order. A single DB is not guaranteed to be contiguous.
Proposed software components to be delivered
- FIFO - a time/space efficient FIFO queue, used by bgIteration but suitable for general use
- mutexqueue - a mutex wrapper around FIFO, used by bgIteration, but suitable for general use
- objectGetMetadata - a function and related code to support per-key metadata. This will support the 4-byte epoch on each key (but can be extended for other uses).
- bgIteration - the core logic for performing forkless iterations
- Threadsave - a bgIteration client which performs SAVE (to disk) and full-sync (to socket)
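As an illustration of how the per-key epoch rides on the metadata helper (hypothetical signatures; the real ones arrive with the PR):

```c
#include <stdint.h>

/* Illustration only: reading and bumping the 4-byte epoch through the
 * proposed objectGetMetadata helper. */
typedef struct robj robj;          /* Valkey object type, opaque here */
void *objectGetMetadata(robj *o);  /* hypothetical per-key metadata accessor */

static inline uint32_t objectEpoch(robj *o) {
    uint32_t *e = objectGetMetadata(o);
    return e ? *e : 0;
}

static inline void objectTouchEpoch(robj *o, uint32_t now) {
    uint32_t *e = objectGetMetadata(o);
    if (e) *e = now;               /* updated at key creation and on every modification */
}
```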
FIFO, mutexqueue, & bgIteration have unit tests. The unit tests for bgIteration are extensive (and necessary). These tests are currently written using gtest/gmock. While I recommend that we incorporate gtest/gmock as an option for unit testing (https://github.com/valkey-io/valkey/issues/2492) these tests could be redesigned to more closely match Valkey unit tests. Without gmock, this would require custom design, in bgIteration, to create a mocking framework to support the tests.
Threadsave requires integration testing. These are currently designed in python, and would have to be redesigned in TCL to match the existing Valkey pattern.
@soloestoy @enjoy-binbin I would like to review the previous comment in the weekly meeting, so would appreciate if you could review it and post any comments here.
I am good with the opt-in way. We haven't implemented forkless ourselves, and I also don't have much experience in this area, so I can't comment too much at this point. But looking at the comment, everything looks solid to me. I'm on board with the proposal; that would be great, and people can enable it for memory-constrained workloads.
We once experimented with a forkless approach, which was quite different from these. We directly implemented an MVCC modification to the data structures. However, the MVCC solution turned out to be too complex and nearly infeasible to implement within modules, so we ultimately abandoned it.
At the time, the motivation for adopting MVCC was to avoid blocking clients during key replication or persistence. Our primary concern was that, compared to the drawbacks of traditional forking—especially in terms of latency—a fork causes only one predictable, deterministic latency spike at execution time, whereas forkless might introduce multiple unpredictable latency spikes. From what I understand about user requirements, multiple uncertain latency spikes—potentially long-lasting—are far more disruptive and less acceptable to business operations than a single, bounded delay.
We previously tested AWS's forkless solution. Thanks to @JimB123 for the detailed explanation of the implementation—it aligns closely with our earlier assumptions, using a background thread to iterate all the key-values while blocking some conflict write operations.
To be honest, I still have some concerns about this approach. It works reasonably well in read-heavy scenarios, but could have a significant impact under heavy write loads. We conducted tests on write-heavy workloads and closely monitored instance performance at the time. Frankly, when there were many concurrent writes, the instance was nearly unusable—write throughput dropped to only single-digit operations per second.
I really appreciate @JimB123 's summary in the section "Comparison with standard (forked) operations." Forkless is not perfect—it has many drawbacks as well. Both fork and forkless approaches have their own trade-offs. In my view, the downsides of forking are acceptable, especially given recent advances in mitigating latency spikes, such as the async-fork technique described in this paper: https://www.vldb.org/pvldb/vol16/p1033-chen.pdf. Therefore, to me, forkless does not seem like a high priority.
https://www.vldb.org/pvldb/vol16/p1033-chen.pdf
Are the improvements discussed in this paper in any Linux kernel?
We have write-heavy workloads and for us, the main problem with the fork approach is the CoW. We're running valkey in kubernetes pods and we need to provision 2X the maxmemory, just because CoW can be up to 2X, especially with a write-heavy workload. If a kubernetes pod runs out of memory, the OOM-killer will kill it, which is very bad. We don't have huge keys and we're not using the search module, so the solution appears to work for us. A concern is about the added complexity.
The latency of the fork syscall (some hundred ms) is acceptable for us. The service interruption we see during a failover is much worse (multiple seconds) and our applications need to tolerate that too. But I'm also curious how it is going with getting async-fork into the upstream Linux kernel. The paper is interesting.
In our fork (Tencent Cloud), we do. Some new machines (running the new Tlinux (Tencent Linux) system) are using async-fork, and we plan to use async-fork for all instances in the future (we now have 50,000+ instances using async-fork).
@soloestoy I'm taking your comment to mean you would be "OK" with implementing this if we did, but you just don't think it's all that important.
In the weekly meeting, we decided the next step was to move JimB's proposal to its own issue, and folks will do a deeper review. Concurrently @JimB123 will create a PR for review.
Concurrently @JimB123 will create a PR for review.
That's great. Before this PR, I have a suggestion (also a requirement, haha): we need to ensure observability of forkless operations, including statistics on client blocking time, memory expansion, and so on.
I'd say some kind of centralized control flow among full_sync/move/defrag etc. makes sense, so we can prioritize slots and not overwhelm the service by copying too many slots in parallel.
Issue to be addressed by #2878
This issue to be closed.
(per core team discussion on 10/13/2025)