
[NEW] Data Tiering - Valkey as a general data store, not just an in-memory data store

Open selemuse opened this issue 1 year ago • 18 comments

Now that Valkey is beginning to set its own course, I wonder if Valkey can be enhanced to work as a general data store, rather than just an in-memory data store? It would widen the use cases for Valkey and remove the memory constraint.

selemuse avatar Mar 29 '24 10:03 selemuse

I'm not sure if my vote counts, but why, may I ask? Valkey/Redis is good at what it does, and we have plenty of battle-tested KV stores out there (e.g., HBase) that have stood the test of time. However, if the goal of this is to improve snapshotting to be "less visible" and more streamlined like in real databases, I'm for it :)

sanel avatar Mar 29 '24 14:03 sanel

If you truly want One System to Rule them All, just learn ClickHouse and be at peace forever. ClickHouse has turned into basically what I wish Redis would have been if it had proper management.

If anything, this project should try to have a narrower focus before trying to grow again. Feature deprecation would help more than trying to add more complexity. Every new feature will have fewer users and take more time to implement and maintain.

"but what about backwards compatibility!" some people yell. What about it? Versions don't stop existing. Just run old versions if you need old versions for legacy systems. You'd be surprised how much industrial equipment still runs on Windows 3.1 out there.

I think what people actually want is a core "lightweight + high performance data management platform" they can extend for multiple use cases, but the current architecture mixes all concerns together, and the unlimited mixing of concerns has become unmanageable. Every part of the system (networking, server, data structures, protocols, storage, replication, clustering) is mixed together such that you can't do a major refactor of one component without touching almost all the other components too (which then breaks compatibility, so nobody does it).

One nice thing about a non-profit foundation model is there's no push to constantly "grow" or "expand" or "capture market share." The project can just grind in the dark to be the best.

[edit: also, a random idea after writing this: would it make sense to organize a new responsibility structure with a single per-feature "leader" responsible for directing each logical component (networking, server, data structures, protocols, storage, replication, clustering, etc.)? The project has always had a mainly "everybody knows everything" organization, which is great for micromanaging authority, but not the best for long-term detailed feature growth in different areas or for running concurrent development cycles over the long term.]

mattsta avatar Mar 29 '24 19:03 mattsta

I might misunderstand the idea of the issue's author, but what I understand by 'not just in-memory' is the ability to avoid data loss on eviction in critical situations when the real memory usage is higher than expected. I don't think that would be beyond Valkey's scope ;)

But this is just one example that is important to me. I'd point to three more general-purpose on-disk features from the world around Valkey:

  • https://docs.keydb.dev/docs/flash - KeyDB on Flash Storage
  • https://docs.redis.com/6.4/rs/databases/redis-on-flash/ - Redis on Flash, a feature of Redis Enterprise in 6.x
  • https://docs.redis.com/latest/rs/databases/auto-tiering/ - Auto Tiering, a feature of Redis Enterprise in 7.x

I'd be happy to see a similar feature (or even just the eviction alternative) in the successor of Redis OSS. But this probably requires more talks and some support for development :)

kam193 avatar Mar 30 '24 20:03 kam193

Yeah, it's one of those "would be fun/neat/interesting" ideas that is technically fun to build over time, but it's worth considering whether other projects already do this better, and weighing the effort to build something new against who would actually use it.

Such systems have been tried in different ways in the past (keep every key in memory, but note which are live vs. paged out to disk; or, if a memory lookup fails, check the disk index before returning a failure; etc.), and it also depends on how much you want to optimize for storage systems like the specific flash products you mentioned above.

Another amusing part is hardware risk. This also seemed like a fun project to explore when Optane was really taking off 8-10 years ago, but now Optane is EOL so all that work around programming for a specific single-vendor "persistent-RAM-reboot-safe" storage modality is just wasted. yay tech industry.

mattsta avatar Mar 30 '24 21:03 mattsta

@soloestoy you mentioned there are some services in China that do this as a budget alternative, because disk is cheaper than ram. How do these work? Just curious.

I generally agree with Matt that what we need is not more features. If we do less, we can do it better.

zuiderkwast avatar Mar 30 '24 22:03 zuiderkwast

I see a balance to achieve here between sound engineering and user requirements.

To use the OS analogy, I think we need something similar to the microkernel architecture, i.e., we need new features but we don't need to build all of them into the "kernel". I would imagine that we need a core engine that owns the infrastructure such as networking, replication, (multi)threading, and core data structures (strings/lists/sets/hashes/...). This bare-minimum system would cater to all standalone caching use cases. Features like cluster support, scripting (via Lua or other languages), etc., should be built as part of this project but as modules only. This also includes the data tiering feature that @soloestoy explained before. By moving these features into their own modules, we would significantly reduce the coupling between the core engine and the modules, and among the modules themselves, hence speeding up innovation in both the core engine and the new features (such as "data tiering" support).

PingXie avatar Mar 31 '24 23:03 PingXie

Disks can offer larger capacity and lower cost, but disk-based storage and memory-based storage represent two very different development directions and research areas. Indeed, in China there are many disk storage products compatible with the Redis protocol. They all use RocksDB as the disk storage engine, with an encoding layer that maps complex data structures onto pure key-value pairs in RocksDB, for example:

[image: diagram of the encoding layer mapping complex data structures onto RocksDB key-value pairs]
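As an illustration of such an encoding layer, here is a minimal sketch with a made-up prefix scheme (real products use more elaborate encodings, with per-key metadata and versioning):

  #include <stddef.h>
  #include <stdio.h>

  /* Hypothetical encoding: map one field of a hash onto a flat RocksDB
   * key. "h" tags the type; "|" is a stand-in separator (real encoders
   * use length prefixes to avoid separator ambiguity). */
  static int encode_hash_field_key(char *out, size_t outlen,
                                   const char *userkey, const char *field) {
      return snprintf(out, outlen, "h|%s|%s", userkey, field);
  }

  int main(void) {
      char buf[128];
      /* HSET user:1 name alice  =>  put("h|user:1|name", "alice") */
      encode_hash_field_key(buf, sizeof(buf), "user:1", "name");
      printf("%s\n", buf); /* h|user:1|name */
      return 0;
  }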

And to enhance the efficiency of disk access, multi-threading is used to access the disk, which also introduces concurrency-control design work. Overall, this is a complex engineering task. Currently, using disk storage is not our top priority, I think.

I also want to share some of my views on disk-based storage. Many people think that Redis/Valkey uses memory to store data, and since memory is volatile, it can lead to data loss and cannot guarantee data reliability. Only disk storage can ensure data reliability. I do not fully agree with this view.

First of all, although Valkey uses memory for storage, it also supports persistence. For example, when appendonly (AOF) is enabled, write commands are appended to the log, and even if the process crashes abnormally, data can be recovered from the AOF file. If there is a high demand for persistence, setting appendfsync to always can ensure that every write command is "immediately" flushed to disk. This is not too different from the Write-Ahead Logging (WAL) mechanism of traditional disk-based databases.
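(For reference, that durable configuration is just two directives in valkey.conf:)

  appendonly yes        # enable the append-only file
  appendfsync always    # fsync after every write command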

Furthermore, the above methods, whether for memory-backed Valkey or for traditional databases stored on disk, can only ensure the reliability of data on a single machine. If the machine crashes or the disk is damaged, data recovery is impossible. I believe data reliability relies more on replicas: storing data across multiple replicas avoids data loss due to a single point of failure.

However, data replication between primary and secondary replicas in a multi-replica setup is a serious topic. Currently, because we use an asynchronous replication mechanism, we cannot fully guarantee data consistency between primary and secondary replicas. There are data discrepancies between them, which may lead to the loss of data not yet replicated to the secondary when the primary crashes. Addressing the consistency issues between primary and secondary replicas is a challenge, but I believe it is a problem and a direction we should focus on solving in Valkey in the future.

soloestoy avatar Apr 01 '24 03:04 soloestoy

@soloestoy I think you touched upon a few points that resonate with me really well. I see two high level requirements in this topic:

  1. Cost efficiency
     a. I too see value in "data tiering", and RocksDB is a great/common engine option, but I am sure there are other options too.
     b. My understanding of the "data tiering" value comes mostly from the Redis ecosystem (such as the clients, tooling, and experience/expertise) and the cost benefits. It is a cost play at the end of the day.
     c. AOF also doesn't help with the "cost" ask. You still need to hold all of your data in RAM.

  2. Data durability
     a. AOF does not provide true durability in an HA environment, even if the always policy is used.
     b. In the non-HA case, the AOF user sacrifices "availability" for "durability", also not an ideal situation.
     c. The lack of synchronous replication is indeed the first hurdle IMO that needs to be overcome in order to achieve RPO=0-like true durability. This is quite a departure from where Redis started, philosophically speaking, but IMO it can be introduced as an opt-in mode.

BTW, the multi-threading support to improve disk access efficiency should be a separate concern from the core engine IMO. I can also see that solutions like RocksDB could help in both cases (though not completely), but I feel it is helpful to look at the problems on their own first.

PingXie avatar Apr 01 '24 04:04 PingXie

Looking at the docs, I find that there was an early "virtual memory" feature, which was deprecated in Redis 2.6 and later removed. There's still a document about it in our doc repo: https://github.com/valkey-io/valkey-doc/blob/main/topics/internals-vm.md

zuiderkwast avatar Apr 29 '24 12:04 zuiderkwast

Reposting a previous comment from: https://github.com/valkey-io/valkey/issues/553

Yeah, naively mapping disk to memory doesn't work very well, since you see huge latency spikes when you have a virtual memory miss and need to fetch the memory page from disk. You could theoretically hide that if we were multi-threaded, since other threads would continue to get scheduled, but our single-threaded architecture gets hurt too much by it.

A virtual-memory-like approach could work, though, if we built it ourselves in userland. We could make a pretty minor change to the main hash table to indicate whether a key is "in-memory" vs "on-disk". Before executing a command, we check whether its keys are in memory, and if they are, we execute the command normally. If a key is on disk, we can do one of two things:

Implement logic to go execute the command for an on-disk operation. I think this is similar to what Zhao mentioned with rocksdb in other threads. We fetch the data into memory, and once it's there we execute the command as normal. We would need a way to spill items to disk as well.
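A rough sketch of that check-before-execute flow (all names here are made up for illustration, not actual Valkey internals):

  #include <stdbool.h>
  #include <stdio.h>

  /* Hypothetical per-key location flag on the main hash table entry. */
  typedef struct tieredEntry {
      const char *key;
      bool on_disk; /* set when the value has been spilled to disk */
  } tieredEntry;

  /* Returns true if the command can execute now; otherwise a background
   * disk load would be scheduled and the command retried once loaded. */
  static bool key_ready(tieredEntry *e) {
      if (!e->on_disk) return true; /* value resident: execute normally */
      printf("scheduling disk load for %s\n", e->key);
      return false;                 /* park the client, retry later */
  }

  int main(void) {
      tieredEntry hot = {"session:42", false};
      tieredEntry cold = {"archive:7", true};
      printf("%s ready: %d\n", hot.key, key_ready(&hot));
      printf("%s ready: %d\n", cold.key, key_ready(&cold));
      return 0;
  }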

madolson avatar May 28 '24 17:05 madolson

Reposting another comment from #553.

With the "on-disk" flag per key, the key's name still consumes memory. I have another idea: We use a probabilistic filter for on-disk keys. If the key is not found in memory (main hash table) and the feature is enabled, then we check the probabilistic filter. If we have a match, we go and fetch the key from disk. This can allow a larger number of small keys on disk that what we even want to store metadata for in memory.

We can use new maxmemory policies for this. Instead of evicting, we move a key to disk.

If we implement some module API for these actions (evict hook, load missing key hook), then the glue to rocksdb or another storage backend can be made pluggable.
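A toy sketch of that lookup path, with a Bloom filter guarding the disk check (sizes and hash scheme are arbitrary, for illustration only):

  #include <stdint.h>
  #include <stdio.h>

  #define BLOOM_BITS 8192
  #define BLOOM_HASHES 3

  static uint8_t bloom[BLOOM_BITS / 8];

  /* Seeded FNV-1a, giving a cheap family of hash functions. */
  static uint32_t bloom_hash(const char *key, uint32_t seed) {
      uint32_t h = 2166136261u ^ seed;
      for (; *key; key++) h = (h ^ (uint8_t)*key) * 16777619u;
      return h % BLOOM_BITS;
  }

  /* Called when a key is moved to disk instead of being evicted. */
  static void bloom_add(const char *key) {
      for (uint32_t s = 0; s < BLOOM_HASHES; s++) {
          uint32_t b = bloom_hash(key, s);
          bloom[b / 8] |= (uint8_t)(1u << (b % 8));
      }
  }

  /* 0: definitely not on disk (skip the disk lookup entirely);
   * 1: possibly on disk (go ask the storage engine). */
  static int bloom_maybe_on_disk(const char *key) {
      for (uint32_t s = 0; s < BLOOM_HASHES; s++) {
          uint32_t b = bloom_hash(key, s);
          if (!(bloom[b / 8] & (1u << (b % 8)))) return 0;
      }
      return 1;
  }

  int main(void) {
      bloom_add("user:1001"); /* key spilled to disk */
      printf("user:1001 -> %d\n", bloom_maybe_on_disk("user:1001")); /* 1 */
      printf("user:9999 -> %d\n", bloom_maybe_on_disk("user:9999")); /* very likely 0 */
      return 0;
  }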

zuiderkwast avatar May 28 '24 18:05 zuiderkwast

Memory is expensive; disk is cheap. I still remember one impressive claim about Redis: that Redis is the fastest database in the world. With a feature that stores data on disk, Valkey access speed will undoubtedly decrease.

In Valkey, we have other high-priority features to enhance, including how to achieve better data consistency among nodes (standalone mode and cluster mode), a better cluster architecture, better HA, etc.

Also, if we put this code in the core, developers would need to take more time to maintain it. So I think now is not a good time to touch this area, unless it works as a separate mode.

hwware avatar May 28 '24 19:05 hwware

Reposting another comment from #553.

With the "on-disk" flag per key, the key's name still consumes memory. I have another idea: We use a probabilistic filter for on-disk keys. If the key is not found in memory (main hash table) and the feature is enabled, then we check the probabilistic filter. If we have a match, we go and fetch the key from disk. This can allow a larger number of small keys on disk that what we even want to store metadata for in memory.

We can use new maxmemory policies for this. Instead of evicting, we move a key to disk.

If we implement some module API for these actions (evict hook, load missing key hook), then the glue to rocksdb or another storage backend can be made pluggable.

You can play with KeyDB, compiling it with make ENABLE_FLASH=yes; then data will be stored on disk.

hwware avatar May 28 '24 19:05 hwware

Memory is expensive; disk is cheap. I still remember one impressive claim about Redis: that Redis is the fastest database in the world. With a feature that stores data on disk, Valkey access speed will undoubtedly decrease.

In Valkey, we have other high-priority features to enhance, including how to achieve better data consistency among nodes (standalone mode and cluster mode), a better cluster architecture, better HA, etc.

Also, if we put this code in the core, developers would need to take more time to maintain it. So I think now is not a good time to touch this area, unless it works as a separate mode.

The way I look at it is that there should ideally be a knob that allows users to trade off between cost and performance (and consistency too, at some point). The Redis/Valkey ecosystem (think of the clients/tooling/etc.) makes it attractive for (some) users to consolidate their workloads on Valkey. Not everyone needs sub-millisecond latency all the time, but knowing that there is an option to get sub-millisecond latency if needed, without switching to a different storage backend, is very appealing IMO. This would also reduce the need for, and the complexity of, running and maintaining two systems (a DB and a caching system).

It is indeed a departure from the project's caching roots, but there seems to be a significant amount of user interest in this area. So I think it at least warrants some deep-dive research. I am hopeful that data tiering can be introduced in a relatively clean way.

PingXie avatar May 28 '24 19:05 PingXie

I also took a lot of inspiration from this paper, https://www.vldb.org/pvldb/vol6/p1942-debrabant.pdf, which sort of outlines this structure where "RAM" is the primary medium of storage, and we offload some data to disk as needed.

It is indeed a departure from the project's caching roots, but there seems to be a significant amount of user interest in this area.

I honestly think this type of disk-based storage is more aligned with caching than many of the other beyond-caching workloads Redis came to be associated with. Caching is just a cost-optimization game. If your working set is small, you're usually bottlenecked on network/CPU; if your working set is massive, the high premium for RAM will eat into your costs. If you can serve 80% of requests from an in-memory cache and 20% from a disk-based cache, you're probably still coming out ahead.
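To put rough, made-up numbers on it: if RAM costs about 10x per GB what NVMe flash does, then keeping a 20% hot set in RAM and 80% on flash prices the storage at roughly 0.2 x 10 + 0.8 x 1 = 2.8 cost units versus 10 for all-RAM, around a 3.5x saving, before accounting for the slower cold reads.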

I am hopeful that data-tiering can be introduced in a relatively clean way.

Me too. We aren't going to make tradeoffs that hurt users. It's a great area to explore.

madolson avatar May 28 '24 21:05 madolson

I'd like to share some performance numbers from 4K IO size testing:

  • HDD: 100~200 IOPS, 10 ms latency
  • SSD: 50K IOPS
  • NVMe: read 500K IOPS at 80 µs latency; write 100K IOPS at 20 µs latency. (The latest products reach read 800K IOPS (4000X~8000X of HDD) at 60 µs latency and write 200K IOPS (1000X~2000X of HDD) at 10 µs latency.)

From my point of view, based on modern backend storage, this approach may show better performance than the earlier tests suggested.

The memory (RAM) latency of modern Intel CPUs is about 80 ns, and the AMD Zen series has about 120 ns memory latency (because of its multi-die microarchitecture). There is still a huge gap between RAM and NVMe (80 µs vs. 80 ns is a factor of roughly 1000), so simply mapping disk to memory is still not a good choice.

So I agree with @PingXie: if Valkey provides more features to modules and runs as a microkernel-like engine, then it's possible to build any high-performance storage engine on top of modern storage. (RocksDB may be an option, but I guess it would not be the best one on NVMe.)

pizhenwei avatar Jul 09 '24 01:07 pizhenwei

another source of possible inspiration is https://github.com/apache/kvrocks

raphaelauv avatar Aug 26 '24 09:08 raphaelauv

Upvote on this! As said before: disk is dirt cheap, especially if you have a big EC2 fleet with thousands of free NVMe disks. Kvrocks is a great project.

ltagliamonte avatar Oct 11 '24 01:10 ltagliamonte

This would be an incredibly attractive feature. We primarily use Redis for two use cases: caching and queued jobs.

For pure caching, we would only want an in memory data store for the performance, and the eviction policies can evict old / stale data that is no longer needed.

However, queued jobs are data that we cannot afford to lose. So if we have an unexpected spike of jobs, the instance can fill up and start rejecting new jobs.

In this case, sacrificing performance by letting us attach a cheap disk to extend the storage capacity would be amazing.

robertpcontreras-ts avatar Oct 24 '24 09:10 robertpcontreras-ts

I have a question on this: are there any arguments against introducing an "eviction-policy" option here, with values such as "drop" and "rocksdb", in order to reuse all 7 existing maxmemory policies that involve eviction, instead of adding yet another separate maxmemory policy (e.g. "allkeys-to-disk", or something like that)?

That seems more attractive, because in the worst case another 7 maxmemory policies would otherwise have to be added if all the eviction-involving ones are to be mirrored.

@zuiderkwast @soloestoy @PingXie @madolson

kronwerk avatar Oct 25 '24 14:10 kronwerk

I have a question on this: are there any arguments against introducing an "eviction-policy" option here, with values such as "drop" and "rocksdb", in order to reuse all 7 existing maxmemory policies that involve eviction, instead of adding yet another separate maxmemory policy (e.g. "allkeys-to-disk", or something like that)?

@kronwerk Adding 7 new maxmemory policies seems bad, I agree.

But actually I think some of those combinations don't make sense. To move volatile keys with a very short TTL to disk seems like a very strange idea. How many combinations actually make sense?

We want to keep in memory the keys we predict will be accessed again soon, and move to disk the keys that we predict will not be accessed for a long time. I can only see allkeys-lru (and possibly allkeys-lfu) making sense, actually.

zuiderkwast avatar Oct 25 '24 19:10 zuiderkwast

@zuiderkwast some ideas about the implementation are below, please take a look. In this scheme we can allow all 7 options; the user is allowed to choose any behaviour.

Settings

  1. The suggestion is to integrate eviction to disk into the existing mechanism, i.e. somewhere here: https://github.com/valkey-io/valkey/blob/unstable/src/db.c#L99. If done like that, we need a new parameter, e.g. disk-evict-type, with values "" [default] / rocksdb (requires a build with the key USE_ROCKSDB).
  2. max-disk-evict-ttl - maximum seconds a k/v pair may live on disk; the idea is that while the TTL is large the cost/benefit of placing it on disk is high, while after some time that becomes useless; moreover, letting the user change this setting gives some flexibility for different application domains; >= 0, 0 = ignore limit, default = 300 (?).
  3. min-disk-evict-ttl - minimum seconds a k/v pair may live on disk; >= 0, 0 = ignore limit, default = 30 (?).
  4. min-value-size-for-disk-evict - minimal size of a k/v pair for placing on disk; e.g. "set k v" gives "memory usage 56" in bytes, which doesn't sound like a good candidate for ever placing on disk, as we have overhead for storing keys; >= some default (not sure about the exact value yet).
  5. disk-evict-cache - size of the usable space for k/v pairs on disk (a maxmemory analogue). A combined example is sketched right after this list.
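Taken together, the proposed settings might look like this in valkey.conf (all hypothetical; none of these parameters exist today, and the values are placeholders):

  disk-evict-type rocksdb            # "" (default) disables eviction to disk
  max-disk-evict-ttl 300             # seconds; 0 = ignore limit
  min-disk-evict-ttl 30              # seconds; 0 = ignore limit
  min-value-size-for-disk-evict 512  # bytes; placeholder value
  disk-evict-cache 10gb              # maxmemory analogue for on-disk space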

Functions

  6. isDiskEvictMakesSense - evaluate whether a given k/v pair looks feasible for disk placement:

  • if k/v pair <= min-value-size-for-disk-evict, do nothing;
  • if redisObject.restoredFromDisk for k is true, we don't want to place it back again.

Structs

  7. EvictedValue - a redisObject of a specific type containing TTL info.

Changes to existing functions and structs

  8. Add restoredFromDisk (bool) to redisObject (for all data).
  9. Existing evict mechanism: if disk-evict-type != "" && isDiskEvictMakesSense(k):

  • if the evict policy considers expire/TTL, i.e. volatile-xxx: we proceed if (min-disk-evict-ttl <= ttl) && (max-disk-evict-ttl == 0 || ttl <= max-disk-evict-ttl).
  • for other policies, i.e. allkeys-xxx: ttl = max-disk-evict-ttl if max-disk-evict-ttl > 0 else -1; for -1 it won't be deleted, and we assume the user knows why and is going to delete it from the app directly.
  • run ExternalStorage::put(k, v, ttl); on error we don't remove the value from memory.
  • we don't delete the key, we just replace the existing redisObject with EvictedValue(ttl).
  10. On edit/delete: if the value is an EvictedValue, run ExternalStorage.drop(k).
  11. On read:
  • if the value is an EvictedValue, run ExternalStorage.get(k), store it in memory, and return the result of get(k) from memory (to touch all the existing machinery around LFU etc.).
  • if ExternalStorage.get(k) gives null (for any reason), remove k from memory; the client receives none.
  12. Add to serverCron, using run_with_period(1000), a disk cleanup via ExternalStorage::evict(s, c), where s = null, c = 1000 (?), receiving (cur, i, err):
  • if i == 1000, the interval was traversed in full; c could be raised (2000?).
  • if i > 0, next time run with s = cur (we need to handle errors here specifically, no ideas exactly how right now).
  • if i == 0 and err == 0, next time run with s = null (restart from the beginning).
  • if i == 0 and err != 0, retry 3 (?) times, then restart with s = null.

Modules

  13. EvictionToExternalStorage - an abstraction layer to make adding new storages smoother later.
  13.1 ExternalStorage interface:

  • put(k, v, ttl) -> errCode [write].
  • get(k) -> (uint8_t[], errCode) [read].
  • evict(s, c) -> (char[], uint8_t, errCode) [cleanup].
  • drop(k) -> errCode [delete].

  13.2 Storage data object DictEntry:
  • v: uint8_t[] (somehow packed value).
  • timestamp: time_t (unix ts).

  13.3 The module is added to the build with the key USE_ROCKSDB.
  13.4 The choice of implementation depends on the value of disk-evict-type: if != "", load this module; if == rocksdb, use the RocksDBStorage implementation.
  14. RocksDBStorage (interacts with the DB through a linked library, not as a standalone server):
  14.1 ExternalStorage implementation:

  • put(k, v, ttl): write for k the value [v, timestamp = time() + ttl if ttl >= 0 else 0]: -- return 0 on success / 1 on error (needs some distinct error codes later, not sure which exactly now).

  • get(k): get the value from disk: -- return (v, 0) if it exists. -- return (null, 0) if there is no key, or if timestamp > 0 && time() > timestamp [in this case we run drop(k)] (needs some distinct error codes later, not sure which exactly now).

  • drop(k): remove the data from disk: -- return 0 on success / 1 on error (needs some distinct error codes later, not sure which exactly now).

  • evict(s, c): traverse c keys starting from s (RocksDB stores keys sorted, so it's better to move sequentially, not randomly); if [doubtfully] there is going to be common code with get(k), make sure the cleanup traversal doesn't invoke an unconditional drop for every key: -- if s == null, it->SeekToFirst(); -- we can do a DeleteRange over the accumulated interval as an optimization; -- run get(k) from Valkey here (!) for every key, and if it returns none, delete the key too; i.e. something like this:

  rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
  int i = 0;
  for (it->Seek(start); it->Valid() && i < c; it->Next()) {
    i++;
    // decodeTimestamp() and valkeyHasKey() are placeholders for unpacking
    // the [v, timestamp] value (13.2) and the get(k)-from-Valkey check.
    time_t ts = decodeTimestamp(it->value());
    if ((ts > 0 && time(nullptr) > ts) || !valkeyHasKey(it->key())) {
      db->Delete(rocksdb::WriteOptions(), it->key());
    }
  }
  assert(it->status().ok()); // check for any errors found during the scan
  delete it;

  -- return (cur, i, 0) if i keys were traversed with no errors.
  -- return (cur, i, 1) if i keys were traversed with some error (needs some distinct error codes later, not sure which exactly now).
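For clarity, here is the 13.1 interface expressed as a C vtable (hypothetical signatures; value packing and cursor encoding are left abstract):

  #include <stddef.h>
  #include <stdint.h>
  #include <time.h>

  /* Hypothetical vtable mirroring the ExternalStorage interface in 13.1. */
  typedef struct ExternalStorage {
      /* write: 0 on success, nonzero error code otherwise */
      int (*put)(const char *k, const uint8_t *v, size_t vlen, time_t ttl);
      /* read: on success *v and *vlen are filled (*v == NULL if absent) */
      int (*get)(const char *k, uint8_t **v, size_t *vlen);
      /* cleanup: scan up to c keys from cursor s; reports the next cursor
       * and an error flag through out-parameters, returns keys traversed */
      int (*evict)(const char *s, size_t c, const char **cur, int *err);
      /* delete: 0 on success */
      int (*drop)(const char *k);
  } ExternalStorage;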

kronwerk avatar Oct 28 '24 11:10 kronwerk

The IOPS of today's NVMe SSDs are already very high and will likely improve further in the future.

Here are some references on modern NVMe SSDs:

  • https://www.vldb.org/pvldb/vol16/p2090-haas.pdf
  • https://dl.acm.org/doi/10.1145/3341301.3359628

I am not in favor of using RocksDB. To achieve the highest possible throughput, I believe we should take over disk I/O management directly (e.g., using libaio or io_uring). If the io_uring PR gets merged, things might become simpler.
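As a minimal liburing sketch of that kind of direct I/O path (error handling mostly elided; one 4 KiB read is submitted and reaped through the ring; the file path is a made-up example):

  #include <fcntl.h>
  #include <liburing.h>
  #include <stdio.h>

  int main(void) {
      struct io_uring ring;
      char buf[4096];

      io_uring_queue_init(8, &ring, 0);            /* small submission ring */
      int fd = open("/tmp/tier.dat", O_RDONLY);    /* hypothetical data file */
      if (fd < 0) return 1;

      struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
      io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0); /* 4K read at offset 0 */
      io_uring_submit(&ring);

      struct io_uring_cqe *cqe;
      io_uring_wait_cqe(&ring, &cqe);              /* completion arrives asynchronously */
      printf("read %d bytes\n", cqe->res);
      io_uring_cqe_seen(&ring, cqe);
      io_uring_queue_exit(&ring);
      return 0;
  }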

I think the way Memcached supports this feature is quite good:

  • https://github.com/memcached/memcached/blob/master/extstore.h
  • https://docs.memcached.org/features/flashstorage/

xiaguan avatar Dec 05 '24 14:12 xiaguan

I am not in favor of using RocksDB. To achieve the highest possible throughput, I believe we should take over disk I/O management directly (e.g., using libaio or io_uring). If the io_uring PR gets merged, things might become simpler.

This sounds good to me. We almost merged an io_uring PR, #599. In the end we didn't merge it, but it has some code (such as a config to enable/disable io_uring) that can maybe be reused.

zuiderkwast avatar Dec 06 '24 23:12 zuiderkwast

I love the fact that we are making progress toward acknowledging that a disk-based approach is a valid use case. My use case is an ML feature store; we picked kvrocks because we wanted to use ephemeral disks and lower our AWS ElastiCache and crdb bill. Our kvrocks pods bootstrap their pre-sharded data from S3. I just want to share a few pieces of feedback after using kvrocks in prod for a few months:

  • latencies for a pure read workload (my use case) are really good: a bit worse than memory, but better than crdb
  • compactions affect read performance
  • writes via the Redis protocol pollute the caches and trigger compactions; they are not suitable for bulk updates (my use case)
  • having the DB data in multiple files is really important for backup and restore (my use case is backed by S3, so it is really important for me to be able to parallelize uploads and downloads)

Happy to share more here if you have any specific questions.

ltagliamonte avatar Dec 07 '24 01:12 ltagliamonte

I'll chime in to say I've wanted a disk-based (or spill-to-disk) KV store that:

  • Has Redis features like TTLs, Lua, cluster/sharding, etc
  • Can handle large objects (~100 MB)

And have found none to date.

adriangb avatar Dec 07 '24 02:12 adriangb

@adriangb oss dragonfly has redis cluster mode and data tiering -> https://www.dragonflydb.io/docs/managing-dragonfly/tiering

raphaelauv avatar Dec 08 '24 18:12 raphaelauv

@adriangb oss dragonfly has redis cluster mode and data tiering -> https://www.dragonflydb.io/docs/managing-dragonfly/tiering

@raphaelauv Interesting. I just want to point out that there are two problems with this statement:

  1. There is no OSS Dragonfly. Dragonfly is not open source. It's BUSL.
  2. Dragonfly has cluster mode, but it can't do automatic failovers without additional software, so it does not provide high availability out of the box. I found this on https://www.dragonflydb.io/docs/managing-dragonfly/cluster-mode

    There is one important distinction regarding Dragonfly cluster: Dragonfly only provides a data plane (which is the Dragonfly server), but it does not provide a control plane to manage cluster deployments. Node health monitoring, automatic failovers, slots redistribution are out of scope of Dragonfly backend functionality and are provided as part of Dragonfly Cloud service.

zuiderkwast avatar Dec 08 '24 21:12 zuiderkwast

I believe there are some high-level alignments we need to achieve before diving into the detailed implementation of data tiering.

First, from my perspective, "data tiering" is fundamentally a cost-saving strategy. The purpose of integrating non-volatile storage in this context is to reduce costs by offloading cold data to more affordable storage. It is not about "data durability", which is being addressed separately in #1355. This perspective aligns with @soloestoy's view in comment-2029099358 and @madolson's comments in comment-2136145280. I want to ensure this distinction is clear to the broader audience and that we are aligned on this point. In this sense, this approach is fundamentally different from kvrocks.

With this in mind, IMO, the high-level architecture naturally comprises two logical components. The first is the core engine, which manages in-memory data, handles evictions when memory is full, and fetches data back into memory upon cache misses. The second component is a non-volatile storage manager that efficiently handles data storage and retrieval, optimized for NVMe devices. This storage manager should prioritize large sequential I/Os, parallelize access, and minimize write amplification through batching and efficient data layouts.

To make the feature more general, I'd suggest that the eviction mechanism handle entire key-value pairs rather than just values. This ensures tiering works effectively even for cases where values are small. Additionally, incorporating Bloom filters in the engine core could help reduce unnecessary disk lookups and minimize I/O overhead, though this is an optimization detail.

The management of non-volatile storage itself should be delegated to a separate storage engine. This storage engine could be implemented as a module for better separation of concerns or integrated into the core engine as a compile-time option, depending on performance and maintainability needs. However, I believe the specific implementation details of the storage engine should be deferred until we reach consensus on the value proposition and high-level architecture.

To summarize my perspective on data tiering:

  • Value Proposition: Reduce costs by leveraging lower cost storage while maintaining reasonable performance, particularly by optimizing for high-performance NVMe devices.

  • High-Level Architecture: Adopt an "anti-caching" model where non-volatile storage acts as an extension of memory, which is the main storage.

@valkey-io/core-team FYI

PingXie avatar Dec 09 '24 06:12 PingXie

@PingXie I agree with this. I also think we should delegate it to modules to avoid taking the maintenance burden in the core. Let's design a module API and let module authors experiment with it.

The core engine just needs to provide roughly this:

  1. A pre-eviction hook where modules can hook in (or can this already be done using key space notifications?)
  2. A key-miss callback, where modules can load a key when a missing key is looked up. This should be done in the background, so the command (let's say GET) is not executed until the data is loaded. In the meantime, other commands from other clients can be executed.

The rest, including probabilistic filter for which keys are on disk, can be done in a module.
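Sketched as hypothetical module-API declarations (none of these names exist today; they just mirror the two hooks above):

  /* Hypothetical module-API additions for tiering; illustrative only. */
  typedef struct ValkeyModuleCtx ValkeyModuleCtx;
  typedef struct ValkeyModuleString ValkeyModuleString;

  /* 1. Pre-eviction hook: the module may persist the value and return 1
   *    to let the core drop it from memory, or 0 to veto the spill.    */
  typedef int (*ValkeyModuleEvictFunc)(ValkeyModuleCtx *ctx,
                                       ValkeyModuleString *key,
                                       ValkeyModuleString *value);

  /* 2. Key-miss callback: the module loads the key in the background and
   *    signals completion; the client's command stays parked meanwhile,
   *    similar to how blocking commands work today.                     */
  typedef void (*ValkeyModuleKeyMissFunc)(ValkeyModuleCtx *ctx,
                                          ValkeyModuleString *key);

  int ValkeyModule_SetEvictHook(ValkeyModuleCtx *ctx, ValkeyModuleEvictFunc fn);
  int ValkeyModule_SetKeyMissHook(ValkeyModuleCtx *ctx, ValkeyModuleKeyMissFunc fn);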

We need to think about SCAN, RDB dump and AOF rewrite too. Is it possible to iterate over the on-disk keys in scan order without any placeholder key in RAM?

zuiderkwast avatar Dec 09 '24 09:12 zuiderkwast