
[NEW] Multiple DB support in cluster mode

Open madolson opened this issue 1 year ago • 84 comments

When moving from Standalone to Cluster, there are two API changes that end users need to consider: cross-slot commands and moving from multiple DBs to a single database. Although the cross-slot restriction is necessary to make sure Valkey clusters scale, there is no similar requirement for DBs. The decision to only support one database was an optimization, mainly to simplify the key-to-slot mapping.

This feature was not added in Redis since the old core team considered using multiple databases to be an anti-pattern compared to using prefixes. For example, instead of using databases 0 and 1, you could have all keys prefixed with 0:: and 1:: and then build ACLs on top of that.
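For illustration, that prefix-plus-ACL pattern might be expressed with standard ACL key-pattern rules like this (user names and passwords are made up):

```
# Tenant "app0" may only touch keys prefixed with 0::
ACL SETUSER app0 on >app0-secret ~0::* +@all

# Tenant "app1" may only touch keys prefixed with 1::
ACL SETUSER app1 on >app1-secret ~1::* +@all
```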

This use case works, but has some drawbacks. One common workload is loading a fresh dataset into a secondary database, performing a SWAPDB operation, and then asynchronously flushing the old data.
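That workflow might look like the following command sketch (database numbers as in a default config):

```
SELECT 1           # staging database
# ... load the fresh dataset into DB 1 ...
SWAPDB 0 1         # atomically promote the fresh data to DB 0
SELECT 1           # DB 1 now holds the old dataset
FLUSHDB ASYNC      # discard the old data without blocking
```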

We will have some technical difficulty implementing multiple databases now, with the introduction of a dict per slot, since we would have to duplicate all of that structure for each database.

madolson avatar Nov 19 '24 02:11 madolson

Does this issue expect to support multiple databases in cluster mode?

I think this is a valuable feature. In production environments, many customers are accustomed to using the multi-DB feature. When migrating from standalone mode to a clustered setup, they expect the provider to offer this capability.

wuranxx avatar Nov 19 '24 02:11 wuranxx

Does this issue expect to support multiple databases in cluster mode?

Yes, thanks for your comment. I made this as a placeholder to follow up on, so I haven't fully added the details yet.

madolson avatar Nov 19 '24 02:11 madolson

I think this is a valuable feature. In production environments, many customers are accustomed to using the multi-DB feature. When migrating from standalone mode to a clustered setup, they expect the provider to offer this capability.

@wuranxx Do they use it as a multi-tenant setup? I think we should also consider supporting first-party ACL support for multiple DBs.

hpatro avatar Nov 20 '24 04:11 hpatro

the ability to flush or swap DBs

Although not currently implemented, we could use DBs as an abstraction to allow users to control and monitor individual workloads that are isolated from one another but collocated on the same Valkey cluster. I think it is a common use case to have the same cluster hosting data for many workloads/microservices. Maybe databases are an easier avenue to solving this use case than just prefixes?

  • It should be possible for the engine to collect per-DB stats (e.g. number of keys, memory footprint, etc). A command like DBINFO <db_num> could make it easy for users to monitor their server when they have many applications/use cases.
  • We could support certain configurations on a per DB level. E.g. perhaps you could configure maxmemory per DB with eviction to support isolation between workloads? Or maybe you might want to enable certain settings on one DB but not another?
    • In a cluster context, you could lock a DB to be hashed to a single slot (e.g. maybe its DB number/ID). For workloads that require cross-key commands, it could be a way without needing to manage hashtags in the client (logically, it is basically doing the same thing but with some syntactical sugar).

Sidenote: if we want to reverse direction on databases, I would like to float the idea of replacing the set number of DBs and the concept of a DB number (0-15) with a map from DB name to DB so users could create as many DBs as they want and name them how they please. By default, maybe DBs 0-15 are there, but perhaps you could DBCREATE <db_name> for additional DBs.

murphyjacob4 avatar Nov 20 '24 05:11 murphyjacob4

I think this is a valuable feature. In production environments, many customers are accustomed to using the multi-DB feature. When migrating from standalone mode to a clustered setup, they expect the provider to offer this capability.

@wuranxx Do they use it as a multi-tenant setup? I think we should also consider supporting first-party ACL support for multiple DBs.

Since redis/valkey has never implemented ACL control for databases, customers have not raised related requirements.

I believe that adding ACL support for databases would be a much larger requirement, and it’s necessary to reconsider the role of databases within valkey. Since databases have traditionally been regarded as an anti-pattern, there has been relatively little discussion on this topic.

wuranxx avatar Nov 20 '24 08:11 wuranxx

Although not currently implemented, we could use DBs as an abstraction to allow users to control and monitor individual workloads that are isolated from one another but collocated on the same Valkey cluster. I think it is a common use case to have the same cluster hosting data for many workloads/microservices. Maybe databases are an easier avenue to solving this use case than just prefixes?

I was suggesting the use case which @murphyjacob4 has called out. Different types of workloads on a single cluster using multiple DBs. This would warrant separate ACL rules for different workloads.

hpatro avatar Nov 20 '24 22:11 hpatro

+1 on this feature. It would remove one of the few differences between cluster and standalone.

zuiderkwast avatar Nov 20 '24 23:11 zuiderkwast

ACL for DB numbers sounds good too but this is orthogonal to this feature I believe. No dependencies between the two.

zuiderkwast avatar Nov 20 '24 23:11 zuiderkwast

ACL for DB numbers sounds good too but this is orthogonal to this feature I believe. No dependencies between the two.

I thought of bringing it up because the stance had always been that we don't want to support multiple DBs, hence ACL support for multiple DBs didn't need to be built. Let me file a separate issue to discuss it.

hpatro avatar Nov 21 '24 00:11 hpatro

Sidenote: if we want to reverse direction on databases, I would like to float the idea of replacing the set number of DBs and the concept of a DB number (0-15) with a map from DB name to DB so users could create as many DBs as they want and name them how they please. By default, maybe DBs 0-15 are there, but perhaps you could DBCREATE <db_name> for additional DBs.

I long ago had thoughts on this somewhere in Redis, but they are probably lost to time. I really like this idea though. I mostly have been calling them namespaces, but we could call them databases as well (I'm still going to call them namespaces here though). By default everything is placed into the default "0" namespace. There is no explicit "create a namespace", you can simply just call SELECT my_namespace and it will be created. We would add a new ACL for namespaces for which ones you can select into, like $my_namespace. I agree with the idea of having each namespace configurable for stuff like eviction policy, but maybe also defaults like TTL or triggers https://github.com/valkey-io/valkey-rfc/pull/9.

For cluster mode, I think it makes more sense to have namespaces be a clusterwide context instead of having them exist on a single shard. Cluster mode is inherently built to scale; constraining something to just one shard seems like a poor way to scale.

One of the reasons I want to differentiate namespaces, is that I think people already have assumptions about DBs that I don't really want to change.

madolson avatar Nov 21 '24 21:11 madolson

+1 to these features. Adding multiple DB support would make it possible to migrate from standalone to cluster mode. I also like the idea of having DBs as namespaces along with ACL support. This might make it easier for customers to maintain one cluster for different microservices/workloads with ACLs.

roshkhatri avatar Nov 22 '24 21:11 roshkhatri

Feature parity with standalone is my primary motivation at the moment. While the current multi-db support is very limited, it has established itself in the ecosystem. So either we continue to push these users away from the cluster, or we can help them migrate to clusters by enabling multi-db support.

I also like the idea of extending this further to provide better isolation, assuming we can maintain backward compatibility. But that feels like a logical next step. For now, it looks like we are all aligned on enabling multi-database support as-is?

PingXie avatar Dec 18 '24 17:12 PingXie

I am considering an approach to enable multi-database support in Valkey’s cluster mode. The idea is to keep the 16,384 hash slots database-agnostic, meaning that all databases share the same slot space (e.g. slot X covers all databases). Cluster commands would operate within the context of the currently selected database, determined by the SELECT command. Additionally, the MIGRATE command, which already supports a database ID, could be leveraged for database-specific migration within the cluster.

This approach should be largely backward compatible.

Any thoughts?

xbasel avatar Jan 19 '25 21:01 xbasel

I think we should not mix multi-tenancy support with multiple databases. Multi-tenancy support would require both extending ACL functionality and the ability to maintain QoS between tenants (over CPU, network, memory, etc.). This might be a complicated task if we decide to do it well, for example maintaining fairness over activedefrag, evictions, expirations, and many other cron jobs.
Regarding the notion of using databases to provide better multi-tenancy support, I must say I do not see it as a good design. I absolutely agree ACLs should be able to limit user access to some DBs, and I also think @madolson's suggestion to support either namespaces or usergroups (which are more like cgroups in Linux) is a good direction. For example, I do think a specific tenant should be allowed to operate multiple databases but still be limited by per-namespace/usergroup overall memory limits.

I suggest we limit the scope of this implementation to naive multi-database support in cluster mode and tackle the multi-tenancy support as a separate issue.

ranshid avatar Jan 20 '25 08:01 ranshid

I am considering an approach to enable multi-database support in Valkey’s cluster mode. The idea is to keep the 16,384 hash slots database-agnostic, meaning that all databases share the same slot space (e.g. slot X covers all databases). Cluster commands would operate within the context of the currently selected database, determined by the SELECT command. Additionally, the MIGRATE command, which already supports a database ID, could be leveraged for database-specific migration within the cluster.

This approach should be largely backward compatible.

Any thoughts?

The approach presented above keeps the current Valkey architecture the same:

+------------------------------------+
|              DB Array              |
+------------------------------------+
| [0]    [1]    [2]   ...  [dbnum-1] |
+--|------|------|------------|------+
   v      v      v            v
+-----+ +-----+ +-----+    +-------+
| DB0 | | DB1 | | DB2 | ...| DBN-1 |
+-----+ +-----+ +-----+    +-------+
   |      |      |            |
   v      v      v            v
+-------+ +-------+ +-------+    +-------+
| Slots | | Slots | | Slots | ...| Slots |
| Array | | Array | | Array |    | Array |
+-------+ +-------+ +-------+    +-------+
| [0..n]| | [0..n]| | [0..n]|    | [0..n]|
+-------+ +-------+ +-------+    +-------+
   keys     keys     keys          keys

With a large number of databases, we might need to implement lazy initialization instead of initializing the data structures at server init.
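Application-level lazy initialization could look roughly like this (illustrative Python; the name LazyDbArray and the per-slot-dict layout are made up for the sketch, not Valkey's actual C structures):

```python
class LazyDbArray:
    """Sketch: allocate a database's per-slot tables on first access
    instead of pre-allocating all of them at server init."""

    def __init__(self, dbnum: int, num_slots: int = 16384):
        self.dbnum = dbnum
        self.num_slots = num_slots
        self._dbs = {}  # db index -> list of per-slot dicts, created on demand

    def db(self, index: int):
        if not 0 <= index < self.dbnum:
            raise IndexError(f"DB index out of range: {index}")
        if index not in self._dbs:
            # Pay the 16k-slot allocation only when a DB is first used.
            self._dbs[index] = [dict() for _ in range(self.num_slots)]
        return self._dbs[index]

    @property
    def allocated(self) -> int:
        """Number of databases that have actually been materialized."""
        return len(self._dbs)
```

With this shape, configuring a huge dbnum costs nothing until a client actually SELECTs and writes to a database.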

xbasel avatar Jan 20 '25 10:01 xbasel

@xbasel I think your approach sounds right. As you say, we probably need to lazy init the DBs.

@ranshid I agree multi-DB will never be true multi-tenancy. It can just provide something basic for cooperating tenants that trust each other. Just having FLUSHDB per DB is great when two related applications share the same nodes and one of them is doing a hard restart. We already have this use case internally.

We could incrementally add things like ACL, fairness, metrics and more in the future, as separate features, but IMHO a cloud provider should never use it to let different customers share the same Valkey nodes.

zuiderkwast avatar Jan 20 '25 10:01 zuiderkwast

@zuiderkwast I’m wondering if lazy loading databases is truly necessary. The maximum value of databases is INT_MAX, but it seems impractical to start Valkey with even a fraction of that. For instance, with databases set to 1 million (no cluster mode), memory consumption already reaches ~6GB on an empty database.

Additionally, I observed high CPU usage when the engine wasn’t processing any commands. Profiling revealed the following hot spots:

    97.88%     0.00%  valkey-server  valkey-server     [.] _start
            |
            ---_start
               __libc_start_main_impl (inlined)
               __libc_start_call_main
               main
               aeMain
               aeProcessEvents
               processTimeEvents
               serverCron
               |
                --97.83%--databasesCron
                          |
                          |--95.64%--activeExpireCycle
                          |          |
                          |           --76.92%--kvstoreSize
                          |                     |
                          |                      --0.54%--0xffffffffa5400f4b
                          |

The engine was barely usable. The conclusion is that while theoretically supporting a vast number of databases, the engine struggles in practice. If client awareness of these limitations is already assumed, is lazy loading necessary? Assuming most customers use an O(1) number of databases, memory complexity would remain O(1), rendering lazy loading less critical.

If there’s a use case where lazy loading should be implemented, it should be orthogonal to this change. It should apply to both cluster and non-cluster mode.

xbasel avatar Jan 20 '25 18:01 xbasel

The default config for dbnum is 16, right? So any cluster with the default config will now get 16 instead of 1 database. It can be a memory usage regression. How much more memory does 15 empty databases consume in cluster mode?

zuiderkwast avatar Jan 20 '25 21:01 zuiderkwast

The default config for dbnum is 16, right? So any cluster with the default config will now get 16 instead of 1 database. It can be a memory usage regression. How much more memory does 15 empty databases consume in cluster mode?

@zuiderkwast

The default dbnum is 16, meaning clusters now allocate 16 databases instead of 1, potentially increasing memory usage. Results:

1 vs. 16 databases:
  • used_memory_rss_human: 7 MB for both.
  • used_memory_human: 1 MB (1 DB) vs. 10 MB (16 DBs).

100 databases:
  • 16k slots: 57.45 MB.
  • No slots: 940.43 KB.

1,000 databases:
  • 16k slots: 564.21 MB.
  • No slots: 1.44 MB.

100k databases:
  • 16k slots: 54.99 GB (RSS remained low: 221.88 MB).
  • No slots: 58.21 MB.

RSS stays low, but virtual memory (used_memory_human) scales with database count, especially with slots. This is due to the OS's lazy loading mechanism, where memory is reserved (virtual) but only physically allocated when accessed.

If users typically use a low number of databases, should we rely on the OS's lazy memory allocation instead of implementing application-level lazy loading? (And even if they allocate a large number of databases, would OS lazy loading be sufficient?)
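The behavior in question is easy to observe from user space. A small Python sketch (assuming a demand-paging OS such as Linux, where anonymous private mappings are zero-filled and physically backed only when touched):

```python
import mmap

# Reserve 256 MiB of virtual address space via an anonymous mapping
# (MAP_PRIVATE | MAP_ANONYMOUS under the hood on Linux). This is what
# allocators typically do for large allocations instead of sbrk.
size = 256 * 1024 * 1024
m = mmap.mmap(-1, size)

# Reading any byte returns zero without the process ever having written it;
# RSS grows only as pages are actually faulted in by writes.
first, last = m[0], m[size - 1]
m.close()
```

This mirrors the measurements above: used_memory (virtual reservation) scales with the database count, while RSS stays low until pages are touched.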

xbasel avatar Jan 21 '25 08:01 xbasel

@xbasel Interesting. I wonder if this is true in all OSes that we support and if the used_memory_human is important for users or not. Let's discuss these details in the PR and keep the issue for more high-level discussion. Feel free to do the simple solution without lazy. We can decide later if we need lazy or not.

zuiderkwast avatar Jan 21 '25 09:01 zuiderkwast

wonder if this is true in all OSes that we support and if the used_memory_human is important for users or not.

AFAIK many operating systems manage virtual memory using demand paging (allocating pages on access/page fault): all the BSD variants, macOS, mainframes, Solaris, etc. I cannot see how this is not a bug in our current implementation. Migrating users to start considering something other than used_memory is practically a breaking change with a great blast radius IMO. Not even starting to think about our maxmemory tracking, which also relies on zmalloc accounting and not RSS; this is simply not the way this application works. It is true that counting on RSS is a valid solution given a completely different application/system design, but Valkey has long based its memory accounting on the zmalloc system and the jemalloc API.

I think this has to be addressed as part of this effort.

Something that comes to mind: why do we see such a diff between RSS and used_memory? I thought it is because the kvstore is created with a fixed-size BIT (binary indexed tree), which must be sized for the max slot count. However, I see that it is allocated using zcalloc, which is supposed to claim the pages...

@xbasel do you know what is the reason for the difference? maybe the issue is not that hard to fix?

ranshid avatar Jan 21 '25 18:01 ranshid

@ranshid For large allocations, allocators often use mmap(MAP_ANONYMOUS) instead of sbrk, which provides zeroed pages without explicitly writing to memory (it is deferred). calloc likely utilizes this behavior for such allocations, though this should be double-checked.

       MAP_ANONYMOUS
              The mapping is not backed by any file; its contents are
              initialized to zero

https://elixir.bootlin.com/linux/v4.6/source/include/linux/highmem.h#L157 https://elixir.bootlin.com/linux/v4.6/source/mm/memory.c#L2767

I'm not sure if this issue is difficult to fix, but we essentially need to avoid preallocating databases and instead guard the array or whatever data structure we use to store database ids and allocate on demand. I assume this would require changes in many parts of the code.

This problem already exists, but introducing multiple databases in cluster mode would amplify it. If, in practice, no one has been affected by the current issue - likely because users typically use an O(1) number of databases, we should consider whether fixing this bug is truly necessary.

And if we decide to fix it, we can introduce logical database names. This could be possible if for example, we store database ids/names in a hashtable.

xbasel avatar Jan 22 '25 02:01 xbasel

For large allocations, allocators often use mmap(MAP_ANONYMOUS) instead of sbrk, which provides zeroed pages without explicitly writing to memory (it is deferred). calloc likely utilizes this behavior for such allocations, though this should be double-checked.

   MAP_ANONYMOUS
          The mapping is not backed by any file; its contents are
          initialized to zero

https://elixir.bootlin.com/linux/v4.6/source/include/linux/highmem.h#L157 https://elixir.bootlin.com/linux/v4.6/source/mm/memory.c#L2767

@xbasel thank you. I am aware of how anonymous allocation works, but I am not sure how jemalloc operates on top of the OS.

I'm not sure if this issue is difficult to fix, but we essentially need to avoid preallocating databases and instead guard the array or whatever data structure we use to store database ids and allocate on demand. I assume this would require changes in many parts of the code.

In any case, I would suggest testing whether placing the BIT creation under KVSTORE_ALLOCATE_HASHTABLES_ON_DEMAND can solve SOME of the issue. I think with 1K DBs the BIT should take a total of 125 MB (16K * sizeof(long long) * 1000), but maybe there is some overhead?
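The back-of-envelope number above checks out:

```python
# Rough size of the per-DB binary indexed tree (BIT) over the 16K slots,
# assuming one long long (8 bytes) per slot and 1000 databases.
NUM_SLOTS = 16 * 1024      # 16K cluster slots
SIZEOF_LONG_LONG = 8       # bytes on common 64-bit platforms
NUM_DBS = 1000

bit_bytes = NUM_SLOTS * SIZEOF_LONG_LONG * NUM_DBS
bit_mib = bit_bytes / (1024 * 1024)  # exactly 125.0 MiB
```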

This problem already exists, but introducing multiple databases in cluster mode would amplify it. If, in practice, no one has been affected by the current issue - likely because users typically use an O(1) number of databases, we should consider whether fixing this bug is truly necessary.

I also agree with what @soloestoy wrote here: using so many databases (e.g. 100K) should be considered a bad pattern to follow, but the fact that we will consume 0.5 GB for 1000 DBs sounds bad. Maybe the BIT is the challenge to keep tight to the number of slots, as its size should be static.

ranshid avatar Jan 22 '25 05:01 ranshid

@ranshid let's continue discussing the lazy initialization issue on https://github.com/valkey-io/valkey/issues/1597

xbasel avatar Jan 22 '25 11:01 xbasel

Didn't read the discussion, responding only with respect to the calloc...

I am aware of how anonymous allocation works, but I am not sure how jemalloc operates on top of the OS.

@xbasel \ @ranshid jemalloc uses memset to zero on calloc - see here for example (I didn't check if it optimizes special cases, but usually the actual allocation is not directly consumed from the mmap call).

zvi-code avatar Jan 28 '25 12:01 zvi-code

https://github.com/valkey-io/valkey/pull/1609 is a prerequisite for this change.

xbasel avatar Feb 05 '25 13:02 xbasel

https://github.com/valkey-io/valkey/pull/1671

xbasel avatar Feb 05 '25 20:02 xbasel

Multi-database support in cluster mode - Implementation Plan

Description

Introduce multi-database support in Valkey cluster mode without breaking existing behavior. Commands like SELECT, SWAPDB, MOVE, and COPY should be valid in clustered deployments, removing the single-database constraint and matching standalone workflows.

Motivation / use-case

  • Standalone-to-Cluster migration: Eliminates the need to rewrite multi-DB logic when switching from standalone to cluster mode.
  • Consistent Workflows: Aligns the standalone and cluster usage models, so users can maintain identical commands and patterns across deployment types.

Tenets

  1. Full Backward Compatibility – Existing users must not be impacted. Behavior remains unchanged unless multi-DB is explicitly used.
  2. Existing Cluster Mode Clients Can Use Multi-DB Seamlessly – Enabling multi-DB should not require migrating or restructuring existing databases in cluster mode.
  3. Standalone-to-Cluster Migration – Multi-DB setups in standalone mode must be able to migrate to cluster mode.
  4. Large Number of Databases Is Not a Common Use Case – Slot migration may take longer as clients iterate over databases, but this is already the case in standalone setups when moving keys between nodes.

Key Features

  1. Database-Agnostic Hashing:

    • Keys map to the same slot across all databases (identical to the existing behavior).
      • For example, key x will map to the same slot regardless of the database it is stored in.
    • Slot space unchanged. Slot distribution remains consistent, ensuring compatibility with existing setups.

  2. Backward Compatibility:

    • No API changes: Existing cluster commands and client integrations work unchanged. Single-database setups are unaffected.

  3. Cluster Management:

    • Most cluster management commands (e.g., CLUSTER SLOTS, CLUSTER NODES) remain global. They do not run in a selected database context.
    • GETKEYSINSLOT, COUNTKEYSINSLOT and MIGRATE operate on the selected database context.

  4. Slot Migration:

    • No workflow changes for clusters running on a single database (DB0).
    • For clusters with multiple databases, iterating over the databases is needed to migrate all keys in all databases.
    • valkey-cli resharding should be updated to handle multi-DB clusters.

  5. Memory Optimization:

    • The current Valkey implementation pre-allocates database structures. In cluster mode, 16k slots are allocated per database, which introduces memory overhead and a regression.
    • Lazy database initialization (via #1609) minimizes memory overhead for unused databases.

Implementation Details

Data structures: The existing array of databases (server.db) will not change. For every DB entry (database), there is an internal array of slot-based hashtables, as shown in the diagram below. Each hashtable represents one of the 16K cluster slots, so every DB ultimately contains 16K hashtables.

[Image: per-database arrays of slot-based hashtables]

Hashing - database agnostic: The slot calculation remains unchanged: a key’s hash slot is always determined by the same hashing procedure used in single-DB cluster mode. Now, instead of only referencing server.db[0].slots[...], the logic accesses server.db[N].slots[...] for the currently selected database N.
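The database-agnostic slot mapping can be sketched in Python. This is a from-scratch illustration of the documented CRC16 (XMODEM) / 16384-slot scheme with hash-tag handling, not Valkey's actual C implementation:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM): poly 0x1021, init 0, MSB-first, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def hash_slot(key: bytes) -> int:
    """Map a key to one of 16384 slots, honoring {hashtag} semantics:
    if the key contains a non-empty {...} section, only that part is hashed."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:  # non-empty tag
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384
```

Because the database number never enters the hash, server.db[N].slots[hash_slot(key)] locates key x at the same slot index for every database N.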

Command-Level Changes

  1. Cluster management commands remain global and operate in a global context rather than a specific database context. The exceptions are COUNTKEYSINSLOT and GETKEYSINSLOT, which retrieve or count keys from the slot belonging to the currently selected database instead of DB0.
  2. The MIGRATE command now runs in the context of the selected database. The destination-db parameter, currently used in standalone setups and always set to 0 in cluster mode, will now indicate which database the keys are transferred to on the target. This also enables cross-database transfers in cluster mode.
  3. SELECT / SWAPDB / MOVE / COPY will be modified to support cluster mode.

Replication: Nothing special here; all databases will be replicated the same way they are replicated in standalone setups.

Usage

There are no changes for customers using only DB0. Migrating keys/slots from one node to another is done the same way it is today:

Source: CLUSTER SETSLOT <slot> MIGRATING <TARGET>
Target: CLUSTER SETSLOT <slot> IMPORTING <SOURCE>

Source: MIGRATE host port "" 0 <timeout> KEYS <keys returned by GETKEYSINSLOT>

Source: CLUSTER SETSLOT <slot> node <TARGET>
Target: CLUSTER SETSLOT <slot> node <TARGET>

If multiple databases are used, then migrating keys/slots will be done as follows:

Source: CLUSTER SETSLOT <slot> MIGRATING <TARGET>
Target: CLUSTER SETSLOT <slot> IMPORTING <SOURCE>

Source: SELECT 0
Source: MIGRATE host port "" 0 <timeout> KEYS <keys returned by GETKEYSINSLOT>

Source: SELECT 1
Source: MIGRATE host port "" 1 <timeout> KEYS <keys returned by GETKEYSINSLOT>
.
.
.
Source: SELECT 15
Source: MIGRATE host port "" 15 <timeout> KEYS <keys returned by GETKEYSINSLOT>

Source: CLUSTER SETSLOT <slot> node <TARGET>
Target: CLUSTER SETSLOT <slot> node <TARGET>
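The per-database loop above could be automated by tooling. A Python sketch that only builds the command sequence (the helper name and inputs are hypothetical; a real tool would run these against the source node after collecting keys via CLUSTER GETKEYSINSLOT under each SELECT):

```python
def migration_commands(target_host, target_port, keys_per_db, timeout_ms=5000):
    """Build the per-database SELECT + MIGRATE sequence for one slot.

    keys_per_db: mapping of db index -> keys found in the migrating slot
    for that database. Databases with no keys in the slot are skipped.
    """
    commands = []
    for db, keys in sorted(keys_per_db.items()):
        if not keys:
            continue  # nothing to move in this database
        commands.append(f"SELECT {db}")
        commands.append(
            f'MIGRATE {target_host} {target_port} "" {db} {timeout_ms} '
            "KEYS " + " ".join(keys)
        )
    return commands
```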

Alternatives considered

We've considered database-aware hashing, for example an expanded slot space (16k per DB) where each database gets its own dedicated slot range (e.g., DB0 → 0–16383, DB1 → 16384–32767, etc.).

In theory it could provide better data isolation and perhaps easier migration and management (i.e., isolating specific databases on specific shards); however, it is not backward compatible, so it's a no-go.

PR: https://github.com/valkey-io/valkey/pull/1671

xbasel avatar Feb 06 '25 23:02 xbasel

Replying to https://github.com/valkey-io/valkey/issues/1681#issuecomment-2645954074 in this issue as the other one was closed

Thanks, @PingXie for your feedback!

I’m not sure I fully understand how this change would break the existing protocol. The existing MIGRATE command is not a cluster management command; it runs in both cluster mode and standalone mode and already operates within a selected database context: https://github.com/valkey-io/valkey/blob/unstable/src/cluster.c#L491. In cluster mode, clients always map to DB 0. Is CLUSTER MIGRATE a new proposed command?

Would you be able to provide a specific scenario where this change might break the existing protocol?

Regarding COUNTKEYSINSLOT and GETKEYSINSLOT, modifying COUNTKEYSINSLOT to return the total number of keys across all databases is feasible. However, changing GETKEYSINSLOT could be a breaking change, as it would require indicating the database number for each set of keys in the response.

As for atomic slot migration, I don’t currently see how this change would affect it. The new approach would work differently from the current protocol, and clients wouldn’t need to be concerned with databases since atomic migration should handle it internally - making it an implementation detail. Do you see any potential issues with this approach that I may have overlooked?

@murphyjacob4 @enjoy-binbin FYI

xbasel avatar Feb 09 '25 10:02 xbasel

@xbasel I think you're right that GETKEYSINSLOT, COUNTKEYSINSLOT and MIGRATE should operate on the selected database context. I can see how it can work in a backward-compatible way.

valkey-cli can be modified to do SELECT for each database and migrate the keys. For atomic slot migration, I don't see any issues either, and if there are, there is time to fix them.

I can see only one potential problem for users: if some tool other than valkey-cli is used for slot migration, such as a Kubernetes operator or some other control-plane logic, that tool also needs some work. If clients start using SELECT and the admin is not aware, then slot migration would miss those keys in db > 0.

zuiderkwast avatar Feb 09 '25 15:02 zuiderkwast