
[RFC] Cloud-native OpenSearch

Open yupeng9 opened this issue 8 months ago • 33 comments

Is your feature request related to a problem? Please describe

Elasticsearch (from which OpenSearch was forked) launched in 2010 and was initially designed for on-premises deployments and smaller-scale applications, at a time when cloud computing was not yet prevalent. This shows in several observed design patterns:

  • Bundled Roles: Elasticsearch combined multiple roles (data storage, search, master node responsibilities) within a single node. This simplified setup and management for single-node or small clusters.
  • Mesh Design for Metadata: Cluster metadata was propagated to every node using a mesh design. This provided resilience against node failures as any node could provide the necessary information.
  • Data Movement Algorithms: Elasticsearch included algorithms to move data around the cluster. This was essential for rebalancing data in response to cluster growth, node additions, or node removals.

While this design initially provided convenience and ease of use, it has encountered significant challenges as cloud computing has become dominant and use cases have grown in scale and complexity:

  • Reliability Challenges at Scale: Managing very large Elasticsearch clusters (e.g., exceeding 200 nodes) becomes increasingly difficult. Data movement operations and cluster metadata replication can be time-consuming and lead to cluster pauses.
  • Rolling Restarts and Upgrades: Performing rolling restarts or upgrades on large clusters can introduce latency spikes and potential data inconsistencies. Rollbacks can also be problematic. While Blue/Green deployments offer a safer alternative, they require significant additional capacity, which can be costly.
  • Challenges with Reactive Scaling: Dynamically adding or removing replicas in response to changes in serving traffic is challenging. Additionally, imbalances in traffic across shards can lead to uneven replica distribution, impacting performance and resource utilization.
  • Lack of Shard Distribution Control: Managing shard distribution in large clusters is complex. Fine-tuning shard size and distribution can be crucial for optimizing performance, especially for latency-sensitive applications.

We propose transitioning OpenSearch to a cloud-native architecture. The shift towards cloud computing and the demands of modern, large-scale applications necessitate a rethinking of OpenSearch's architecture. A cloud-native architecture, leveraging cloud services and adopting a shared-nothing design, can address many of these challenges:

  • Cloud Components: Utilizing cloud-based services for storage, compute, and networking can provide greater scalability, elasticity, and resilience.
  • Shared-Nothing Architecture: In a shared-nothing architecture, nodes operate independently and do not share resources. This can simplify cluster management, improve scalability, and reduce dependencies between nodes.

By embracing a cloud-native approach, OpenSearch can better meet the requirements of large-scale, cloud-based deployments. This transition can lead to improved operational efficiency, enhanced reliability, and greater scalability, enabling OpenSearch to handle the demands of modern data-intensive applications.

Describe the solution you'd like

Goal

  • Remove the 200-node limit on cluster size to enable high scalability.
  • Allow rolling upgrades and restarts with a 20% capacity buffer, eliminating the need for Blue/Green deployments.
  • Support rapid horizontal scaling per shard to provide elasticity.
  • Improve reliability and reduce operational work for large clusters.
  • Allow fine-tuning of shard size and distribution for better query performance.

Architecture

The proposed architecture is shown in Figure 1. It includes several modular changes compared to today's architecture.


Figure 1. Cloud-native OpenSearch architecture

External Metadata Storage

An important architectural change that can be made is the decoupling of logical metadata and physical cluster state, and storing the metadata in a shared external storage system. This would result in better scalability and faster access. Currently, cluster state propagation to all nodes can be slow in large clusters.

This idea was also proposed in an RFC that lists faster index creation/updates, better memory efficiency, and faster crash recovery among the benefits of this change. An additional benefit of external metadata storage is that it can simplify the control plane implementation and delegate durable metadata state management to an external system, such as etcd/ZooKeeper, which are known to be reliable distributed coordination systems. That RFC recommends RocksDB for metadata storage; however, RocksDB is not a distributed system and therefore does not provide the high availability and durability of strongly consistent KV storage systems such as etcd/ZooKeeper. Many technologies, including Kubernetes and Apache Kafka/Pinot/HDFS/YARN, utilize etcd and ZooKeeper for metadata storage and coordination. As the Hadoop ecosystem is being replaced by more modern, cloud-native, and real-time data tools, it is proposed that etcd be the default external metadata storage rather than ZooKeeper. The metadata storage layer can be abstracted to allow for other storage options, such as ZooKeeper.

Additionally, etcd provides capabilities such as watchers that are valuable for building reactive, event-driven features on the nodes, including leader election, service discovery, and dynamic configuration.
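
For illustration only, here is a minimal sketch of such a watcher using the jetcd client (io.etcd:jetcd-core); the endpoint and the key path are assumptions made up for this example, not part of any proposed metadata layout.

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.Watch;
import io.etcd.jetcd.watch.WatchEvent;

import java.nio.charset.StandardCharsets;

public class DynamicConfigWatcher {
    public static void main(String[] args) {
        // Connect to the external metadata store (endpoint is an assumption for this sketch).
        Client client = Client.builder().endpoints("http://localhost:2379").build();

        // Hypothetical key holding one piece of dynamic configuration.
        ByteSequence key = ByteSequence.from("/opensearch/config/refresh_interval", StandardCharsets.UTF_8);

        // Register a watcher: etcd pushes change events, so the node reacts instead of polling.
        Watch.Watcher watcher = client.getWatchClient().watch(key, Watch.listener(response -> {
            for (WatchEvent event : response.getEvents()) {
                String value = event.getKeyValue().getValue().toString(StandardCharsets.UTF_8);
                System.out.println(event.getEventType() + " -> " + value);
                // A real node would apply the new configuration locally here.
            }
        }));

        // In a real node the watcher lives for the process lifetime; close it on shutdown.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            watcher.close();
            client.close();
        }));
    }
}
```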

Shared-nothing Data Nodes

Another important architectural change to enable a cloud-native design is to make the data nodes shared-nothing. This means that each data node is independent and does not need to communicate with other data nodes. As shown in Figure 1, this is made possible by the following changes:

  • Use segment replication instead of document replication, so that index segments are fetched from remote storage (e.g., S3).
  • Use the external metadata storage described above to hold the logical data and its goal state, so that a new data node knows which indexes and shards it serves along with other configurations (see the sketch after this list). Unlike today's design, the data node fetches only the metadata relevant to it rather than the global metadata, and it does not need to communicate with the master node to collect the cluster state.
  • Use a dedicated coordinator for serving queries, so that the data node does not need to maintain the global cluster state for routing and forwarding queries to other data nodes as in today's design.
  • Pull-based ingestion (introduced in this RFC) can further enhance the shared-nothing design as it removes the need for a translog, so a data node can quickly recover by replaying messages from the streaming source. In fact, with pull-based ingestion we can even enable a leaderless (or primary-less) design, because all shard replicas are equivalent to the primary in terms of data ingestion capability.
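
Here is the sketch referenced above: a data node reading only its own goal state, again using jetcd. The key path /opensearch/goal-state/nodes/<node-id> and the JSON shape of the value are assumptions for illustration, not a finalized spec.

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.KV;
import io.etcd.jetcd.kv.GetResponse;

import java.nio.charset.StandardCharsets;

public class DataNodeGoalState {
    public static void main(String[] args) throws Exception {
        String nodeId = "data-node-42"; // illustrative node id

        try (Client client = Client.builder().endpoints("http://localhost:2379").build()) {
            KV kv = client.getKVClient();

            // The node reads only its own goal-state key, never the global metadata blob.
            ByteSequence key = ByteSequence.from("/opensearch/goal-state/nodes/" + nodeId, StandardCharsets.UTF_8);
            GetResponse response = kv.get(key).get();

            if (response.getKvs().isEmpty()) {
                System.out.println("No goal state assigned yet for " + nodeId);
            } else {
                // e.g. {"indices":{"logs-2025.05":{"shards":[0,2]}}} -- value shape is an assumption.
                String goalState = response.getKvs().get(0).getValue().toString(StandardCharsets.UTF_8);
                System.out.println("Goal state for " + nodeId + ": " + goalState);
                // The node would now pull segments from remote storage (e.g. S3) for exactly
                // these shards, start serving them, and write its actual state back.
            }
        }
    }
}
```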

This shared-nothing design brings in many benefits:

  • Horizontal scalability: data nodes can be scaled out linearly and quickly without any shared bottlenecks. This is valuable for auto-scaling based on traffic patterns.
  • Reduced contention: since the cluster state is not shared among data nodes, there is no waiting on its propagation.
  • Better fault isolation: since a data node does not depend on others, one data node crashing does not affect the other nodes, which improves cluster resilience.
  • Easier operation: since data nodes become more modular, they are easier to operate and maintain. We can add/remove data nodes much like microservices, and perform rolling restarts/upgrades without the 2x capacity needed for Blue/Green deployments.

Controller with Goal State vs Actual State Convergence Model

With the metadata stored in an external storage system, we could simplify the master node design to adopt the goal state vs. actual state convergence model, a design pattern proven to be resilient in large-scale distributed systems, with real-world examples like Kubernetes. A system that naturally drifts toward correctness can provide strong self-healing capabilities and thus ease operation and maintenance.

With the metadata in the external storage, the Controller can also store the goal state of the data nodes, such as the indexes and shards a given data node should serve. The data node in turn writes its actual state to the external storage, such as its health and the configurations of the indexes it is serving. The Controller can watch and compare the goal state and actual state, and update the goal state when needed, for example to assign the leader of a shard to a given data node.
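
The sketch below illustrates one possible shape of that convergence loop. The MetadataStore interface, key layout, and heartbeat-based health model are assumptions for this sketch, not a proposed API; the point is only that the Controller compares goal state with reported actual state and adjusts the goal state.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative convergence loop for the Controller. All types here are
 * placeholders: the controller compares the goal state it wrote with the
 * actual state reported by data nodes, and rewrites the goal state when
 * they drift apart (e.g. a node stops reporting).
 */
public class ControllerReconcileLoop {

    interface MetadataStore {
        Set<String> listDataNodes();
        Instant lastHeartbeat(String nodeId);                 // actual state written by the node
        Map<String, Set<Integer>> goalShards(String nodeId);  // goal state written by the controller
        void moveShard(String index, int shard, String fromNode, String toNode); // rewrite goal state
        String pickHealthyNode(String excludedNode);
    }

    private static final Duration HEARTBEAT_TIMEOUT = Duration.ofSeconds(30);
    private final MetadataStore store;

    ControllerReconcileLoop(MetadataStore store) {
        this.store = store;
    }

    /** One pass of goal-vs-actual comparison; in practice this would be driven by watch events. */
    void reconcile() {
        for (String nodeId : store.listDataNodes()) {
            boolean unhealthy = store.lastHeartbeat(nodeId)
                .isBefore(Instant.now().minus(HEARTBEAT_TIMEOUT));
            if (!unhealthy) {
                continue;
            }
            // The node's actual state has drifted from the goal state (it stopped reporting),
            // so update the goal state: hand its shards to healthy nodes.
            store.goalShards(nodeId).forEach((index, shards) ->
                shards.forEach(shard ->
                    store.moveShard(index, shard, nodeId, store.pickHealthyNode(nodeId))));
        }
    }
}
```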

Stateless Coordinator

With the external metadata storage, it's possible to make the coordinator a stateless service, so that it does not need to join the cluster like the data nodes do. It can watch the metadata store and therefore does not need to receive the cluster state from the master node. The cluster state includes the actual state of the data nodes, such as shard allocation and node health, which can be used to build the routing table the coordinator uses to dispatch queries. A stateless coordinator greatly simplifies deployment and operation, and it allows the coordinator tier to scale linearly and horizontally.
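
As a rough illustration of that routing flow, the sketch below keeps an in-memory routing table that a coordinator could refresh from watched metadata keys and use to pick a data node per shard; the table shape and update path are assumptions for this sketch, not an existing OpenSearch API.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Illustrative in-memory routing table for a stateless coordinator.
 * Entries are refreshed from watched metadata keys (e.g. in etcd), so the
 * coordinator never joins the cluster or receives pushed cluster state.
 * The key/value layout feeding this table is an assumption of this sketch.
 */
public class CoordinatorRoutingTable {

    // index name -> shard id -> data nodes currently serving that shard
    private final Map<String, Map<Integer, List<String>>> routing = new ConcurrentHashMap<>();

    /** Called from a metadata watcher whenever a shard allocation changes. */
    public void updateShard(String index, int shardId, List<String> nodes) {
        routing.computeIfAbsent(index, i -> new ConcurrentHashMap<>())
               .put(shardId, List.copyOf(nodes));
    }

    /** Pick a node to dispatch a query for one shard (random here; could be round-robin instead). */
    public String routeQuery(String index, int shardId) {
        List<String> nodes = routing.getOrDefault(index, Map.of()).get(shardId);
        if (nodes == null || nodes.isEmpty()) {
            throw new IllegalStateException("No replicas known for " + index + "[" + shardId + "]");
        }
        return nodes.get(ThreadLocalRandom.current().nextInt(nodes.size()));
    }
}
```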

Related component

Cluster Manager

Describe alternatives you've considered

No response

Additional context

This RFC is also written up in this Google Doc for comments and discussion.

yupeng9 avatar Apr 15 '25 23:04 yupeng9

Hi, @yupeng9, I think it's a great idea and would like to discuss it.

External Metadata Storage An important architectural change that can be made is the decoupling of logical metadata and physical cluster state, and storing the metadata in a shared external storage system. This would result in better scalability and faster access. Currently, cluster state propagation to all nodes can be slow in large clusters.

Will data nodes frequently access the external metadata storage (etcd/ZooKeeper) under the cloud-native architecture? Compared to the current design, where metadata is stored locally on the node, I am concerned that frequent access to external metadata over the network will have a performance impact. (Compared to k8s, online search engines may face more metadata updates?)

guojialiang92 avatar Apr 17 '25 02:04 guojialiang92

Hi, @yupeng9, I think it's a great idea and would like to discuss it.

External Metadata Storage An important architectural change that can be made is the decoupling of logical metadata and physical cluster state, and storing the metadata in a shared external storage system. This would result in better scalability and faster access. Currently, cluster state propagation to all nodes can be slow in large clusters.

Will data nodes frequently access external Metadata storage (etcd/zookeeper) under the Cloud-Native architecture? Compared to the current Metadata stored locally on the node, I am concerned that frequent access to external metadata through the network will have a performance impact. (Compared to k8s, online search engines may face more metadata updates?)

This is a good question. Usually etcd allows thousands of clients to connect, with 3k+ ops/sec for writes and 20k+ ops/sec for reads, based on its use in the Kubernetes stack. This should be sufficient to support a cluster of thousands of nodes based on known industry use. Also, each data node can cache metadata locally so that we reduce the frequency of access, and leverage watchers to be notified when changes are made. I'm also curious what kind of frequent metadata updates you have in mind?

yupeng9 avatar Apr 17 '25 15:04 yupeng9

@yupeng9 Thanks for creating the proposal. If I am not wrong, there have been several attempts in the past to build a cloud-native arch for OpenSearch (Ref: https://github.com/opensearch-project/OpenSearch/issues/13274 and another one you mentioned in your proposal; there may be more which I might be missing). This is a good proposal, but I want to understand what are your thoughts on some of the core challenges for such an arch:

  1. The current metadata/cluster state being stored in OpenSearch is stored as 1 big blob in the RAM of master nodes; does this new arch solve that problem? Because I think it's pretty critical for scaling to a large number of nodes.

In the proposal I see readers and writers of a shard on same node? So with this proposal we are not solving the reader/writer separation and shard placement problem?

Will etcd, where the cluster state is getting stored, be the only implementation? Or will we keep it open for extension?

This is a good question. Usually etcd allows thousands of clients to connect with 3k+ ops/sec for write and 20k+ ops/sec for read based on its use in kubernetes stack. This shall be sufficient to support a cluster of thousands of nodes based on the known industry use.

This ops/sec would be at a specific size of the response. With > 200 nodes and say thousands of indices with large index mappings, will we get same ops/sec?

navneet1v avatar Apr 19 '25 11:04 navneet1v

  1. The current metadata/cluster state being stored in OpenSearch is stored as 1 big blob in RAM of master nodes, does this new arch solves that problem? Because I think its pretty critical for scaling for large number of nodes.

Agreed. The separation of metadata and cluster state is part of this, and we will work on a spec to list the state and how we can separate it, so that we have finer-grained control over which piece of metadata needs to be stored/fetched.

In the proposal I see readers and writers of a shard on same node? So with this proposal we are not solving the reader/writer separation and shard placement problem?

Reader/Writer separation is an ongoing effort, so it's orthogonal to this effort, though reader/writer separation is a good pattern for cloud-native architecture.

etcd where cluster state is getting stored will that be the only implementation? or we will keep it open for extension?

It will be an API; the etcd-based implementation will be one plugin, and likely the one that we'll build the prototype with.

This is a good question. Usually etcd allows thousands of clients to connect with 3k+ ops/sec for write and 20k+ ops/sec for read based on its use in kubernetes stack. This shall be sufficient to support a cluster of thousands of nodes based on the known industry use.

This ops/sec would be at a specific size of the response. With > 200 nodes and say thousands of indices with large index mappings, will we get same ops/sec?

Right, this ties to the first question. If we always store the cluster state as one gigantic blob, then the throughput will be limited. But if we can store things at a finer granularity, and update only the key of the changed metadata, I believe we can improve the throughput. wdyt?
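
As a rough illustration of the "update only the changed key" idea (the key path and value are made up for this sketch, using the jetcd client):

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;

import java.nio.charset.StandardCharsets;

public class FineGrainedMetadataUpdate {
    public static void main(String[] args) throws Exception {
        try (Client client = Client.builder().endpoints("http://localhost:2379").build()) {
            // Only the changed piece of metadata is written: one small key per index
            // (key layout and value shape are assumptions for illustration).
            ByteSequence key = ByteSequence.from("/opensearch/metadata/indices/logs-2025.05/mappings",
                StandardCharsets.UTF_8);
            ByteSequence value = ByteSequence.from(
                "{\"properties\":{\"status\":{\"type\":\"keyword\"}}}", StandardCharsets.UTF_8);

            client.getKVClient().put(key, value).get();
            // Watchers on this index's prefix are notified; nodes that do not host
            // this index never see the update, keeping per-update payloads small.
        }
    }
}
```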

yupeng9 avatar Apr 21 '25 06:04 yupeng9

Hi @yupeng9 thanks for providing the response. Please find my response below:

  1. The current metadata/cluster state being stored in OpenSearch is stored as 1 big blob in RAM of master nodes, does this new arch solves that problem? Because I think its pretty critical for scaling for large number of nodes.

Agreed. The separation metadata and cluster state is part of this, and we will work on a spec to list the state and also how we can separate them. so that we can have a more finer-granular control which piece of metadata needs to be stored/fetched.

Super awesome. It will be good if we can mention this in the proposal.

In the proposal I see readers and writers of a shard on same node? So with this proposal we are not solving the reader/writer separation and shard placement problem?

Reader/Writer separation is an ongoing effort, so it's orthogonal to this effort, though reader/writer separation is a good pattern for cloud-native architecture.

agreed.

etcd where cluster state is getting stored will that be the only implementation? or we will keep it open for extension?

It will be an API, and etcd-based implementation will be one plugin and the likely one that we'll build the prototype with.

Makes sense. Love the idea of a plugin; will look forward to more details on this.

This is a good question. Usually etcd allows thousands of clients to connect with 3k+ ops/sec for write and 20k+ ops/sec for read based on its use in kubernetes stack. This shall be sufficient to support a cluster of thousands of nodes based on the known industry use.

This ops/sec would be at a specific size of the response. With > 200 nodes and say thousands of indices with large index mappings, will we get same ops/sec?

Right, this ties to the first question. If we always store a gigantic blob cluster state, then the throughput will be limited. But if we can store things in finer granular, and update the key of the changed metadata only, I believe we can improve the throughput. wdyt?

If the metadata is split into smaller, more granular chunks, then yes, the quoted throughput should be achievable. Will be looking forward to the proposal on splitting the metadata into chunks.

I have one more question/ask/thought and would like to know what you think about it: with the proposal mentioned in this GH issue, will we be able to achieve, say, millions of indexes in a cluster? Because in my mind the problem was always with metadata and shard management, and I think with this proposal it should be resolved. If we are not thinking about this, can we take it as a requirement for the current proposal?

navneet1v avatar Apr 21 '25 06:04 navneet1v

This is a good question. Usually etcd allows thousands of clients to connect with 3k+ ops/sec for write and 20k+ ops/sec for read based on its use in kubernetes stack. This shall be sufficient to support a cluster of thousands of nodes based on the known industry use. Also, each data nodes can also cache metadata locally so that we can reduce the frequency of access and leverage watchers to be notified when changes are made. I'm also curious about kind of frequent metadata updates that you have in mind?

Thanks, @yupeng9. The access frequency may depend on the number of nodes / number of indices / mapping size. In this architecture, we may no longer need to split into more clusters for stability, so millions of indexes would be attractive to me.

I still have some questions I want to confirm with you:

  1. Is there still any form of communication between the master node and the data node in the cloud-native architecture? In my understanding, it is not necessary. Under the new architecture, only the master node needs to upload the target state to the external metadata storage, and the data node subscribes to the target state in the metadata storage and reports the actual state.
  2. Splitting a big blob of metadata into finer grains is very helpful for the scalability of the cluster (Based on your discussion with @navneet1v). Also, I don't think this job relies on external metadata storage. In other words, are these two jobs orthogonal?

guojialiang92 avatar Apr 21 '25 08:04 guojialiang92

2. Splitting a big blob of metadata into finer grains is very helpful for the scalability of the cluster (Based on your discussion with @navneet1v). Also, I don't think this job relies on external metadata storage. In other words, are these two jobs orthogonal?

@guojialiang92 you are right that splitting the metadata and external metadata storage are independent, but splitting metadata into smaller chunks actually makes the external metadata storage simpler and more scalable. It also ensures that clusters with a large number of nodes can be created (>200 nodes), which was one of the main motivations for this proposal.

navneet1v avatar Apr 21 '25 09:04 navneet1v

Thanks, @yupeng9. The access frequency may depend on the number of nodes/number of indices/mapping size. In this architecture, we may no longer need to split more clusters for stability, so millions of indexes may be attractive to me.

Curious, millions of indexes is a lot, why do we need to support this much?

I still have some questions I want to confirm with you:

  1. Is there still any form of communication between the master node and the data node in the Cloud Native architecture? In my understanding, it is not necessary. Under the new architecture, only the master node needs to upload the target state to the external Metadata storage, and the data node subscribes to the target state to the Metadata storage and reports the actual state.

Yes, that's my feeling too: the master node needs to update the goal state, and it does not need to handle state propagation.

  2. Splitting a big blob of metadata into finer grains is very helpful for the scalability of the cluster (Based on your discussion with @navneet1v). Also, I don't think this job relies on external metadata storage. In other words, are these two jobs orthogonal?

Yeah, these two can be independent, but when we design the content of the external metadata storage, we need a spec of the metadata layout.

yupeng9 avatar Apr 22 '25 09:04 yupeng9

Thanks, @yupeng9. The access frequency may depend on the number of nodes/number of indices/mapping size. In this architecture, we may no longer need to split more clusters for stability, so millions of indexes may be attractive to me.

Curious, millions of indexes is a lot, why do we need to support this much?

@yupeng9 I don't think millions of indices is a lot. If you look at systems like Turbopuffer, they support tens of millions of indices: https://turbopuffer.com/docs/limits . The main use case for millions of indices is multi-tenancy and tenant isolation. The idea is that rather than creating one single giant index with all the tenants' data and then searching using filters, the user can create an index per tenant and then search based on that. This is very clean, easy to scale and maintain from the user side, and provides great query latency.

So I believe with this proposal we should be able to achieve millions of indices. I do agree we might need more work in other parts of OpenSearch to make this work, but as per my deep-dive and understanding, splitting metadata into smaller chunks is one of the requirements to support that many indices, followed by not keeping shards on the node (writable warm) to reduce inode usage.

navneet1v avatar Apr 22 '25 09:04 navneet1v

@yupeng9 I don't think so millions of indices is a lot. If you look at systems like Turbopuffer they support tens of millions of indices. https://turbopuffer.com/docs/limits . The main use case of millions of indices is multi-tenancy and tenant isolation. Idea is rather than creating 1 single giant index with all the tenants data and then search using filters, user can create indices per tenant and then search based on that. This is very clean, easy to scale and maintain from a user side, provides awesome query latency.

Yes, I agree multi-tenancy makes sense, but millions of indices is still a lot. Would that mean tens of thousands of customers sharing the same OpenSearch cluster? I feel the data size and cluster size will run into their limits before the number of indices does.

yupeng9 avatar Apr 24 '25 04:04 yupeng9

Thanks, @yupeng9! This is a great idea.

I'm going to start documenting some of the APIs currently implicated in managing/applying cluster state to come up with a design that can cleanly abstract away the current cluster manager model.


The main components right now are ClusterService, ClusterApplierService, ClusterManagerService, and Coordinator. My preliminary view is that data nodes and coordinators only really need ClusterService (to read the current cluster state) and ClusterApplierService (to listen for cluster state changes). On the cluster manager side of things, we have ClusterManagerService handling cluster state changes and publishing them via a ClusterStatePublisher. The Coordinator is responsible for nodes joining/leaving the cluster, and provides the default ClusterStatePublisher implementation.

As a very hand-wavy implementation, I think we can run the "Controller" component that you described as a standalone cluster with only one cluster manager node. That would accept cluster state updates. We could inject a different ClusterStatePublisher implementation, where instead of broadcasting via Coordinator, it could publish cluster state to ETCD, Cassandra, DynamoDB, whatever. Meanwhile, the data nodes could have a dummy ClusterManagerService and Coordinator. Their implementation of ClusterApplierService would read the published cluster state. It's not perfect, but I think it might be enough to get started.

msfroh avatar Apr 28 '25 19:04 msfroh

@yupeng9 I don't think so millions of indices is a lot. If you look at systems like Turbopuffer they support tens of millions of indices. https://turbopuffer.com/docs/limits . The main use case of millions of indices is multi-tenancy and tenant isolation. Idea is rather than creating 1 single giant index with all the tenants data and then search using filters, user can create indices per tenant and then search based on that. This is very clean, easy to scale and maintain from a user side, provides awesome query latency.

Yes, I agree multi-tenancy makes sense but millions of indices is still a lot.. Would that mean tens of thousands of customers sharding the same opensearch cluster? I feel the data size and cluster size will first run into the limit before the number of indices

I disagree on the part about running out of cluster size. You can still create a cluster with 1 index with 10 billion docs, and you can also have 100K docs per index, which would give us at least 100K indices. And now, with more than 200 nodes, we can go up to 100B docs and more.

navneet1v avatar Apr 29 '25 17:04 navneet1v

I have an update on progress on a PoC for this. I'm pushing my changes to https://github.com/msfroh/OpenSearch/tree/clusterless_datanode.

  1. I made a relatively small change to let nodes start up without joining a cluster. Essentially, I commented out anything related to Discovery. Obviously a real implementation would need to make this stuff conditional.
  2. I have another, bigger change that adds an API to write cluster state to a node. I was able to push an index mapping to the node and initialize a shard. ~~Transitioning the shard from "initializing" to "started" still fails, but that's probably my fault.~~ Yup. My cluster state was bad. The human-readable primary_terms in index metadata is an object, but the API expects an array. Now I can start + search a shard with that commit. (No more code changes required.)

My thinking is that a controller can write cluster state to ETCD (or DynamoDB, Cassandra, etc). An agent running on data nodes (writers or readers) can watch for cluster state changes and publish filtered cluster state (with just the information the local node needs) to the local node. For a data node, that means only metadata for indices with shards hosted on the node, as well as a single node routing table for the current node. An agent running on coordinators will also publish cluster state, but it can exclude most index metadata (mappings, most settings). The coordinator needs a routing table with readers and their shards, so it knows how to fan out search requests.

I haven't decided if the agent should be part of the OpenSearch process. For my PoC, I'm playing the part of the agent using curl.

msfroh avatar May 07 '25 18:05 msfroh

looks cool. thanks!

An agent running on data nodes (writers or readers) can watch for cluster state changes and publish filtered cluster state (with just the information the local node needs) to the local node.

In certain environments, it might not be easy to deploy an agent on a host. Can we have a module/plugin in OpenSearch that can watch for cluster state changes and invoke the API, to make this more self-contained? Also, etcd is a key-value store, so we can organize the cluster state as key-value pairs, so that we do not need the filtering but can directly watch the relevant keys.

yupeng9 avatar May 07 '25 20:05 yupeng9

In certain environments, it might not be easy to deploy an agent on a host. Can we have a module/plugin in OpenSearch that can watch for cluster state change and invoke the API to make this more self-contained?

Yes, anything we do out-of-process can be done in-process too. In fact, doing it in-process should make the code changes a lot smaller -- we don't need to deserialize an XContent representation of the cluster state.

Also ETCD is a key-value store, so we can also organize the cluster state as key-value pairs, so that we do not need the filtering but directly watch the relevant keys.

Yes, that should be doable. In particular, I'm leaning toward the idea of the agent/plugin/per-node "thing" reading from the store and building the ClusterState object locally. That way, we're not forced to distribute OpenSearch's cluster state.

msfroh avatar May 08 '25 01:05 msfroh

Thanks @yupeng9 for starting the discussion.

Regarding the handling of cluster state in the cloud native architecture. There are multiple things to consider:

  1. Storing the state in a remote store, with each data node fetching the state from remote storage. This already exists in OpenSearch. The backend storage is independent of the in-memory cluster state data structure, and the in-memory state is re-created using remote store objects. Metadata storage has its own versioning. Today, the CM informs the data nodes that new state is available in the remote store via publication (push), but you could also build your watcher/poller on the remote storage directly in the plugin (let's say).
  2. On-demand fetching of index metadata only for indices which are assigned to a particular data node. This is not implemented yet and is discussed in the Remote state RFC.
  3. Breaking the cluster state into granular objects like index metadata, shard routing allocation, snapshots, plugins' custom metadata, data streams, etc. This would benefit parallel processing of state and also allow handling state in a more scalable fashion where it is not treated as a single blob. This would require changing the ClusterStateExecutors. I am not sure if it is attempted here.
  4. Refactoring the metadata management as a library which is not tied to the in-memory cluster state object and can be plugged in easily for an in-cluster cluster manager setup or in a cloud-native distributed setup where it could be hosted as a multi-tenant service as well.
  5. Eventually making the Cluster Manager stateless, where any node or system can process updates to the cluster state by directly interacting with the remote store, as long as the remote store provides conditional-write semantics to ensure strong consistency. This would provide infinite scale, not bound by the limits of a single cluster manager node.
  6. Make metadata and shard routing management agnostic of the OpenSearch version and plugin versions so that we don't end up maintaining a large number of versions when hosting it in a multi-tenant fashion.

@msfroh I have not looked at your code yet. Are you just looking to move cluster state store interaction to a plugin?

shwetathareja avatar May 08 '25 07:05 shwetathareja

Thanks for sharing the thoughts! I agree with these things to consider.

Thanks @yupeng9 for starting the discussion.

Regarding the handling of cluster state in the cloud native architecture. There are multiple things to consider:

  1. Storing the state to a remote store and each data node fetches the state from remote storage. This already exists in OpenSearch. The backend storage is independent of the in-memory Cluster State data structure and in memory state is re-created using remote store objects. Metadata storage has its own versioning. Today, CM informs the data nodes that new state is available in remote store via publication (push) but you can build your watcher/ poller on the remote storage directly as well in the plugin (lets say).
  2. On-demand fetching of index metadata only for indices which are assigned to a particular data node. This is not implemented yet and discussed in Remote state RFC.
  3. Breaking the cluster state in granular objects like index metadata, shard routing allocation, snapshots, plugins custom metadata, data streams, etc. This would benefit for parallel processing of state and also handling of state in more scalable fashion where it is not treated as single blob. This would require changing the ClusterStateExecutors. I am not sure if it is attempted here.

Yes, with this we can store the granular objects as k/v pairs in etcd, so that we can watch their changes and react to state changes more efficiently.

  4. Refactoring the metadata management as a library which is not tied to in-memory cluster state object and can be plugged in easily for in-cluster cluster manager setup or in a cloud native distributed setup where it could be hosted as a multi tenant service as well.
  5. Eventually making Cluster Manager Stateless where any node or system can process the updates to cluster state by directly interacting with remote store as long as the remote store provides conditional writes semantics to ensure strong consistency. This would provide the infinite scale and not bound by limits of single cluster manager node.
  6. Make metadata and shard routing management agnostic of the OpenSearch version and plugin versions so that we don't end maintaining large no. of versions when hosting it in multi tenant fashion

@msfroh I have not looked at your code yet. Are you just looking to move cluster state store interaction to a plugin?

yupeng9 avatar May 08 '25 17:05 yupeng9

@msfroh I have not looked at your code yet. Are you just looking to move cluster state store interaction to a plugin?

@shwetathareja, I was essentially planning to do the 6 things outlined in your comment. So far, I've confirmed that a data node can work based on some "synthetic" cluster state without actually being part of a cluster.

msfroh avatar May 08 '25 17:05 msfroh

@rajiv-kv has also started looking into the Stateless Cluster Manager, and as a first step he is looking into refactoring the metadata lib. @rajiv-kv to share more details. I think we should collaborate here so that we are not duplicating effort.

shwetathareja avatar May 09 '25 03:05 shwetathareja

Thanks @yupeng9 for a detailed proposal. I like the idea of providing extension points in cluster manager for leveraging cloud infrastructure.

I had a few follow-up questions.

  1. With poll-based metadata propagation, it is being suggested that an eventually consistent (or even inconsistent) view of metadata is acceptable. However, the existing design strives for strong consistency of metadata across nodes in the cluster. Do you think this deviates from fundamental tenets of OpenSearch, and might there be more components (other than the cluster manager) that could potentially be impacted because this assumption no longer holds true?
  2. What are your thoughts around multi-tenancy of the cluster manager for metadata management? With on-premises, every customer gets their own deployment, which will not be true in the cloud. A single deployment of the cluster manager might have to manage the metadata for multiple tenants.
  3. Index stats is another case where nodes broadcast over transport to retrieve information from the individual nodes of the cluster. Do you see the coordinator nodes also hosting the admin endpoints of OpenSearch? If so, we need to consider mechanisms for retrieval of metadata stats in the cloud along with the metadata.
  4. Some of the goals listed in the RFC (200 nodes, reliability, large cluster state) have been addressed by Remote Publication, Async Shard Fetch, and improvements to the cluster manager in 2.17. Do you still see them as concerns?

@msfroh - As part of state and routing persistence, there are interfaces exposed in cluster manager that could be leveraged for external metadata storage. Please take a look at PersistedState and its S3 implementation RemotePersistedState for details. Likewise, datanodes can download the state directly from remote with the entry point into RemoteClusterStateService. You could refer to this 15424 where admin APIs leverage persisted state from remote. I believe these interfaces can be extended to persist & retrieve state to etcd as well. Please take a look and let me know your thoughts on extending these interfaces for etcd based persistence.

As part of the stateless cluster manager, I was thinking of initially focusing on separating out persistence of index metadata from cluster state. This is more in line with an existing proposal here (13197) to extract metadata-commons as a library and use it across the server module and cloud-native implementations. Subsequently, as a next step, we could introduce interfaces around cluster state management and shard management to abstract out the existing implementation of the task queue and reroute in core. For cloud native, the interfaces could be implemented outside the core, where coordination can be achieved by interacting with remote service(s). Currently these are in the prototype phase, and I will follow up with a proposal on these two areas to get thoughts from the community.

rajiv-kv avatar May 09 '25 12:05 rajiv-kv

Thanks @yupeng9 for a detailed proposal. I like the idea of providing extension points in cluster manager for leveraging cloud infrastructure.

I had few follow-up questions.

  1. With poll based metadata propagation, it is being suggested that eventual consistent (or even inconsistent) view of metadata is acceptable. However the existing design strives for strong consistency of metadata across nodes in cluster. Do you think this deviates from fundamental tenets of OpenSearch and there might be more components (other than cluster manager) that could potentially be impacted because this assumption does not hold true ?

I want to better understand the consistency here. As the cluster manager talks to each node individually, the cluster state will be eventually consistent across the nodes, right?

  2. What are you thoughts around multi-tenancy of cluster manager for metadata management ? With on-premise, every customer gets own deployment, which will not be true in cloud. A single deployment of cluster manager might have to manage the metadata for multiple tenants.

I feel this is an orthogonal problem, as the metadata storage decoupling does not mean the cluster manager cannot manage the metadata for multiple tenants. However, the external metadata storage provides more flexibility in how this metadata is generated, and thus enables more plugin options for the control plane than the existing cluster manager.

  3. Index Stats is another information where nodes broadcast over transport to retrieve information from the individual nodes of cluster. Do you see the coordinator nodes also hosting admin endpoints of OpenSearch ? If so, we need to consider mechanisms for retrieval of metadata stats in cloud along with metadata.

This is a good question, and it depends on how we position the coordinator nodes. I feel coordinators are mainly used by the data plane, i.e. read/write traffic, so they may not need the admin endpoints.

  4. Some of the goals listed in RFC (200 node, reliability, large cluster state) have been addressed in Remote Publication, Async Shard Fetch and improvements of cluster manager in 2.17. Do you still see them as concerns ?

Thanks for sharing those RFCs; it's good to see those improvements. We have not adopted those features yet, but we shall give them a try. The 200-node constraint stems both from our operational experience (e.g. large metadata slowing down the cluster) and the information we gathered from other companies.

yupeng9 avatar May 10 '25 14:05 yupeng9

I've made more progress on my prototype (everything is pushed to my branch).

The workflow is now as follows:

  1. You can push index metadata for each index independently to etcd, using the index name as a key and the JSON metadata as the value.
  2. For a data node, you can assign shards by writing a node config value to a key corresponding to the node ID. (In practice node IDs are probably not the best choice, but it works well for my prototype.) The node config value looks like {"local_shards":{"myindex":{"0":"PRIMARY", "1": "REPLICA" }, "some_other_index": { "0" : "SEARCH_REPLICA" } }}. The data nodes will only retrieve index metadata for the shards that they host.
  3. For a coordinator node, you don't need to provide any index metadata. You just need to point from index names to the per-shard routing.

For both data nodes and coordinator nodes, on startup they register a watcher for their node ID and get notified when there's a change.
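
To make that key scheme concrete, here is a rough sketch of seeding the two kinds of keys with jetcd; the key names are guesses for illustration and may not match what the branch actually expects.

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.KV;

import java.nio.charset.StandardCharsets;

public class PrototypeStateSeed {
    private static ByteSequence bs(String s) {
        return ByteSequence.from(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        try (Client client = Client.builder().endpoints("http://localhost:2379").build()) {
            KV kv = client.getKVClient();

            // 1. Index metadata, one key per index name (placeholder value for illustration).
            kv.put(bs("myindex"), bs("{ ...index metadata JSON... }")).get();

            // 2. Shard assignment for a data node, keyed by node id, as described above.
            kv.put(bs("data-node-1"),
                   bs("{\"local_shards\":{\"myindex\":{\"0\":\"PRIMARY\",\"1\":\"REPLICA\"}}}")).get();

            // The node's watcher on its own key picks this up and loads only the
            // index metadata for the shards it hosts.
        }
    }
}
```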

msfroh avatar May 14 '25 22:05 msfroh

@yupeng9 I don't think so millions of indices is a lot. If you look at systems like Turbopuffer they support tens of millions of indices. https://turbopuffer.com/docs/limits . The main use case of millions of indices is multi-tenancy and tenant isolation. Idea is rather than creating 1 single giant index with all the tenants data and then search using filters, user can create indices per tenant and then search based on that. This is very clean, easy to scale and maintain from a user side, provides awesome query latency.

Yes, I agree multi-tenancy makes sense but millions of indices is still a lot.. Would that mean tens of thousands of customers sharding the same opensearch cluster? I feel the data size and cluster size will first run into the limit before the number of indices

@yupeng9 Multi-tenancy is one use case, but I thought of other places too where millions of indices will be very helpful. Example:

Say, as a user of OpenSearch, I operate a hyper-local business where I want to index information at the zip-code/city level, one index per locality, rather than putting everything in a single giant index and using filters to do the search. The solution becomes very promising because:

  1. Now I can scale indices individually rather than scaling everything at once.
  2. Snapshots, merges, recovery, and blast radius all become much smaller, as every index is encapsulated separately.

Let me know what you think on this.

navneet1v avatar May 15 '25 19:05 navneet1v

@navneet1v -- In theory, it would be possible to handle the kind of scale you're describing with this kind of solution. If each index has metadata stored in the central store (etcd, Cassandra, DynamoDB, etc), it becomes very easy to allocate shards "on-demand".

When an indexing or search request comes in, if the shard(s) to which the request would be routed are not currently assigned to any nodes, we could assign them, process the request, then unassign them (or treat it like an LRU cache, where they get evicted when the capacity is needed for something else). Since nodes only load metadata for currently-assigned shards, storing metadata for millions of indices is fine.

It's easy on the read side -- the write path is trickier because we need to make sure that two nodes aren't trying to write to the same shard, so there does need to be consistency around allocating primary shards.
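
A toy sketch of that read-side caching behavior (the types and callbacks are placeholders; real assignment would go through the controller and the metadata store, and the write path still needs the consistency discussed above):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy illustration of treating shard assignments like an LRU cache on the
 * read path: shards are attached on demand and the least recently used one
 * is released when capacity is needed. Types and callbacks are placeholders.
 */
public class OnDemandShardCache {

    interface ShardLifecycle {
        void attach(String indexAndShard);  // load metadata + segments for the shard
        void release(String indexAndShard); // drop the shard to free capacity
    }

    private final ShardLifecycle lifecycle;
    private final Map<String, Boolean> assigned;

    OnDemandShardCache(int capacity, ShardLifecycle lifecycle) {
        this.lifecycle = lifecycle;
        // accessOrder=true gives LRU iteration order; removeEldestEntry handles eviction.
        this.assigned = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                if (size() > capacity) {
                    lifecycle.release(eldest.getKey());
                    return true;
                }
                return false;
            }
        };
    }

    /** Ensure a shard is assigned before serving a request that touches it. */
    void ensureAssigned(String indexAndShard) {
        if (!assigned.containsKey(indexAndShard)) {
            lifecycle.attach(indexAndShard);          // lazily assign when a request needs it
            assigned.put(indexAndShard, Boolean.TRUE); // may evict the least recently used shard
        } else {
            assigned.get(indexAndShard);               // touch to refresh LRU order
        }
    }
}
```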

msfroh avatar May 15 '25 20:05 msfroh

@msfroh Exactly my point and I tried to ask the same here: https://github.com/opensearch-project/OpenSearch/issues/17957#issuecomment-2817783634

I think this change is in the right direction. But my question is: are you guys thinking about this kind of scale as testing criteria for this feature?

navneet1v avatar May 17 '25 16:05 navneet1v

We already have a remote persistence layer that uses the underlying blob storage for state management. It isn't strictly using conditional-write semantics, but that should be fairly easy to integrate. It provides the needed scalability/availability and durability properties; also, it might be pretty straightforward to build watchers on top using something like event notifications on blobs. Sorry, I might have missed out on some threads, but I would like to understand what properties wouldn't be satisfied through the blob store based persistence layer.

Bukhtawar avatar May 25 '25 06:05 Bukhtawar

Sorry I might have missed out on some threads, but I would like to understand what properties wouldn't be satisfied through the blob store based persistence layer

There's a much bigger problem to solve: cluster state is a large, unwieldy mass, and every node receives the full cluster state, which contains a bunch of stuff that most nodes don't need.

While I realize that my branch has hundreds of lines of code and may be too much to read through, this class shows how we can propagate specific information to nodes that need it. Data nodes get index metadata only for the indices for which they host shards and know about no other nodes in the cluster. Coordinator nodes get enough information to route requests to data nodes, and nothing more.

My branch is in a "done for now" state, while I work on creating an initial implementation of the Controller, which can live outside of OpenSearch and won't share any code with it. ~~If anyone wants to try out my branch, I'm happy to write up a quickstart guide for running etcd locally and using that to control a node's behavior.~~ I wrote a short "Getting Started" guide. I'll try it out this evening on my Linux laptop to confirm that it works there too.

msfroh avatar May 27 '25 21:05 msfroh

@msfroh The remote cluster state implementation takes care of storing each IndexMetadata as a separate object, and all the IndexMetadata are tracked by the ClusterMetadataManifest file in remote. Today, the download logic downloads the full or differential cluster state here. This can easily be updated to consider downloading only index metadata for shards assigned to that node. This is how we are planning to implement (in the OpenSearch core) lazy download of index metadata on the data nodes for the shards assigned to that node. etcd can be added as another remote implementation (similar to S3) in the plugin as needed. We may need to make certain remote store interfaces for cluster state public which are currently marked internal. You can look at RemoteIndexMetadataManager for a sample remote store interaction for read/write.

shwetathareja avatar Jun 05 '25 07:06 shwetathareja

This can easily be updated to consider downloading only index metadata for shards assigned to that node.

Thanks, @shwetathareja!

I'm going to keep moving forward with my implementation for now, since it starts from the (IMO much more interesting) premise of "What if we could run OpenSearch without a cluster?" That's the fundamental mistake that was made 15 years ago -- the cluster model just doesn't scale. So, instead of slowly rolling out a series of patches to try to make the broken cluster model a little less painful, I'm going to start from the point of not having a cluster at all. (As I said, my branch already does this, with 4 weeks of part-time effort by one person. See the README that I wrote to see what's possible so far.)

Over time, we can probably share components back and forth between the implementations. I don't believe we're trying to solve the same problem -- or if we are, we're taking radically different approaches and I believe that my approach is the correct one.

msfroh avatar Jun 05 '25 15:06 msfroh

I don't believe we're trying to solve the same problem -- or if we are, we're taking radically different approaches and I believe that my approach is the correct one.

I think we are trying to solve the same problem, where the metadata CRUD lifecycle and state processing are moved out of the cluster completely. The data and coordinator nodes fetch the required metadata on demand from different storage. Yes, definitely, everyone has the freedom to experiment with their implementation and innovate. But I think it's early to conclude which is correct, and that would require broader discussion.

shwetathareja avatar Jun 06 '25 03:06 shwetathareja