
Clustering capabilities (write master w/ read-only slave followers)

Open yuer1727 opened this issue 5 years ago • 21 comments

Hello valeriansaliou: do you have any plan to develop a multi-node version? I mean that high availability and scalability are important for Sonic to become popular and be used in production environments.

yuer1727 avatar Apr 29 '19 12:04 yuer1727

No, not planned.

valeriansaliou avatar Apr 29 '19 12:04 valeriansaliou

Though if you're interested, I'm still open to contributions, provided you can work it out in a simple way.

valeriansaliou avatar Apr 29 '19 12:04 valeriansaliou

Here's a simple idea for clustering, correct me if I am missing something:

Implement a "proxy" in front of 2+ nodes of sonic which would do two things:

  • Maintain channel connections to the nodes, proxying requests from it's own clients. Ingest requests go to EVERY node, Search requests go to a random node (round robin?).
  • Have a "snapshot" of either the whole index or replay-log on file in order to seed a new cluster node with data when it joins cluster

Bonus badass points for every node being able to act as such a proxy, elected by a consensus algorithm, instead of an appointed-once proxy (which would become a single point of failure).
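To make the idea concrete, here is a minimal, transport-agnostic sketch of that proxy in Rust. The `NodeClient` type and its `send` method are hypothetical stand-ins for a real Sonic Channel connection:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct NodeClient {
    addr: String,
}

impl NodeClient {
    // Placeholder: a real implementation would write the command over a
    // persistent Sonic Channel TCP connection and read the reply.
    fn send(&self, command: &str) -> Result<String, std::io::Error> {
        println!("[{}] {}", self.addr, command);
        Ok("OK".to_string())
    }
}

struct ClusterProxy {
    nodes: Vec<NodeClient>,
    next: AtomicUsize, // round-robin cursor for reads
}

impl ClusterProxy {
    // Ingest commands (PUSH, POP, FLUSH*) are fanned out to every node.
    fn ingest(&self, command: &str) {
        for node in &self.nodes {
            let _ = node.send(command);
        }
    }

    // Search commands (QUERY, SUGGEST) go to a single node, round-robin.
    fn search(&self, command: &str) -> Result<String, std::io::Error> {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.nodes.len();
        self.nodes[i].send(command)
    }
}

fn main() {
    let proxy = ClusterProxy {
        nodes: vec![
            NodeClient { addr: "10.0.0.1:1491".into() },
            NodeClient { addr: "10.0.0.2:1491".into() },
        ],
        next: AtomicUsize::new(0),
    };
    proxy.ingest("PUSH messages user:1 conv:1 \"hello world\"");
    let _ = proxy.search("QUERY messages user:1 \"hello\"");
}
```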

toxuin avatar May 23 '19 15:05 toxuin

That would be a pragmatic solution, but I'd be more open to a built-in replication system in Sonic, that may use the Sonic Channel protocol to 'SYNC' between nodes.

With the proxy solution, how would you implement synchronization of 'lost' commands when a given node goes offline for a few seconds? Same as for new nodes, with a full replay from the point it lost synchronization? A replay method would also imply that the proxy stores all executed commands over time, and ideally periodically 'compacts' commands when, e.g., you have a PUSH and then a FLUSHB or FLUSHO that would cancel out the earlier PUSH.
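For illustration, here is a minimal sketch of that compaction pass, assuming a hypothetical replay-log representation (the entry names do not match Sonic's internals):

```rust
use std::collections::HashSet;

#[derive(Clone, Debug)]
enum LogEntry {
    Push { bucket: String, object: String, text: String },
    FlushObject { bucket: String, object: String },
    FlushBucket { bucket: String },
}

// Walk the log backwards: once a FLUSHB/FLUSHO is seen, any earlier PUSH
// targeting that bucket/object is dead and can be dropped from the replay.
fn compact(log: &[LogEntry]) -> Vec<LogEntry> {
    let mut flushed_buckets: HashSet<String> = HashSet::new();
    let mut flushed_objects: HashSet<(String, String)> = HashSet::new();
    let mut kept: Vec<LogEntry> = Vec::new();

    for entry in log.iter().rev() {
        match entry {
            LogEntry::FlushBucket { bucket } => {
                flushed_buckets.insert(bucket.clone());
                kept.push(entry.clone());
            }
            LogEntry::FlushObject { bucket, object } => {
                flushed_objects.insert((bucket.clone(), object.clone()));
                kept.push(entry.clone());
            }
            LogEntry::Push { bucket, object, .. } => {
                let dead = flushed_buckets.contains(bucket)
                    || flushed_objects.contains(&(bucket.clone(), object.clone()));
                if !dead {
                    kept.push(entry.clone());
                }
            }
        }
    }
    kept.reverse();
    kept
}

fn main() {
    let log = vec![
        LogEntry::Push { bucket: "messages".into(), object: "conv:1".into(), text: "hello".into() },
        LogEntry::FlushObject { bucket: "messages".into(), object: "conv:1".into() },
    ];
    // The PUSH is cancelled out by the later FLUSHO, so only the flush remains.
    assert_eq!(compact(&log).len(), 1);
}
```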

A built-in Sonic clustering protocol could use database 'snapshots' to handle recoveries (I think RocksDB natively supports this), and only re-synchronize the latest part that's been lost on a given node due to downtime. I'll read more on how Redis and others do this as they have native clustering.
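On the snapshot side, RocksDB exposes checkpoints, which are consistent copies of a live database. A rough sketch, assuming the `rocksdb` (rust-rocksdb) crate's checkpoint API and hypothetical paths that do not match Sonic's actual store layout:

```rust
use rocksdb::checkpoint::Checkpoint;
use rocksdb::{Options, DB};

fn snapshot_store() -> Result<(), rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "/var/lib/sonic/store/kv/default/some_bucket")?;

    // A checkpoint is a consistent, mostly hard-linked copy of the live
    // database that could be shipped to a recovering node as a seed.
    let checkpoint = Checkpoint::new(&db)?;
    checkpoint.create_checkpoint("/var/lib/sonic/snapshots/some_bucket-0001")?;
    Ok(())
}
```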

valeriansaliou avatar May 24 '19 14:05 valeriansaliou

A nice side effect of clustering is that one could build a large Sonic-based infrastructure, with several "ingestor" nodes and several "search" nodes.

valeriansaliou avatar May 24 '19 14:05 valeriansaliou

I like the things you are proposing!

In any shape, clustering for fault tolerance is a requirement for production readiness – at least for us.

May I suggest re-opening this issue, even just as a roadmap item?

toxuin avatar May 24 '19 15:05 toxuin

Sure, leaving it open now.

valeriansaliou avatar May 24 '19 15:05 valeriansaliou

Possibly relevant: Rocksplicator (article, repo).

I don't know how directly useful this would be as-is, but given that it's billed as a set of libraries for clustering RocksDB-backed services, I'd imagine at the very least there are some good lessons learned re: clustering strategy here.

benjamincburns avatar Jun 02 '19 22:06 benjamincburns

I think the Raft consensus algorithm would make a great addition to Sonic.

Actix Raft could be integrated with Sonic to add clustering capability.

The distributed search idea could also be borrowed from Bleve, which is used by Couchbase.

SINHASantos avatar Sep 16 '19 18:09 SINHASantos

We see many products built on RocksDB use Raft successfully (CockroachDB, ArangoDB, etc.).

SINHASantos avatar Sep 16 '19 18:09 SINHASantos

For simplicity's sake, the replication protocol will be inspired by the Redis replication protocol: a primary write master, followed by N read slaves (i.e. read-only). If the master goes down, reads are still available on the slaves but writes are rejected. When the master is recovered, slaves catch up to the master binlog and writes can be resumed by the connected libraries (possibly from a queue).
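A minimal sketch of that routing rule follows; the types are hypothetical and this is not Sonic's actual replication code. The point is that a write issued against a slave fails fast, so client libraries can queue and retry instead of silently diverging:

```rust
#[derive(PartialEq)]
enum Role {
    Master,        // accepts writes, replicates to slaves
    ReadOnlySlave, // serves reads only
}

enum Command {
    Query, // QUERY / SUGGEST
    Push,  // PUSH / POP / FLUSH*
}

fn handle(role: &Role, cmd: Command) -> Result<&'static str, &'static str> {
    match cmd {
        // Reads are always served locally, even while the master is down.
        Command::Query => Ok("RESULT"),
        // Writes are only accepted on the master; slaves reject them so
        // client libraries can queue writes and retry once the master is
        // back and the slaves have caught up on its binlog.
        Command::Push => {
            if *role == Role::Master {
                Ok("OK")
            } else {
                Err("ERR read-only replica")
            }
        }
    }
}

fn main() {
    assert!(handle(&Role::ReadOnlySlave, Command::Push).is_err());
    assert!(handle(&Role::ReadOnlySlave, Command::Query).is_ok());
}
```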

valeriansaliou avatar Jun 19 '20 10:06 valeriansaliou

My two cents: you may consider the CRDT model used by Redis Enterprise's Active-Active replication (a minimal sketch follows below this list).

  1. Active-Active database replication ensures writes are never rejected.
  2. All the servers are used, so you get the most out of every server.
  3. Since there are no transactions and it is only search data, there is no real risk of data loss.
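For illustration only, here is a minimal state-based CRDT sketch: a grow-only set of object/term pairs where both replicas accept writes locally and converge by merging. This is not a Sonic design, and real deletions would need something like an OR-set:

```rust
use std::collections::HashSet;

#[derive(Default, Clone)]
struct IndexGSet {
    entries: HashSet<(String, String)>, // (object, term)
}

impl IndexGSet {
    fn push(&mut self, object: &str, term: &str) {
        self.entries.insert((object.to_string(), term.to_string()));
    }

    // Merge is a set union: commutative, associative, and idempotent, so the
    // order in which replicas exchange state does not matter.
    fn merge(&mut self, other: &IndexGSet) {
        for e in &other.entries {
            self.entries.insert(e.clone());
        }
    }
}

fn main() {
    let mut replica_a = IndexGSet::default();
    let mut replica_b = IndexGSet::default();
    replica_a.push("conv:1", "hello"); // write accepted locally on A
    replica_b.push("conv:2", "world"); // write accepted locally on B
    replica_a.merge(&replica_b);
    replica_b.merge(&replica_a);
    assert_eq!(replica_a.entries, replica_b.entries); // replicas converge
}
```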

SINHASantos avatar Nov 07 '20 18:11 SINHASantos

As a stop-gap workaround for missing clustering functionality, would it be possible to use a distributed file system for synchronizing changes between multiple Sonic instances and then (via a setting, somewhere) configure only one of the Sonic instances to have write access, leaving the other ones to be read-only?

I don't know if it would work to have multiple Sonic instances reading from the same files on disk, and what would happen to one Sonic instance if another instance writes to the files - but if that part works, it should "only" be a matter of implementing a read-only setting, I think.

Tenzer avatar Oct 24 '22 13:10 Tenzer

That would unfortunately not work as RocksDB holds a LOCK file on the file system. I'm unsure whether RocksDB can be started in read-only mode, ignoring any LOCK file. Can you provide more references on that? If it's easily done, then it's a quick win.

However, properly done clustering needs full data replication, meaning that a second independent server holds a full copy of the data on a different file system. That way, if the master fails, the slave can be elected as master, and the old master can be resynchronized from the old slave in case of total data loss on one replica.

valeriansaliou avatar Oct 24 '22 13:10 valeriansaliou

It looks like RocksDB can be opened in read-only mode. There's a page here that specifically talks about this: https://github.com/facebook/rocksdb/wiki/Read-only-and-Secondary-instances.
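For reference, here is a rough sketch of opening a follower store that way, assuming the `rocksdb` (rust-rocksdb) crate's secondary-instance API described on the wiki page above; the paths are hypothetical and not Sonic's actual store layout:

```rust
use rocksdb::{Options, DB};

fn open_follower() -> Result<DB, rocksdb::Error> {
    let opts = Options::default();
    // A secondary instance reads the primary's data files (e.g. over a
    // shared/distributed file system) while keeping its own directory
    // for its info logs, so it does not fight over the primary's LOCK.
    let db = DB::open_as_secondary(
        &opts,
        "/var/lib/sonic/store/kv/default/some_bucket",           // primary's path
        "/var/lib/sonic-secondary/kv/default/some_bucket",       // secondary's own dir
    )?;
    // Periodically replay the primary's new WAL entries to stay fresh.
    db.try_catch_up_with_primary()?;
    Ok(db)
}
```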

Tenzer avatar Oct 24 '22 13:10 Tenzer

Excellent, that can work then.

valeriansaliou avatar Oct 24 '22 13:10 valeriansaliou

Maybe support databases like TiDB (built on RocksDB) and build on TiDB's clustering/replication features.

SINHASantos avatar Oct 26 '22 11:10 SINHASantos

Would it be worth it to move this to the 1.5.0 milestone? I am open to taking a look at this and sending a PR based on this comment.

charandas avatar Jan 24 '23 22:01 charandas

Sure! Totally open to moving this to an earlier milestone. Open to PRs on this, thank you so much!

valeriansaliou avatar Jan 24 '23 22:01 valeriansaliou

I looked at the code a bit this evening. As I understand it, all the DB references (whether KV or FST) are opened lazily using the acquire functions in StoreGenericPool which the specific implementations call when there is an executor action requiring the use of the store. Tasker also accesses the stores lazily when doing its tasks.

I am assuming it's straightforward enough to utilize RocksDB secondary replicas, as they allow catch-up. When I say read-only below, I mean the Sonic replica, not the RocksDB read-only replica.

My main question for now has to do with syncing the FST store: that doesn't come for free like RocksDB syncing does. The executor currently keeps the KV and FST stores in sync by performing pushes/pops/other operations on both stores in tandem. The writer will continue to do so, but for readers, should there be an evented system to broadcast all successfully completed write operations to them? On the one hand, this approach sounds really complicated to implement; on the other, I wonder if constructing the FST by looking up all the keys in the synced KV store is an option on the reader's side (see the sketch below)?
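A rough sketch of that rebuild-from-KV idea, using the `fst` crate that Sonic already depends on; `collect_terms` is a hypothetical stand-in for iterating the replicated RocksDB term keys:

```rust
use fst::SetBuilder;

fn collect_terms() -> Vec<String> {
    // In reality: iterate the synced RocksDB instance and extract the
    // indexed words for this bucket.
    vec!["hello".to_string(), "world".to_string()]
}

fn rebuild_fst() -> Result<fst::Set<Vec<u8>>, fst::Error> {
    let mut terms = collect_terms();
    // SetBuilder requires keys to be inserted in lexicographic order.
    terms.sort();
    terms.dedup();

    let mut builder = SetBuilder::memory();
    for term in &terms {
        builder.insert(term)?;
    }
    let bytes = builder.into_inner()?;
    fst::Set::new(bytes)
}
```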

Anyone else here who has thoughts about this? (I am a newbie 😄 in both Rust and Sonic).

charandas avatar Jan 25 '23 03:01 charandas

Hello there,

On the FST thing, as they are guaranteed to be limited to a maximum size per FST, we could think about having a lightweight replication protocol work out insertions/deletions in the FST. If the replication loses sync, we could stream the modified FST files from the master disk to the slave disk, using checksums for comparison. Note however that, to guarantee integrity, this would also require sending the not-yet-consolidated FST operations to the slave, so that the next CONSOLIDATE on master and slave results in the same FST file being updated with the same pending operations.
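Purely as an illustration of the checksum comparison step (std's DefaultHasher stands in for a real checksum such as xxHash or SHA-256):

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;

fn fst_checksum(path: &str) -> io::Result<u64> {
    // Hash the on-disk FST file so master and slave can compare versions.
    let bytes = fs::read(path)?;
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    Ok(hasher.finish())
}

fn needs_stream(master_sum: u64, slave_sum: u64) -> bool {
    // The slave only fetches the full FST file when checksums diverge; the
    // pending (not-yet-consolidated) operations must be shipped alongside so
    // the next CONSOLIDATE produces identical files on both sides.
    master_sum != slave_sum
}
```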

Does that make sense?

valeriansaliou avatar Jan 28 '23 14:01 valeriansaliou