
Add RFC on sharded changes

Open kocolosk opened this issue 3 years ago • 7 comments

Opening a PR for discussion on this RFC

kocolosk avatar Apr 06 '21 22:04 kocolosk

Thanks for the read Paul. The Intro was meant to convey the goals of the proposal, in priority order:

This proposal is designed to improve indexing throughput, reduce hot spots for write-intensive workloads, and offer a horizontally-scalable API for consumers to process the change capture feed for an individual database in CouchDB 4.0.

That said, I totally spaced out on the ebtree data structure here; I was still thinking of the model where view KVs are stored as KVs directly in FDB. So yeah, this is spot-on:

We're still left with the fact that writes to the view ebtree aren't currently parallelizable. If we want to make view updates faster, the first thing I'd investigate is how to upgrade ebtree to allow update-in-place semantics with the MVCC semantics that FDB offers.

kocolosk avatar Apr 07 '21 22:04 kocolosk

This seems to have been dropped; can someone confirm? I stand by my comment above, fwiw: adding sharding back after all the effort to switch to FDB to get away from it is regrettable. I could only countenance this in CouchDB 4.0 if there is no alternative and it is demonstrated to be essential.

rnewson avatar Jul 13 '21 15:07 rnewson

@rnewson I think it's important to note that we are "only" partitioning the changes feed subspace -- not primary data nor the indexes themselves. IIRC, indexes were the primary problem in the 3.0 scheme because of the ordered scatter-gather we needed to perform. I'm not convinced that partitioning/sharding the changes index space is a bad thing, at least with regard to the performance problems that shards caused in 3.0.

I'd be curious what @kocolosk thinks about whether partitioning the changes feed brings back the bad stuff from CouchDB 3.0?

mikerhodes avatar Jul 13 '21 15:07 mikerhodes

Agree with @rnewson. Even if we switch the index storage format to allow parallelizable updates, adding a static Q would be a step back, it seems.

One issue is at the user/API level: we'd bring back Q, which we didn't want to have to deal with any more once FDB became the backend. And in the code, we just removed the sharding logic from fabric; I am not too excited about bringing parts of it back unless it's a last resort and nothing else works. We could invent some auto-sharding, of course, but that would add even more complexity.

It seems we'd also want to separate changes feed improvements from indexing improvements a bit better. Could we speed up indexing without a static Q sharding of the changes feed, with all the API changes, hand-written resharding code (epochs), and hard-coded values that involves?

I think we can, if we invent a new index structure that allows parallelizable updates -- say, an inverted JSON index for Mango queries based on https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20171020_inverted_indexes.md.
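A minimal sketch of what makes an inverted JSON index parallelizable, in Python for illustration: each document flattens into independent (path, value) keys, so writes from different docs never contend on a shared tree node. The function names and key layout here are hypothetical, not CouchDB or CockroachDB APIs.

```python
# Sketch: flatten a JSON-like document into (path, value) pairs and
# turn each pair into an independent index entry pointing at the doc id.
# Because every entry is a separate key, updates are write-parallelizable.
# All names are illustrative, not real CouchDB internals.

def flatten(doc, path=()):
    """Yield (path, scalar_value) pairs for a JSON-like document."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from flatten(value, path + (key,))
    elif isinstance(doc, list):
        for item in doc:
            yield from flatten(item, path)  # list items indexed under the same path
    else:
        yield (path, doc)

def index_entries(doc_id, doc):
    """Inverted-index entries for one document: ((path, value), doc_id)."""
    return [((path, value), doc_id) for path, value in flatten(doc)]

doc = {"name": "alice", "tags": ["admin", "ops"], "age": 42}
for key, did in index_entries("doc-1", doc):
    print(key, "->", did)
```

A Mango query like `{"tags": "admin"}` would then become a point lookup on the `(("tags",), "admin")` key range rather than a traversal of a single shared ebtree.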

The idea I had was to use the locality API to split the _changes feed into sub-sequences, and either start a separate couch_jobs job (or just processes under a single couch_jobs indexer) for each to fetch docs, process them, and write to the index in parallel. So if the _changes sequence looks like [10, 20, 25, 30], the locality API might split it as [10, 20], [25, 30], and two indexers would index those ranges in parallel. In the meantime, the doc at sequence 20 could be updated and would now appear at sequence 35; we'd then catch up from 35 to the next db sequence, and so on.

The benefit is avoiding a static Q altogether. The downside is that it would only work for a write-parallelizable index, and only if we "hide" the index being built in the background from queries, since it would look quite odd while building out of changes-feed order. Once it's built, if we can update the index transactionally, we'd get consistent reads on it.
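A toy model of the scheme above, to make the split/parallel-build/catch-up phases concrete. Everything here is simulated: the real FDB locality API returns shard boundary keys, not sequence lists, and `index_range` stands in for fetching docs and writing index entries.

```python
# Toy model: a hypothetical locality split divides the _changes range
# into contiguous sub-ranges, one worker indexes each range in parallel,
# and a final catch-up pass indexes anything written during the build.
from concurrent.futures import ThreadPoolExecutor

def locality_split(seqs, nshards):
    """Pretend shard boundaries: chop the sequence list into nshards runs."""
    size = max(1, len(seqs) // nshards)
    return [seqs[i:i + size] for i in range(0, len(seqs), size)]

def index_range(seq_range, index):
    for seq in seq_range:
        index.add(seq)  # stand-in for: fetch doc, process, write to index

changes = [10, 20, 25, 30]           # current _changes sequences
ranges = locality_split(changes, 2)  # -> [[10, 20], [25, 30]]
index = set()
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(index_range, r, index) for r in ranges]
    for f in futures:
        f.result()
# Meanwhile the doc at seq 20 was updated and reappeared at seq 35,
# so a catch-up pass indexes from there up to the current db sequence:
index_range([35], index)
print(sorted(index))  # -> [10, 20, 25, 30, 35]; the seq-20 entry is now stale
```

The stale entry for sequence 20 is exactly why the building index would have to be hidden from queries until the catch-up pass converges.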

nickva avatar Jul 13 '21 16:07 nickva

It sounds like we've not exhausted the options other than re-adding sharding just yet. Mike, your point on multi-doc updates is a good one; perhaps that needs discussing before we commit to this path.

The essential problem with multi-doc transactions is replication. Specifically, it would be a poor (or at least surprising) implementation of multi-doc transactions if those updates were not applied atomically when replicated, even though that would be by far the easiest way to do it. Any form of replication enhancement to support this seems to carry a significant scalability penalty. Not everybody would use multi-doc transactions, and those that did wouldn't necessarily use them for all updates, but building something we know will work poorly if used 'too much' is not a good idea (particularly this thing, multi-doc txn, which we've explicitly prohibited for scalability reasons this whole time).

A future multi-doc txn option would complicate the work described in this RFC but doesn't seem to break it (though it diminishes its performance benefits, all the way to zero in some cases).

Setting multi-doc txn aside, what are the limits of the current (relatively simple) changes implementation? How fast can we read the changes feed today? Can indexing read the existing changes feed from multiple places in parallel at runtime and still gain a performance boost?

rnewson avatar Jul 13 '21 17:07 rnewson

Yeah, transactions in the changes feed are a veritable "can of worms" with or without sharding, so @mikerhodes I think you were correct to call that one out of scope for this discussion. The last time I thought hard about it I came to the same conclusion that you did, which is that we'd want to build a separate transaction log rather than trying to turn the changes feed into said log.

@nickva I can see where the locality API would be a useful way to parallelize the initial index build while minimizing the load on FDB. Of course it doesn't help much with DBs that see a high write QPS in steady-state, since the core data model still concentrates all writes to the seq index in a single FDB shard. I'll freely admit that the write hot spot may not be the most urgent problem to solve; I think my rationale for the proposal was in part the idea of taking out a few birds with a single stone and providing a well-defined scale out story for other external _changes consumers.

I don't think I have the time at the moment to help implement this, and won't be offended if folks want to close it. We can always revisit later.

kocolosk avatar Jul 13 '21 22:07 kocolosk

@kocolosk Good point, the locality trick would be useful internally to, say, process the changes feed for indexing, but wouldn't help with write hotspots. The design of the _changes feed external API is pretty neat, and I think it may be worth going that way eventually, but perhaps with an auto-sharding setup so that users don't have to think about Q at all.

Found a description of how the FDB backup system avoids hot write shards: https://forums.foundationdb.org/t/keyspace-partitions-performance/168/2. Apparently it's based on writing to (hash(version/1e6), version) key ranges, striking a balance between still being able to query ranges and avoiding writing more than about one second of data (by default versions advance at a rate of about 1e6 per second) to any one shard at a time. Not sure yet if that's an idea we can borrow directly, but perhaps there is something there...
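A small sketch of that keying trick, with an illustrative bucket count and Python's built-in hash standing in for whatever the backup system actually uses: writes within one second share a version/1e6 prefix, so consecutive seconds land on different key ranges, and a reader reassembles order with one range scan per bucket.

```python
# Sketch of hash-prefixed version keys: key = (hash(version // 1e6), version).
# At most ~1 second of writes (versions advance ~1e6/sec) share a prefix,
# so the write hot spot rotates across shards. BUCKETS and the hash are
# illustrative choices, not what FDB backup actually uses.
BUCKETS = 16

def bucketed_key(version):
    bucket = hash(version // 1_000_000) % BUCKETS
    return (bucket, version)

def read_window(store, lo, hi):
    """Reassemble one version window in order: one scan per bucket, then merge."""
    hits = []
    for bucket in range(BUCKETS):  # stands in for one range read per prefix
        hits.extend(v for (b, v) in store if b == bucket and lo <= v < hi)
    return sorted(hits)

store = {bucketed_key(v) for v in (999_999, 1_000_000, 1_500_000, 2_000_001)}
print(read_window(store, 0, 3_000_000))  # -> [999999, 1000000, 1500000, 2000001]
```

The cost of the trick is visible in `read_window`: an ordered read becomes a small scatter-gather over all buckets, which is presumably why it isn't a drop-in fit for the _changes feed.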

Regarding the changes feed being a bottleneck for indexing: we did a quick and dirty test reading 1M and 10M changes on a busy cluster (3 storage nodes) and got about 58-64k rows/sec with just an empty accumulator that counts rows.

{ok, Db} = fabric2_db:open(<<"perf-test-user/put_insert_1626378013">>, []).
Fun = fun(_Change, Acc) -> {ok, Acc + 1} end.

([email protected])6> timer:tc(fun() -> fabric2_db:fold_changes(Db, 0, Fun, 0, [{limit, 1000000}]) end).
{16550135,{ok,1000000}}

([email protected])12> timer:tc(fun() -> fabric2_db:fold_changes(Db, 0, Fun, 0, [{limit, 10000000}]) end).
{156290604,{ok,10000000}}
....
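For reference, `timer:tc/1` reports elapsed time in microseconds, so the two runs above work out to roughly 60k and 64k rows/sec:

```python
# Convert the timer:tc microsecond timings quoted above into rows/sec.
for micros, rows in [(16_550_135, 1_000_000), (156_290_604, 10_000_000)]:
    rate = rows / (micros / 1_000_000)
    print(f"{rows} rows in {micros / 1e6:.1f}s -> {rate / 1000:.1f}k rows/sec")
```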

For indexing, at least, that doesn't seem too bad. We'd probably want to find a way to parallelize doc fetches and, most of all, make index updates concurrent.

nickva avatar Jul 16 '21 23:07 nickva

I'm closing this ticket due to the migration of the docs from apache/couchdb-documentation to apache/couchdb. Please create a new PR in apache/couchdb.

big-r81 avatar Oct 17 '22 10:10 big-r81