Reduce intra cluster conflicts
Overview
CouchDB issues all write requests in parallel without coordination, applying a quorum on the results of those independent actions. When updating a document concurrently this can lead to the introduction of a stored conflict if two different writes reach separate nodes first. This is undesirable.
This patch changes fabric_doc_update in the following ways;
- Workers are no longer started immediately, but are given a unique reference each.
- For each range in the write request, one node is chosen to "lead" the write decision (calculated as the lowest live node that hosts the shard range)
- "Leader" workers are started.
- Any doc update that receives "conflict" from a Leader is added to the reply dict W times and the doc updates are removed from the other (unstarted) workers. If that leaves the worker with nothing to do, it is removed entirely.
Testing recommendations
There is some existing coverage in the module itself but more testing is needed before this can be merged.
Related Issues or Pull Requests
Checklist
- [x] Code is written and works correctly
- [ ] Changes are covered by tests
- [ ] Any new configurable parameters are documented in
rel/overlay/etc/default.ini - [ ] Documentation changes were made in the
src/docsfolder - [ ] Documentation changes were backported (separated PR) to affected branches
noting need for a liveness check at the start (so the 'leader' is always a live node as of the invocation) and also to consider maintenance mode (receipt of that message from a 'leader' specifically)
I put the acc introduction in its own commit for exactly that reason and intend to keep those separate commits when merging.
I put the
accintroduction in its own commit for exactly that reason and intend to keep those separate commits when merging.
That would look nice when merging, however since a bunch of notes and comments are related to just the acc change may still look nice to do a preliminary cleanup / #acc{} record PR and keep the the overall discussion concerning it in that PR then move on the main one.
it's hard enough to get CI to run for one pull request. I'll do it this time but I really don't like this idea at all. I separated out the refactoring from the functional change within the PR and it makes little sense to commit one without the other.
https://github.com/apache/couchdb/pull/5385 for acc record
it's hard enough to get CI to run for one pull request. I'll do it this time but I really don't like this idea at all. I separated out the refactoring from the functional change within the PR and it makes little sense to commit one without the other.
I think it's worthwhile. At least in this case the test failures do seem related to bulk_docs:
07:52:20 7) test bulk docs emits conflict error for duplicate doc `_id`s (BulkDocsTest)
07:52:20 test/elixir/test/bulk_docs_test.exs:124
07:52:20 Expected 201 and the same number of response rows as in request, but got
07:52:20 %HTTPotion.Response{
07:52:20 status_code: 500,
07:52:20 body: [
07:52:20 %{
07:52:20 "error" => "conflict",
07:52:20 "id" => "0",
07:52:20 "reason" => "Document update conflict."
07:52:20 },
07:52:20 %{
07:52:20 "error" => "conflict",
07:52:20 "id" => "1",
07:52:20 "reason" => "Document update conflict."
07:52:20 },
07:52:20 %{
07:52:20 "error" => "conflict",
07:52:20 "id" => "1",
07:52:20 "reason" => "Document update conflict."
07:52:20 },
07:52:20 %{
07:52:20 "error" => "error",
07:52:20 "id" => "3",
07:52:20 "reason" => "internal_server_error"
07:52:20 }
07:52:20 ],
I keep an eye on flaky failures in the CI and this not a test that usually fails. internal_server_error seems worrying there.
for sure, that indicates a bug here. It took 8 runs locally to get an internal_server_error so this will be fun to track down, but it needs to be found