Race condition causing slow responses to `PUT /db/doc` with `multipart/related` attachments
We have observed a cluster exhibiting a race condition in how `PUT /db/doc` with `Content-Type: multipart/related` transfers attachment data to backend nodes. If two such requests happen concurrently, it is possible for the second request to hit a receive timeout that blocks the request from responding for 5 minutes.
This issue was observed on replication requests with `new_edits=false`, and part of the code path for this request differentiates between `replicated_changes` and `interactive_edit`. However, we don't believe this problem is specific to replication requests.
We do not believe any data is corrupted by this interaction, but this issue highlights some paths that are worth checking for data races.
## Relevant code
The following represents the tree of function calls that are relevant to this
issue; this trace references CouchDB v3.2.1. All-caps names indicate a process
boundary; the HTTPD, PARSER and FABRIC processes execute on the
coordinator node, while WRITER executes on the backend nodes holding the DB’s
shard files.
- (HTTPD) `chttpd_db:db_doc_req(#httpd{method='PUT', user_ctx=Ctx}=Req, Db, DocId)`
  - `couch_doc:doc_from_multi_part_stream`
  - `chttpd_db:send_updated_doc`
    - `chttpd_db:update_doc`
      - `spawn_monitor` (FABRIC)
        - `fabric:update_doc`
          - `fabric:update_docs`
            - `fabric_doc_update:go`
              - `fabric_doc_update:handle_message`
              - `fabric_rpc:update_docs` (WRITER; via `rexi:cast`)
## Cluster setup
The cluster in question consists of 6 nodes spread over 2 data centres, with 3
nodes in each location. The database shard targeted by the PUT request is
replicated on 3 nodes, not all of which are in the same data centre.
Ping latency between nodes in the same data centre is under 0.2ms, while between nodes in different data centres it's around 8ms.
## Observations
By instrumenting the above code paths with more logging, we have observed the cluster executing the following sequence of events. These events happen over two requests, both of which are `PUT /db/doc` for the same doc ID and revision and have attachments encoded as `multipart/related`.
- Request 1 is received by HTTPD on the coordinator node, which we'll call node `A`. This hits the `multipart/related` clause; `W` is 2 and `num_mp_writers` is set to 3. The shard targeted by the write is stored on 3 other nodes, which we'll call `B`, `C` and `D`.
- This calls `couch_doc:doc_from_multi_part_stream`, which calls `couch_httpd_multipart:decode_multipart_stream`, which starts the PARSER process. This process uses a closure as a state machine to invoke `mp_parse_doc`, `mp_parse_atts` and `maybe_send_data`. HTTPD sends `get_doc_bytes` to PARSER, which replies with `doc_bytes`. This causes the attachments to be converted into `{follows, Parser, Ref}` tuples.
- Meanwhile, HTTPD starts the FABRIC process, which calls through to `fabric_doc_update:go`, which uses `rexi:cast` to forward the incoming doc to `fabric_rpc:update_docs` in the WRITER process. This converts `{follows, Parser, Ref}` into a `fun/0`, which is ultimately used in `flush_data` via `write_streamed_attachment` and `read_next_chunk`.
- This `fun/0` sends `get_bytes` to PARSER, which can handle it in several places: 1, 2, 3. (There is a lot of code duplication between `mp_parse_atts` and `maybe_send_data` that includes prioritising `get_bytes` over other messages.) PARSER replies with `bytes`. The code around this reply manages a buffer of attachment data; once a chunk has been read by all the WRITER procs, it is discarded. The parser expects 3 WRITER processes to request the data because `num_mp_writers` is set to 3. (A toy model of this exchange appears after this list.)
- When WRITER on nodes `B`, `C` or `D` receives `bytes`, it sends `attachment_chunk_received` to FABRIC on node `A`, which handles it in `fabric_doc_update`.
- We observe one of the nodes, say node `B`, doing several rounds of requesting data from PARSER and acking it with `attachment_chunk_received` before the other two nodes `C` and `D` request anything. PARSER behaves correctly, tracking which writers are waiting for data and how far through the stream each writer has got; it does not prematurely drop chunks as we previously observed in the "middleman" attachment handler. It appears `B` is consuming the attachment much faster than `C` and `D`.
- While `C` and `D` are still fetching the attachment, we observe an `ok` message from node `B` as it has finished reading the attachment and completed the write. So we have had one ack, but we need one more to achieve quorum.
- At this point an identical request, Request 2, is received by another HTTPD process, also running on node `A`. Everything proceeds as above, spawning PARSER and FABRIC processes. The PARSER again sees `num_mp_writers` is 3 and so expects 3 processes to fetch the data before it drops the buffered chunks.
- In request 2, the first message that FABRIC receives is an `ok` from node `B`. We observe PARSER receiving `get_bytes` from nodes `C` and `D`, and those nodes send `attachment_chunk_received` to FABRIC. However, node `B` never requests anything from PARSER, and no other signal is received by PARSER that would tell it that one of the writers already has the attachment data and doesn't need to fetch it. No writer process is started on `B`, so there isn't even a `DOWN` signal that PARSER could detect.
- Because of this, PARSER ends up waiting for a `get_bytes` message that will never arrive, and eventually hits an `after 300000` timeout in its `receive` blocks in `mp_parse_atts` or `maybe_send_data`.
- Up in `chttpd_db`, `send_updated_doc` returns quickly; request 2 still achieves quorum from the rapid `ok` from node `B` and an eventual ack from one of the other nodes. However, we then call `WaitFun()`, which waits for PARSER to exit. This blocks request 2 from returning for 5 minutes, even though the write has completed correctly.
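To make that failure mode concrete, here is a small, self-contained Erlang model of the buffering protocol as described above. This is not CouchDB's code (the real logic lives in `couch_httpd_multipart` and is considerably more involved); only the `get_bytes`/`bytes`/`attachment_chunk_received` message names and the writer count are taken from the observations, everything else is invented for illustration.

```erlang
-module(mp_parser_model).
-export([demo/1]).

%% Toy stand-in for PARSER: one buffered chunk, and a count of how many
%% writers are still expected to ask for it with get_bytes. If a writer
%% never asks, the receive falls through to the same kind of
%% `after 300000` (5 minute) timeout described above.
parser_loop(_Chunk, 0) ->
    ok;
parser_loop(Chunk, WritersLeft) ->
    receive
        {get_bytes, From} ->
            From ! {bytes, Chunk},
            parser_loop(Chunk, WritersLeft - 1)
    after 300000 ->
        exit(attachment_request_timeout)
    end.

%% Toy stand-in for a WRITER on a backend node: fetch the chunk, then ack
%% to the coordinator, mirroring the attachment_chunk_received message.
writer(Parser, Coordinator) ->
    Parser ! {get_bytes, self()},
    receive
        {bytes, _Chunk} -> Coordinator ! {attachment_chunk_received, self()}
    end.

%% demo(3) returns `normal` almost immediately: all three expected writers
%% request the chunk. demo(2) models request 2 above, where node B never
%% spawns a writer: the parser waits five minutes and then dies with
%% `attachment_request_timeout`.
demo(WritersThatShowUp) ->
    Coordinator = self(),
    {Parser, Ref} = spawn_monitor(fun() -> parser_loop(<<"chunk">>, 3) end),
    [spawn(fun() -> writer(Parser, Coordinator) end)
     || _ <- lists:seq(1, WritersThatShowUp)],
    receive {'DOWN', Ref, process, Parser, Reason} -> Reason end.
```

In the real parser the buffer holds multiple chunks, tracks each writer's position, and prioritises `get_bytes` over other messages; none of that changes the basic accounting shown here.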
## Comments
This situation is made more likely by the layout of the cluster; node B is in
the same data centre as node A and so has much lower latency than nodes C
and D. So it's likely to complete the write much quicker than the other nodes,
leaving a window where request 1 has one ack and doesn't have quorum yet when
request 2 starts.
Request 2 gets into this state because request 1 has partially succeeded: one of
the nodes has completed the write but we don't have quorum yet. It is worth finding
out what would happen in request 2 if two nodes had completed the write for request 1
when request 2 starts; even though this is enough for quorum, `WaitFun()`
causes request 1 not to return until all 3 nodes have read the attachment or
timed out. But two acks are enough for quorum, so it might put the doc in a
different state such that request 2 doesn't even try to write anything or
forward the attachment data.
For replication requests, this would not result in data loss; it's more of an efficiency concern. Replication requests are allowed to add revs anywhere in the tree, so the second request should just be a no-op. For interactive edits, the first request will change what the leaves of the rev tree are and might change the winning rev, so it could cause request 2 to be rejected -- again, we should examine what effect partial success of request 1 has here.
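As a rough illustration of that difference, here is a toy decision table (not CouchDB's `couch_key_tree` logic; all names are invented) for re-submitting a revision that a shard copy has already seen:

```erlang
-module(update_type_model).
-export([outcome/3]).

%% Toy decision table only. Revs are opaque terms; KnownRevs/Leaves describe
%% one shard copy's rev tree.
outcome(replicated_changes, Rev, KnownRevs) ->
    case lists:member(Rev, KnownRevs) of
        true  -> no_op;    %% rev already in the tree: the write changes nothing
        false -> merged    %% otherwise the edit branch is merged into the tree
    end;
outcome(interactive_edit, BaseRev, Leaves) ->
    case lists:member(BaseRev, Leaves) of
        true  -> written;  %% base rev is still a leaf: update accepted
        false -> conflict  %% base rev has been superseded: request rejected
    end.
```

In the scenario above both requests carry the same revision, so with `new_edits=false` the second write lands in the `no_op` branch, whereas an interactive resubmission would land in `conflict` once the first write has gone through.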
Finally, we note that we currently have at least two methods for proxying
attachment data to backend nodes: the `follows` format from a
`multipart/related` request, and the `fabric_attachment_receiver` format
with the "middleman" process. There are also two different ways of invoking the
intermediate funs depending on whether the attachment length is known or
unknown. There's a lot of duplicate code for dealing with buffering the
attachment data to multiple backends, and we recently discovered and fixed a bug
in the "middleman" implementation, also discovered on this cluster. It may
be worth designing a single implementation for this buffer system.
That's quite the report, thanks! When you say
> These events happen over two requests, both of which are `PUT /db/doc` for the same doc ID and revision
are you referring to the base revision against which the update is being applied? If so, isn't this bit surprising?
> In request 2, the first message that FABRIC receives is an `ok` from node `B`.
Without pulling up all the code modules, I would have guessed that node B would reject request 2 as an edit conflict. Or were you running with `new_edits=false` here?
> are you referring to the base revision against which the update is being applied? If so, isn't this bit surprising?
This is a function of the client app, where these two requests can happen concurrently from two different clients hitting the same exact coordinating node. A round-robin request assignment would make this a lot less likely to occur.
> Or were you running with `new_edits=false` here?
Second para:
> This issue was observed on replication requests with `new_edits=false`, and part of the code path for this request differentiates between `replicated_changes` and `interactive_edit`. However, we don't believe this problem is specific to replication requests.
> are you referring to the base revision against which the update is being applied?
No, this is a replication request, so I mean that both are for the same document ID and contain the same `_revisions` value, i.e. both requests are uploading the same version of the document.
Ah, thanks! Then the ok response from node B makes more sense, but I can see where we'd still run into the same issue of the PARSER waiting around for a receiver that was never spawned in the interactive case as well.
I guess the expedient fix here might be for the FABRIC process to kill off the PARSER when it knows that a response has been received from every worker?
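A minimal sketch of that idea, with invented names and accounting (this is not the current `fabric_doc_update` code):

```erlang
-module(coordinator_cleanup_sketch).
-export([note_reply/2]).

%% Hypothetical: RepliesStillExpected counts workers the coordinator is
%% still waiting on. Once the last one has answered, nobody will ever send
%% get_bytes again, so the parser can be stopped rather than left to hit
%% its 5-minute receive timeout.
note_reply(Parser, 1) ->
    exit(Parser, kill),  %% or a graceful stop message the parser understands
    done;
note_reply(_Parser, RepliesStillExpected) ->
    {wait, RepliesStillExpected - 1}.
```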
I agree that this attachment buffering system is overly complex and poorly designed (speaking as one of the designers). It'd be a pretty interesting project to design a greenfield replacement application and then figure out how to integrate that into CouchDB.
@kocolosk I wouldn’t want to introduce a layering violation here, but could another option be for the receiving node to send a `no_bytes_will_ever_be_requested` message to PARSER so it can decrement the workers it waits for?
> send a `no_bytes_will_ever_be_requested` message to PARSER so it can decrement the workers it waits for?
@janl Yeah, I wonder if that would work. And we'd explicitly check for the `new_edits=false` case only.
@kocolosk what do you think, would that work? In the interactive case I'd think we'd get back a conflict, or would we have the exact same problem?
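In terms of the toy `parser_loop/2` sketched earlier, the proposal would look roughly like this; the message name is the one suggested above, and none of this is existing CouchDB code:

```erlang
%% Hypothetical extension of the earlier parser_loop/2 sketch: a worker (or
%% the coordinator on its behalf) announces that it will never request the
%% attachment, so the parser lowers the number of get_bytes requests it is
%% still waiting for instead of timing out.
parser_loop(_Chunk, 0) ->
    ok;
parser_loop(Chunk, WritersLeft) ->
    receive
        {get_bytes, From} ->
            From ! {bytes, Chunk},
            parser_loop(Chunk, WritersLeft - 1);
        no_bytes_will_ever_be_requested ->
            parser_loop(Chunk, WritersLeft - 1)
    after 300000 ->
        exit(attachment_request_timeout)
    end.
```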
@kocolosk This has prompted me to rethink how this would work for normal (new_edits != false) writes. For replication writes, both requests can succeed b/c replication writes can add revs anywhere in the tree; if the revs already exist, no problem.
For normal writes, one of these concurrent requests must fail, or you have a causality violation. Let's refer to the writer procs on B, C and D for requests 1 and 2 as B1, C1, D1 and B2, C2, D2. In our example above, B1 succeeds first, before request 2 begins. This means B2 must fail, because the rev param in these identical requests is no longer a leaf in node B's copy of the DB.
The rest of the processes C1, D1, C2, D2 are still ongoing and there's nothing forcing them to finish in any particular order. It should be the case that if C1 succeeds, then C2 fails and vice versa, same for the D writers. This in turn should mean that only one request achieves quorum, but it could be request 2, even if this is unlikely in general.
So this still should not result in a consistency violation, but it's worth seeing if there's any way this allows both requests to succeed.
Yes, that's correct @jcoglan. The question I've been mulling over for the past few minutes is whether an interactive edit conflict is detected by PARSER, or whether we have the same issue there because the attachment writers are never spawned. Should be pretty straightforward to test; in the worst case any multipart/related request that results in an edit conflict would hit the timeout, no data race required. But I was hoping to track down that logic in the codebase itself, and as you're well aware that's a challenging exercise for this bit of functionality.
I'm OK with the suggestion for the FABRIC coordinator process to inform the PARSER that a writer won't be requesting data. I'm still curious to understand what happens in the case of an interactive edit conflict.
Oh, and as to the consistency violation -- I strongly suspect we're immune, because the ultimate source of truth for whether the update is committed is in the normal couch_db_updater logic, where we've run all manner of tests stressing the correctness of the system under concurrent load. We could have plenty of cases where C1 and C2 are both writing attachment data to disk concurrently, but in the end the revision tree for a particular shard copy will only commit the update from one of them in interactive mode.
> The question I've been mulling over for the past few minutes is whether an interactive edit conflict is detected by PARSER, or whether we have the same issue there because the attachment writers are never spawned
The thing to check for here would be that if request 2 gets one failing response from a backend, b/c that backend already accepted request 1, but it goes on to receive ok from the other two backends, will it respond quickly? Or will it hit the same timeout because the backend that rejected the write is not requesting the attachment data?
This probably depends on at what point the backend decides to reject the write, and whether it does this before or after fetching the attachment bodies. If it does it before, it needs to signal to the parser that it's not going to send get_bytes.
It turns out that the backend does reject the write before fetching the attachment bodies, but `fabric:update_doc` will throw the edit conflict, which means the handler in `chttpd_db` skips the `WaitFun()` and instead calls `abort_multipart_stream`, which causes the parser to consume and discard all the remaining attachment data on the wire.
So, that covers the interactive case. I still suspect that the `new_edits=false` case does not require this concurrent data race in order to trigger the timeout. It would be enough to submit the same request twice in a row; as far as I can tell at the moment, the second request should hit the `WaitFun()` timeout.
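The control flow, reduced to a toy with invented helper names (not the actual `chttpd_db` code): `UpdateFun` stands for the update call that may throw the conflict, `WaitFun` for the wait on the parser, and `AbortFun` for the `abort_multipart_stream` call.

```erlang
-module(update_paths_sketch).
-export([respond_after_update/3]).

%% Two exit paths: on success we wait for the parser (which is where the
%% 5-minute stall can happen); on an edit conflict we skip the wait and
%% drain the remaining multipart data instead, leaving the connection clean.
respond_after_update(UpdateFun, WaitFun, AbortFun) ->
    try UpdateFun() of
        Result ->
            WaitFun(),
            Result
    catch
        throw:conflict ->
            AbortFun(),
            conflict
    end.
```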
I tried adding an integration test for this in e85e34359 but it doesn't do quite what you'd expect. I tried writing a doc with attachments and then saving that same revision again using `new_edits=false`. The client does receive a rapid response to both requests, but (I think) the attachment bytes are never consumed off the wire, and so the next request on the connection from a different part of the test suite hangs. The second `new_edits=false` multipart request never shows up in the logs because the test suite times out first and destroys the DB.
Perhaps this is also what you were observing in production?
At any rate, I added the test in part to demonstrate that the data race with the quorum logic might be a red herring. Any time a `multipart/related` request with `new_edits=false` is made with a revision that already exists in the database, the server does not properly clean up the connection.
That's a good test. If we run integration tests in n=1-only mode, this could reveal a separate issue where we drop the connection and never close it properly when attachments are ignored. I guess with n>1 it would happen as well, if all the nodes already have the revision, because we would never pull the data bytes off the socket. We had tried before to force-close such connections to prevent them from being recycled [1]. Whether the client notices the close quickly enough is another issue...
In the case where n>1 and some nodes read the attachment, for replicated_changes we would perhaps want to inspect the doc when we ignore it, and if it has any attachment parser processes, explicitly notify them when the update is dropped:
https://github.com/apache/couchdb/blob/e4b8a4624fc7ae09b4649aac9a8d68226208cd8b/src/couch/src/couch_db.erl#L1240-L1241
[1] https://github.com/apache/couchdb/blob/2c23d8e0fc8e44fa3f91228aa93002a9977e7221/src/couch/src/couch_httpd.erl#L988-L999 (I am afraid this never actually made it to the chttpd side!)
> I guess with n>1 it would happen as well, if all the nodes already have the revision, because we would never pull the data bytes off the socket.
Yes, I think so, although I'm now realizing that "pulling bytes off the socket" is not necessarily the issue in either the n=1 or n>1 case (with or without concurrent data races). The parser might well consume all the request data (and for small requests it probably does, even if no worker ever requests it), but the HTTPD process will still be stuck in WaitFun() after sending the response to the client. The connection stays open, and the client sends a new request on that connection, but HTTPD won't act on it until that 5 minute timeout fires.
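A toy model of that effect (not chttpd; the `io:format` calls just mark where the response has gone out and where the connection becomes usable again):

```erlang
-module(waitfun_sketch).
-export([demo/0]).

%% The "HTTPD" process has already sent its response, but it still owns the
%% socket and is parked inside WaitFun() until the parser exits, which (if
%% no writer ever asks for bytes) only happens via the parser's own
%% 5-minute timeout.
demo() ->
    Parser = spawn(fun() ->
        receive get_bytes -> ok
        after 300000 -> ok           %% nobody ever asks; give up after 5 min
        end
    end),
    WaitFun = fun() ->
        MRef = erlang:monitor(process, Parser),
        receive {'DOWN', MRef, process, Parser, _} -> ok end
    end,
    io:format("response already sent to the client~n"),
    WaitFun(),                       %% blocks here for ~5 minutes
    io:format("only now is the next request on this connection read~n").
```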
I think this issue is fixed. Closing.