Race condition causing slow responses to `PUT /db/doc` with `multipart/related` attachments
We have observed a cluster exhibiting a race condition in how `PUT /db/doc` with `Content-Type: multipart/related` transfers attachment data to backend nodes. If two such requests happen concurrently, it is possible for the second request to hit a receive timeout that blocks the request from responding for 5 minutes.
This issue was observed on replication requests with `new_edits=false`, and part of the code path for this request differentiates between `replicated_changes` and `interactive_edit`. However, we don't believe this problem is specific to replication requests.
We do not believe any data is corrupted by this interaction, but this issue highlights some paths that are worth checking for data races.
## Relevant code
The following represents the tree of function calls that are relevant to this
issue; this trace references CouchDB v3.2.1. All-caps names indicate a process
boundary; the HTTPD, PARSER and FABRIC processes execute on the
coordinator node, while WRITER executes on the backend nodes holding the DB’s
shard files.
- (HTTPD) `chttpd_db:db_doc_req(#httpd{method='PUT', user_ctx=Ctx}=Req, Db, DocId)`
  - `couch_doc:doc_from_multi_part_stream`
  - `chttpd_db:send_updated_doc`
    - `chttpd_db:update_doc`
      - `spawn_monitor` (FABRIC)
        - `fabric:update_doc`
          - `fabric:update_docs`
            - `fabric_doc_update:go`
              - `fabric_doc_update:handle_message`
              - `fabric_rpc:update_docs` (WRITER; via `rexi:cast`)
## Cluster setup
The cluster in question consists of 6 nodes spread over 2 data centres, with 3
nodes in each location. The database shard targeted by the PUT request is
replicated on 3 nodes, not all of which are in the same data centre.
Ping latency between nodes in the same data centre is under 0.2ms, while between nodes in different data centres it's around 8ms.
## Observations
By instrumenting the above code paths with more logging, we have observed the cluster executing the following sequence of events. These events happen over two requests, both of which are `PUT /db/doc` for the same doc ID and revision and have attachments encoded as `multipart/related`.
- Request 1 is received by HTTPD on the coordinator node, which we'll call node `A`. This hits the `multipart/related` clause; `W` is 2 and `num_mp_writers` is set to 3. The shard targeted by the write is stored on 3 other nodes, which we'll call `B`, `C` and `D`.
- This calls `couch_doc:doc_from_multi_part_stream`, which calls `couch_httpd_multipart:decode_multipart_stream`, which starts the PARSER process. This process uses a closure as a state machine to invoke `mp_parse_doc`, `mp_parse_atts` and `maybe_send_data`. HTTPD sends `get_doc_bytes` to PARSER, which replies with `doc_bytes`. This causes the attachments to be converted into `{follows, Parser, Ref}` tuples.
- Meanwhile, HTTPD starts the FABRIC process, which calls through to `fabric_doc_update:go`, which uses `rexi:cast` to forward the incoming doc to `fabric_rpc:update_docs` in the WRITER process. This converts `{follows, Parser, Ref}` into a `fun/0`, which is ultimately used in `flush_data` via `write_streamed_attachment` and `read_next_chunk`.
- This `fun/0` sends `get_bytes` to PARSER, which can handle it in several places: 1, 2, 3. (There is a lot of code duplication between `mp_parse_atts` and `maybe_send_data` that includes prioritising `get_bytes` over other messages.) PARSER replies with `bytes`. The code around this reply manages a buffer of attachment data; once a chunk has been read by all the WRITER procs, it is discarded. The parser expects 3 WRITER processes to request the data because `num_mp_writers` is set to 3. (A toy model of this exchange appears after this list.)
- When WRITER on nodes `B`, `C` or `D` receives `bytes`, it sends `attachment_chunk_received` to FABRIC on node `A`, which handles it in `fabric_doc_update`.
- We observe one of the nodes, say node `B`, doing several rounds of requesting data from PARSER and acking it with `attachment_chunk_received` before the other two nodes `C` and `D` request anything. PARSER behaves correctly, tracking which writers are waiting for data and how far through the stream each writer has got; it does not prematurely drop chunks as we previously observed in the "middleman" attachment handler. It appears `B` is consuming the attachment much faster than `C` and `D`.
- While `C` and `D` are still fetching the attachment, we observe an `ok` message from node `B` as it has finished reading the attachment and completed the write. So we have had one ack, but we need one more to achieve quorum.
- At this point an identical request, Request 2, is received by another HTTPD process, also running on node `A`. Everything proceeds as above, spawning PARSER and FABRIC processes. The PARSER again sees `num_mp_writers` is 3 and so expects 3 processes to fetch the data before it drops the buffered chunks.
- In request 2, the first message that FABRIC receives is an `ok` from node `B`. We observe PARSER receiving `get_bytes` from nodes `C` and `D`, and those nodes send `attachment_chunk_received` to FABRIC. However, node `B` never requests anything from PARSER, and no other signal is received by PARSER that would tell it that one of the writers already has the attachment data and doesn't need to fetch it. No writer process is started on `B`, so there isn't even a `DOWN` signal that PARSER could detect.
- Because of this, PARSER ends up waiting for a `get_bytes` message that will never arrive, and eventually hits an `after 300000` timeout in its `receive` blocks in `mp_parse_atts` or `maybe_send_data`.
- Up in `chttpd_db`, `send_updated_doc` returns quickly; request 2 still achieves quorum from the rapid `ok` from node `B` and an eventual ack from one of the other nodes. However, we then call `WaitFun()`, which waits for PARSER to exit. This blocks request 2 from returning for 5 minutes, even though the write has completed correctly.
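To make that failure mode concrete, here is a small, self-contained Erlang model of the buffering protocol as described above. This is not CouchDB's code (the real logic lives in `couch_httpd_multipart` and is considerably more involved); only the `get_bytes`/`bytes`/`attachment_chunk_received` message names and the writer count are taken from the observations, everything else is invented for illustration.

```erlang
-module(mp_parser_model).
-export([demo/1]).

%% Toy stand-in for PARSER: one buffered chunk, and a count of how many
%% writers are still expected to ask for it with get_bytes. If a writer
%% never asks, the receive falls through to the same kind of
%% `after 300000` (5 minute) timeout described above.
parser_loop(_Chunk, 0) ->
    ok;
parser_loop(Chunk, WritersLeft) ->
    receive
        {get_bytes, From} ->
            From ! {bytes, Chunk},
            parser_loop(Chunk, WritersLeft - 1)
    after 300000 ->
        exit(attachment_request_timeout)
    end.

%% Toy stand-in for a WRITER on a backend node: fetch the chunk, then ack
%% to the coordinator, mirroring the attachment_chunk_received message.
writer(Parser, Coordinator) ->
    Parser ! {get_bytes, self()},
    receive
        {bytes, _Chunk} -> Coordinator ! {attachment_chunk_received, self()}
    end.

%% demo(3) returns `normal` almost immediately: all three expected writers
%% request the chunk. demo(2) models request 2 above, where node B never
%% spawns a writer: the parser waits five minutes and then dies with
%% `attachment_request_timeout`.
demo(WritersThatShowUp) ->
    Coordinator = self(),
    {Parser, Ref} = spawn_monitor(fun() -> parser_loop(<<"chunk">>, 3) end),
    [spawn(fun() -> writer(Parser, Coordinator) end)
     || _ <- lists:seq(1, WritersThatShowUp)],
    receive {'DOWN', Ref, process, Parser, Reason} -> Reason end.
```

In the real parser the buffer holds multiple chunks, tracks each writer's position, and prioritises `get_bytes` over other messages; none of that changes the basic accounting shown here.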
## Comments
This situation is made more likely by the layout of the cluster; node B is in
the same data centre as node A and so has much lower latency than nodes C
and D. So it's likely to complete the write much quicker than the other nodes,
leaving a window where request 1 has one ack and doesn't have quorum yet when
request 2 starts.
Request 2 gets into this state because request 1 has partially succeeded: one of
the nodes has completed the write but we don't have quorum yet. It is worth finding
out what would happen in request 2 if two nodes had completed the write for request 1
when request 2 starts; even though this is enough for quorum, `WaitFun()`
causes request 1 not to return until all 3 nodes have read the attachment or
timed out. But two acks are enough for quorum, so it might put the doc in a
different state such that request 2 doesn't even try to write anything or
forward the attachment data.
For replication requests, this would not result in data loss; it's more of an efficiency concern. Replication requests are allowed to add revs anywhere in the tree, so the second request should just be a no-op. For interactive edits, the first request will change what the leaves of the rev tree are and might change the winning rev, so it could cause request 2 to be rejected -- again, we should examine what effect partial success of request 1 has here.
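As a rough illustration of that difference, here is a toy decision table (not CouchDB's `couch_key_tree` logic; all names are invented) for re-submitting a revision that a shard copy has already seen:

```erlang
-module(update_type_model).
-export([outcome/3]).

%% Toy decision table only. Revs are opaque terms; KnownRevs/Leaves describe
%% one shard copy's rev tree.
outcome(replicated_changes, Rev, KnownRevs) ->
    case lists:member(Rev, KnownRevs) of
        true  -> no_op;    %% rev already in the tree: the write changes nothing
        false -> merged    %% otherwise the edit branch is merged into the tree
    end;
outcome(interactive_edit, BaseRev, Leaves) ->
    case lists:member(BaseRev, Leaves) of
        true  -> written;  %% base rev is still a leaf: update accepted
        false -> conflict  %% base rev has been superseded: request rejected
    end.
```

In the scenario above both requests carry the same revision, so with `new_edits=false` the second write lands in the `no_op` branch, whereas an interactive resubmission would land in `conflict` once the first write has gone through.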
Finally, we note that we currently have at least two methods for proxying
attachment data to backend nodes: the `follows` format from a
`multipart/related` request, and the `fabric_attachment_receiver` format
with the "middleman" process. There are also two different ways of invoking the
intermediate funs depending on whether the attachment length is known or
unknown. There's a lot of duplicate code for dealing with buffering the
attachment data to multiple backends, and we recently discovered and fixed a bug
in the "middleman" implementation, also discovered on this cluster. It may
be worth designing a single implementation for this buffer system.
That's quite the report, thanks! When you say
> These events happen over two requests, both of which are `PUT /db/doc` for the same doc ID and revision
are you referring to the base revision against which the update is being applied? If so, isn't this bit surprising?
> In request 2, the first message that FABRIC receives is an `ok` from node `B`.
Without pulling up all the code modules, I would have guessed that node B would reject request 2 as an edit conflict. Or were you running with `new_edits=false` here?
> are you referring to the base revision against which the update is being applied? If so, isn't this bit surprising?
This is a function of the client app, where these two requests can happen concurrently from two different clients hitting the same exact coordinating node. A round-robin request assignment would make this a lot less likely to occur.
> Or were you running with `new_edits=false` here?
Second para:
> This issue was observed on replication requests with `new_edits=false`, and part of the code path for this request differentiates between `replicated_changes` and `interactive_edit`. However, we don't believe this problem is specific to replication requests.
> are you referring to the base revision against which the update is being applied?
No, this is a replication request, so I mean that both are for the same document ID and contain the same `_revisions` value, i.e. both requests are uploading the same version of the document.
Ah, thanks! Then the ok response from node B makes more sense, but I can see where we'd still run into the same issue of the PARSER waiting around for a receiver that was never spawned in the interactive case as well.
I guess the expedient fix here might be for the FABRIC process to kill off the PARSER when it knows that a response has been received from every worker?
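A minimal sketch of that idea, with invented names and accounting (this is not the current `fabric_doc_update` code):

```erlang
-module(coordinator_cleanup_sketch).
-export([note_reply/2]).

%% Hypothetical: RepliesStillExpected counts workers the coordinator is
%% still waiting on. Once the last one has answered, nobody will ever send
%% get_bytes again, so the parser can be stopped rather than left to hit
%% its 5-minute receive timeout.
note_reply(Parser, 1) ->
    exit(Parser, kill),  %% or a graceful stop message the parser understands
    done;
note_reply(_Parser, RepliesStillExpected) ->
    {wait, RepliesStillExpected - 1}.
```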
I agree that this attachment buffering system is overly complex and poorly designed (speaking as one of the designers). It'd be a pretty interesting project to design a greenfield replacement application and then figure out how to integrate that into CouchDB.
@kocolosk I wouldn’t want to introduce a layering violation here, but could another option be for the receiving node to send a `no_bytes_will_ever_be_requested` message to PARSER so it can decrement the workers it waits for?
> send a `no_bytes_will_ever_be_requested` message to PARSER so it can decrement the workers it waits for?
@janl Yeah, I wonder if that would work. And we'd explicitly check for the `new_edits=false` case only.
@kocolosk what do you think, would that work? In the interactive case I'd think we'd get back a conflict, or would we have the exact same problem?
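In terms of the toy `parser_loop/2` sketched earlier, the proposal would look roughly like this; the message name is the one suggested above, and none of this is existing CouchDB code:

```erlang
%% Hypothetical extension of the earlier parser_loop/2 sketch: a worker (or
%% the coordinator on its behalf) announces that it will never request the
%% attachment, so the parser lowers the number of get_bytes requests it is
%% still waiting for instead of timing out.
parser_loop(_Chunk, 0) ->
    ok;
parser_loop(Chunk, WritersLeft) ->
    receive
        {get_bytes, From} ->
            From ! {bytes, Chunk},
            parser_loop(Chunk, WritersLeft - 1);
        no_bytes_will_ever_be_requested ->
            parser_loop(Chunk, WritersLeft - 1)
    after 300000 ->
        exit(attachment_request_timeout)
    end.
```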
@kocolosk This has prompted me to rethink how this would work for normal (new_edits != false) writes. For replication writes, both requests can succeed b/c replication writes can add revs anywhere in the tree; if the revs already exist, no problem.
For normal writes, one of these concurrent requests must fail, or you have a causality violation. Let's refer to the writer procs on B, C and D for requests 1 and 2 as B1, C1, D1 and B2, C2, D2. In our example above, B1 succeeds first, before request 2 begins. This means B2 must fail, because the rev param in these identical requests is no longer a leaf in node B's copy of the DB.
The rest of the processes C1, D1, C2, D2 are still ongoing and there's nothing forcing them to finish in any particular order. It should be the case that if C1 succeeds, then C2 fails and vice versa, same for the D writers. This in turn should mean that only one request achieves quorum, but it could be request 2, even if this is unlikely in general.
So this still should not result in a consistency violation, but it's worth seeing if there's any way this allows both requests to succeed.
Yes, that's correct @jcoglan. The question I've been mulling over for the past few minutes is whether an interactive edit conflict is detected by PARSER, or whether we have the same issue there because the attachment writers are never spawned. Should be pretty straightforward to test; in the worst case any multipart/related request that results in an edit conflict would hit the timeout, no data race required. But I was hoping to track down that logic in the codebase itself, and as you're well aware that's a challenging exercise for this bit of functionality.
I'm OK with the suggestion for the FABRIC coordinator process to inform the PARSER that a writer won't be requesting data. I'm still curious to understand what happens in the case of an interactive edit conflict.
Oh, and as to the consistency violation -- I strongly suspect we're immune, because the ultimate source of truth for whether the update is committed is in the normal couch_db_updater logic, where we've run all manner of tests stressing the correctness of the system under concurrent load. We could have plenty of cases where C1 and C2 are both writing attachment data to disk concurrently, but in the end the revision tree for a particular shard copy will only commit the update from one of them in interactive mode.
> The question I've been mulling over for the past few minutes is whether an interactive edit conflict is detected by PARSER, or whether we have the same issue there because the attachment writers are never spawned
The thing to check for here would be that if request 2 gets one failing response from a backend, b/c that backend already accepted request 1, but it goes on to receive ok from the other two backends, will it respond quickly? Or will it hit the same timeout because the backend that rejected the write is not requesting the attachment data?
This probably depends on at what point the backend decides to reject the write, and whether it does this before or after fetching the attachment bodies. If it does it before, it needs to signal to the parser that it's not going to send get_bytes.
It turns out that the backend does reject the write before fetching the attachment bodies, but `fabric:update_doc` will throw the edit conflict, which means the handler in `chttpd_db` skips the `WaitFun()` and instead calls `abort_multipart_stream`, which causes the parser to consume and discard all the remaining attachment data on the wire.
So, that covers the interactive case. I still suspect that the `new_edits=false` case does not require this concurrent data race in order to trigger the timeout. It would be enough to submit the same request twice in a row; as far as I can tell at the moment, the second request should hit the `WaitFun()` timeout.
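The control flow, reduced to a toy with invented helper names (not the actual `chttpd_db` code): `UpdateFun` stands for the update call that may throw the conflict, `WaitFun` for the wait on the parser, and `AbortFun` for the `abort_multipart_stream` call.

```erlang
-module(update_paths_sketch).
-export([respond_after_update/3]).

%% Two exit paths: on success we wait for the parser (which is where the
%% 5-minute stall can happen); on an edit conflict we skip the wait and
%% drain the remaining multipart data instead, leaving the connection clean.
respond_after_update(UpdateFun, WaitFun, AbortFun) ->
    try UpdateFun() of
        Result ->
            WaitFun(),
            Result
    catch
        throw:conflict ->
            AbortFun(),
            conflict
    end.
```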
I tried adding an integration test for this in e85e34359 but it doesn't do quite what you'd expect. I tried writing a doc with attachments and then saving that same revision again using `new_edits=false`. The client does receive a rapid response to both requests, but (I think) the attachment bytes are never consumed off the wire, and so the next request on the connection from a different part of the test suite hangs. The second `new_edits=false` multipart request never shows up in the logs because the test suite times out first and destroys the DB.
Perhaps this is also what you were observing in production?
At any rate, I added the test in part to demonstrate that the data race with the quorum logic might be a red herring. Any time a `multipart/related` request with `new_edits=false` is made with a revision that already exists in the database, the server does not properly clean up the connection.
That's a good test. If we run integration tests in n=1-only mode, this could reveal a separate issue where we drop the connection and never close it properly when attachments are ignored. I guess with n>1 it would happen as well, if all the nodes already have the revision, because we would never pull the data bytes off the socket. We had tried before to force-close such connections to prevent them from being recycled [1]. Whether the client notices the close quickly enough is another issue...
In the case where n>1 and some nodes read the attachment, for replicated_changes we would perhaps want to inspect the doc when we ignore it, and if it has any attachment parser processes, explicitly notify them when the update is dropped:
https://github.com/apache/couchdb/blob/e4b8a4624fc7ae09b4649aac9a8d68226208cd8b/src/couch/src/couch_db.erl#L1240-L1241
[1] https://github.com/apache/couchdb/blob/2c23d8e0fc8e44fa3f91228aa93002a9977e7221/src/couch/src/couch_httpd.erl#L988-L999 (I am afraid this never actually made it to the chttpd side!)
> I guess with n>1 it would happen as well, if all the nodes already have the revision, because we would never pull the data bytes off the socket.
Yes, I think so, although I'm now realizing that "pulling bytes off the socket" is not necessarily the issue in either the n=1 or n>1 case (with or without concurrent data races). The parser might well consume all the request data (and for small requests it probably does, even if no worker ever requests it), but the HTTPD process will still be stuck in WaitFun() after sending the response to the client. The connection stays open, and the client sends a new request on that connection, but HTTPD won't act on it until that 5 minute timeout fires.
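A toy model of that effect (not chttpd; the `io:format` calls just mark where the response has gone out and where the connection becomes usable again):

```erlang
-module(waitfun_sketch).
-export([demo/0]).

%% The "HTTPD" process has already sent its response, but it still owns the
%% socket and is parked inside WaitFun() until the parser exits, which (if
%% no writer ever asks for bytes) only happens via the parser's own
%% 5-minute timeout.
demo() ->
    Parser = spawn(fun() ->
        receive get_bytes -> ok
        after 300000 -> ok           %% nobody ever asks; give up after 5 min
        end
    end),
    WaitFun = fun() ->
        MRef = erlang:monitor(process, Parser),
        receive {'DOWN', MRef, process, Parser, _} -> ok end
    end,
    io:format("response already sent to the client~n"),
    WaitFun(),                       %% blocks here for ~5 minutes
    io:format("only now is the next request on this connection read~n").
```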
I think this issue is fixed. Closing.