
feat(`mango`): rolling execution statistics (exploration)

pgj opened this issue 2 years ago • 4 comments

Overview

For map-reduce views, the arrival of the complete message at the view callback (on the shard) is not guaranteed when a stop is issued during aggregation (at the coordinator). As a consequence, the internally collected shard-level statistics may not be fed back to the coordinator, which can cause data loss and hence inaccuracy in the overall execution statistics.

Address this issue by switching to a "rolling" model where row-level statistics are immediately streamed back to the coordinator. Support mixed-version cluster upgrades by activating this model only if it is requested through the map-reduce arguments and the given shard supports it.
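
For illustration, here is a minimal sketch of what a shard-side view callback could look like under such a model. All module, function, and message names are hypothetical and only mirror the description above; this is not the actual Mango code.

```erlang
%% Hypothetical sketch of the "rolling" model described above.
-module(rolling_stats_sketch).
-export([view_cb/3]).

%% Send delivers a message from the shard worker to the coordinator.
%% With rolling stats enabled, per-row statistics are streamed immediately
%% instead of being accumulated and only flushed with the final `complete`.
view_cb({row, Row}, #{rolling_stats := true} = Acc, Send) ->
    Send({execution_stats, row_stats(Row)}),
    Send({row, Row}),
    {ok, Acc};
view_cb({row, Row}, Acc, Send) ->
    %% legacy behaviour: stats stay on the shard until `complete`, which
    %% may never reach the coordinator if the worker is stopped early
    Send({row, Row}),
    {ok, accumulate(Row, Acc)};
view_cb(complete, Acc, Send) ->
    Send({complete, maps:get(stats, Acc, #{})}),
    {ok, Acc}.

row_stats(_Row) ->
    #{docs_examined => 1}.

accumulate(_Row, Acc) ->
    Stats = maps:get(stats, Acc, #{docs_examined => 0}),
    Acc#{stats => maps:update_with(docs_examined, fun(N) -> N + 1 end, Stats)}.
```

The rolling branch only fires when the request opts in, so old coordinators in a mixed-version cluster keep receiving the legacy, complete-based statistics.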

This is only a proposal, a way to explore the approach; comments and feedback are welcome. Remarks:

  • It clearly uses more bandwidth as it will send more execution_stats messages. Is this acceptable?
  • It does not require changes in fabric itself; the whole logic is kept on the Mango side, and it is compatible with the previous solution.
  • Are there other ways to stop the aggregation as soon as the limit is reached?
  • Are there other kinds of messages coming from the shards that could be reliably used to signal the end of statistics collection (alternatives to capturing complete)?

Testing recommendations

Running the respective Mango unit and integration test suites might suffice (which is done by the CI):

make eunit apps=mango
make mango-test MANGO_TEST_OPTS="15-execution-stats-test"

But there is a detailed description in the related ticket (see below) on how to trigger the problem. Feel free to kick the tires.

Related Issues or Pull Requests

Fixes #4560

Checklist

  • [x] Code is written and works correctly
  • [x] Changes are covered by tests

pgj avatar Aug 22 '23 20:08 pgj

Thanks for the comments @rnewson!

If I understand correctly, the problem is caused by the way fabric and rexi are currently implemented and the way Mango uses them for view queries. The stop reply at the coordinator level makes fabric stop all the workers immediately, so there is no chance for the complete message to arrive. This decision happens when no more rows are needed and the coordinator already has all the required data at hand. To record the execution statistics properly, it would have to wait a little longer for the complete message to arrive, but that is not something fabric supports on receiving a stop. The whole execution therefore becomes a race: if the worker is not quick enough to send the complete message before it is stopped, the statistics data is lost.
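
Condensed into a purely illustrative sketch (not the actual fabric coordinator), the race looks roughly like this:

```erlang
%% Illustrative sketch of the race described above.
-module(stop_race_sketch).
-export([collect_rows/2]).

%% Once the coordinator has all the rows it needs it stops reading its
%% mailbox and the workers are killed, so a `complete` message (and the
%% shard-level stats it carries) that is still in flight is never seen.
collect_rows(Acc, Needed) when Needed =< 0 ->
    {stop, lists:reverse(Acc)};
collect_rows(Acc, Needed) ->
    receive
        {row, Row} ->
            collect_rows([Row | Acc], Needed - 1);
        {complete, Stats} ->
            %% only reached if the worker managed to send `complete`
            %% (with its accumulated stats) before being stopped
            {ok, lists:reverse(Acc), Stats}
    end.
```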

The complete message is sent only if the limit for fabric reaches zero:

https://github.com/apache/couchdb/blob/86df356a8a396a5c165107e6d57f405741257a99/src/fabric/src/fabric_view_map.erl#L178-L204

But Mango implements its own limit on top of views, which is independent of that:

https://github.com/apache/couchdb/blob/86df356a8a396a5c165107e6d57f405741257a99/src/mango/src/mango_cursor_view.erl#L526-L540
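
Put side by side in a hypothetical sketch (which does not mirror the linked sources line by line, it only condenses the interplay described here), the mismatch between the two limits is:

```erlang
%% Hypothetical condensation of the two snippets linked above.
-module(limit_mismatch_sketch).
-export([fabric_row/2, mango_row/2]).

%% fabric only regards the view as complete -- and only then delivers
%% `complete` -- once its own limit has counted down to zero.
fabric_row(_Row, #{limit := 0} = State) ->
    {deliver_complete, State};
fabric_row(Row, #{limit := L} = State) ->
    {deliver_row, Row, State#{limit := L - 1}}.

%% Mango keeps its own, independent counter and replies `stop` as soon as
%% it has enough rows -- typically before fabric's limit reaches zero, so
%% `complete` (and the stats attached to it) may never be delivered.
mango_row(_Row, #{remaining := 0} = Cursor) ->
    {stop, Cursor};
mango_row(Row, #{remaining := N} = Cursor) ->
    {ok, Row, Cursor#{remaining := N - 1}}.
```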

That is, in fabric's terms, Mango's occasional stop is technically unexpected. In such cases, it is basically passed down to rexi:

https://github.com/apache/couchdb/blob/86df356a8a396a5c165107e6d57f405741257a99/src/fabric/src/fabric_view_map.erl#L106-L122

There, stop means that no further messages are read from the coordinator's mailbox.

https://github.com/apache/couchdb/blob/86df356a8a396a5c165107e6d57f405741257a99/src/rexi/src/rexi_utils.erl#L41-L64

Then the main loop in fabric ends and the workers are cleaned up:

https://github.com/apache/couchdb/blob/86df356a8a396a5c165107e6d57f405741257a99/src/fabric/src/fabric_view_map.erl#L63-L68

That is why I was unsure whether the approach in the PR is the right one, or whether it only tackles the symptoms without fixing the underlying root cause. But I am afraid that fabric cannot offer guarantees about complete in such cases, because that is not how it was originally designed...?

My other concern was certainly the bandwidth usage itself. With this approach, the number of messages sent per row doubles, which may have performance implications of its own. But I cannot judge how much this means in practice, or whether there is an optimization somewhere down the stack that helps with it and amortizes the associated cost.

pgj avatar Sep 12 '23 12:09 pgj

That's very helpful background. You are right that it wasn't anticipated that we would need information back from a worker that we know we no longer need in order to calculate the response.

It would be better to address that directly if we can.

rnewson avatar Sep 12 '23 12:09 rnewson

To expand on that last comment, perhaps we alter how workers are killed? Today we do exit(Pid, kill) which is obviously untrappable. As an option to fabric_util:cleanup/1 perhaps we could instead do exit(Pid, finish) (or some other word). I think the workers would get a {rexi_EXIT, finish} message. On receipt of that they'd "complete" whatever made sense for them (in our case, simply messaging out the execution stats up to that point, but not performing any new work). The fabric coordinator would have to wait for either that reply or notification that the worker processes terminated for some other reason, before finishing up.

This ensures we send no more messages than today, at the small cost of making the coordinator wait for one message from each worker that it would otherwise have unilaterally and asynchronously killed.
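
A rough sketch of how this could look, using plain OTP exit trapping and monitors instead of rexi's wrapping; the finish reason and all message shapes here are hypothetical, not existing fabric/rexi API:

```erlang
%% Illustrative sketch of the "finish instead of kill" idea.
-module(finish_cleanup_sketch).
-export([cleanup_worker/1, worker_init/1]).

%% Coordinator side: instead of exit(Pid, kill), ask the worker to wrap up,
%% then wait for its final stats or for it to terminate for another reason.
cleanup_worker(Pid) ->
    MRef = erlang:monitor(process, Pid),
    exit(Pid, finish),
    receive
        {execution_stats, Pid, Stats} ->
            erlang:demonitor(MRef, [flush]),
            {ok, Stats};
        {'DOWN', MRef, process, Pid, _Reason} ->
            {ok, no_stats}
    after 5000 ->
        exit(Pid, kill),
        {error, timeout}
    end.

%% Worker side: trap exits so the `finish` signal becomes a message, flush
%% the statistics gathered so far, and stop without taking on new work.
worker_init(Coordinator) ->
    process_flag(trap_exit, true),
    worker_loop(Coordinator, #{rows_read => 0}).

worker_loop(Coordinator, Stats) ->
    receive
        {'EXIT', _From, finish} ->
            Coordinator ! {execution_stats, self(), Stats},
            ok;
        {row, _Row} ->
            worker_loop(Coordinator,
                        maps:update_with(rows_read, fun(N) -> N + 1 end, Stats))
    end.
```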

rnewson avatar Sep 12 '23 13:09 rnewson

Because I did not want to lose the original description of this PR along with the discussion here, and because I wanted to put the change in a different perspective in the light of #4812, I forked it as #4958. After talking with @chewbranca about the problem and its solution, and considering that he has been working on a model that follows a similar approach, the current code seems feasible.

I have studied @rnewson's suggestions, but neither of them led to a clear solution to the problem. I am inclined to believe (though perhaps I am wrong here) that fixing the issue on the fabric side would be a more complicated change and as such carries some risks.

pgj avatar Jan 10 '24 17:01 pgj

Closing this in favor of #4958.

pgj avatar Mar 27 '24 14:03 pgj