
Long-running Erlang map/reduce can block view compaction from completing, leaking Erlang procs

Open KangTheTerrible opened this issue 1 year ago • 7 comments

Description

A long-running/slow Erlang map/reduce, triggered by a new shard deployment, appears to block that shard's view compaction from completing. It also appears to leak/grow Erlang procs at a steady rate of 5k-10k per hour.

Steps to Reproduce

  1. Start view compaction.
  2. Start a long-running Erlang map/reduce.
  3. View compaction tries to complete but cannot until the indexer finishes (suspected; waiting to observe this outcome).
  4. Observe a steady increase in Erlang procs (may require continued insertion/interaction with the shard).
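The first two steps above map onto CouchDB's HTTP API; a minimal sketch of the requests involved, assuming a local node at `127.0.0.1:5984` and placeholder database/design-doc names (`mydb`, `erlangstats`):

```python
# Sketch of the CouchDB HTTP calls behind the reproduction steps.
# BASE, "mydb", and "erlangstats" are placeholder names, not from the issue.
BASE = "http://127.0.0.1:5984"

def compact_view_request(db: str, ddoc: str) -> tuple[str, str]:
    # POST /{db}/_compact/{ddoc} triggers view compaction for that design doc.
    return ("POST", f"{BASE}/{db}/_compact/{ddoc}")

def view_query_request(db: str, ddoc: str, view: str) -> tuple[str, str]:
    # The first GET of a view after a ddoc change kicks off a full index build,
    # which for a non-incremental Erlang map/reduce can run for a long time.
    return ("GET", f"{BASE}/{db}/_design/{ddoc}/_view/{view}")
```

These return `(method, url)` pairs rather than issuing the requests, so the shape of the calls is visible without a live cluster.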

Expected Behaviour

  • View compaction should not be blocked.
  • Erlang procs should not continue to increase until the limit is hit and the node crashes.

Your Environment

AWS c6i.32xlarge, 5 nodes, q=3, n=5

  • CouchDB version used: 3.2.2
  • Operating system and version: Debian Buster

Additional Context

We resharded, which resulted in the Erlang map/reduce taking a lot longer than it should (it was not incremental).

KangTheTerrible avatar Aug 10 '23 11:08 KangTheTerrible

An additional piece of useful info: while the index was running for the first time, I got this from the Erlang view's metadata. The leaking Erlang procs appear to be the "clients waiting for the index".

_design/erlangstatsstats Metadata — Index Information

  Language: Erlang
  Currently being updated? Yes
  Currently running compaction? Yes
  Waiting for a commit? Yes
  Clients waiting for the index: 719422
  Update sequence on DB: 257926611
  Processed purge sequence: 0
  Actual data size (bytes): 602,563,809,246
  Data size on disk (bytes): 1,187,591,035,418
  MD5 Signature:

KangTheTerrible avatar Aug 11 '23 17:08 KangTheTerrible

This does eventually resolve gracefully, given enough Erlang procs and storage. An additional change needed to keep on top of storage was to increase the smoosh concurrency value for the view ratio channel, since stuck compactions prevented other compactions from running.

KangTheTerrible avatar Aug 17 '23 17:08 KangTheTerrible

One strategy could be to periodically poll the https://docs.couchdb.org/en/stable/api/ddoc/common.html#db-design-design-doc-info endpoint and wait until the index has finished building before querying it, to avoid piling up too many client requests when the index is large.
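A minimal sketch of this polling strategy, assuming the `view_index` section of the `_info` response carries the update/commit flags shown in the metadata above (the base URL and database/ddoc names are placeholders):

```python
# Sketch: poll GET /{db}/_design/{ddoc}/_info and wait for the index build to
# finish before sending view queries. Names below are placeholders.
import json
from urllib.request import urlopen

def index_ready(view_index: dict) -> bool:
    # Treat the index as "built" once the updater is idle and no commit is
    # pending; waiting_clients shows how many requests are already queued.
    return (not view_index.get("updater_running", False)
            and not view_index.get("waiting_commit", False))

def fetch_view_index(db: str, ddoc: str,
                     base: str = "http://127.0.0.1:5984") -> dict:
    # GET /{db}/_design/{ddoc}/_info reports build/compaction state
    # under the "view_index" key.
    with urlopen(f"{base}/{db}/_design/{ddoc}/_info") as resp:
        return json.load(resp)["view_index"]
```

A caller would loop on `fetch_view_index(...)` with a sleep between polls and only start querying the view once `index_ready(...)` returns true.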

Using a larger Q (resharding) could also help parallelize index building if you have the computation and disk throughput resources.

nickva avatar Aug 22 '23 01:08 nickva

Yeah Nick, in our case unfortunately this was a live production server, so we had no trivial means of blocking users from attempting to access the view. Worth noting: none of these clients were actually waiting; all view requests to this view use stable=false&update=lazy.

KangTheTerrible avatar Aug 22 '23 14:08 KangTheTerrible

https://docs.couchdb.org/en/stable/best-practices/views.html#deploying-a-view-change-in-a-live-environment

rnewson avatar Aug 24 '23 10:08 rnewson

Hi, actually I can't see any outstanding lines at debug level in the log. There are just no logs since yesterday, the process can't be identified, and load in top is at 5.0. It's not consuming too much memory.

Do you know how to flush debug output from Erlang?

fr2lancer avatar Nov 08 '23 23:11 fr2lancer

  • I'll second @rnewson's proposal to try an old-ddoc/new-ddoc strategy to deploy new views.

  • Clients could use stable=false&update=false and let ken (the index auto-builder) build the indices for you in the background. Monitor progress with _active_tasks.

  • There is an undocumented [smoosh.ignore] $shard = true setting to allow the auto-compactor to ignore specific shards. For example:

[smoosh.ignore]
shards/e0000000-ffffffff/dbname.1660859921 = true

  • @fr2lancer if you're asking about debug logging for compaction/auto-compaction, see https://github.com/apache/couchdb/issues/4815#issuecomment-1791518288. That's a bit tricky to set up, but it should work.

  • In your version, CouchDB 3.2.2, we had a bug in the slack and ratio calculations that ended up triggering the auto-compactor too often. Consider upgrading to 3.3.3 if possible; you might find some compactions don't trigger as often any longer. That was fixed in 3.3.0 (https://github.com/apache/couchdb/pull/4264).
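The stable=false&update=false query mode suggested above amounts to a different query string than the update=lazy requests mentioned earlier in the thread; a small sketch of building such a view URL, with placeholder base/database/view names:

```python
# Sketch: build a view query URL that serves the existing index without
# triggering or waiting on a rebuild. All names here are placeholders.
from urllib.parse import urlencode

def view_url(db: str, ddoc: str, view: str,
             base: str = "http://127.0.0.1:5984") -> str:
    # update=false: return whatever is already indexed, never kick the updater;
    # stable=false: any up-to-date shard replica may answer the request.
    params = urlencode({"stable": "false", "update": "false"})
    return f"{base}/{db}/_design/{ddoc}/_view/{view}?{params}"
```

With ken building the index in the background, requests shaped like this return immediately instead of joining the "clients waiting for the index" queue.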

nickva avatar Dec 06 '23 04:12 nickva