Long running erlang map/reduce can block view compaction from completion, leaking erlang procs
Description
A long-running/slow Erlang map/reduce build, caused by a new shard deployment, appears to be blocking that shard's view compaction from completing. It also appears to be leaking/growing Erlang procs at a steady rate of between 5k and 10k per hour.
Steps to Reproduce
1. Start view compaction.
2. Start a long Erlang map/reduce index build.
3. View compaction tries to complete but is unable to until the indexer completes (suspected; waiting to observe this outcome).
4. Observe a steady increase in Erlang procs (may require continued insertion/interaction with the shard).
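For reference, a rough sketch of driving those steps over the HTTP API; the host, admin:admin credentials, and database/design-doc/view names are placeholders, not values from this report:

```python
import requests

# Placeholders -- adjust for your cluster.
BASE = "http://admin:admin@127.0.0.1:5984"
DB, DDOC, VIEW = "dbname", "erlangstats", "by_key"

# 1. Start view compaction for the design document's indexes.
requests.post(f"{BASE}/{DB}/_compact/{DDOC}",
              headers={"Content-Type": "application/json"})

# 2. Kick off the (long) Erlang map/reduce build by querying the view.
requests.get(f"{BASE}/{DB}/_design/{DDOC}/_view/{VIEW}", params={"limit": 1})

# 3./4. Watch indexing/compaction progress and the Erlang process count.
print(requests.get(f"{BASE}/_active_tasks").json())
print(requests.get(f"{BASE}/_node/_local/_system").json()["process_count"])
```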
Expected Behaviour
- View compaction should not be blocked.
- Erlang procs should not continue to increase until they hit the limit and the node crashes.
Your Environment
- Hardware: AWS c6i.32xlarge, 5 nodes
- Cluster settings: q=3, n=5
- CouchDB version used: 3.2.2
- Operating system and version: Debian Buster
Additional Context
We resharded, which resulted in the Erlang map/reduce build taking a lot longer than it should have (it was not incremental).
An additional piece of useful info: while the index was running for the first time I pulled the following from the Erlang view's metadata; the leaking Erlang procs appear to be the "clients waiting for the index".
_design/erlangstatsstats metadata (Index Information):
- Language: Erlang
- Currently being updated? Yes
- Currently running compaction? Yes
- Waiting for a commit? Yes
- Clients waiting for the index: 719422
- Update sequence on DB: 257926611
- Processed purge sequence: 0
- Actual data size (bytes): 602,563,809,246
- Data size on disk (bytes): 1,187,591,035,418
- MD5 Signature:
This does eventually resolve gracefully, given enough Erlang procs and storage. An additional change that had to be made to keep on top of storage was to increase the view ratio smoosh channel's concurrency value, since the stuck compactions prevented other compactions from running.
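For anyone hitting the same thing, a small sketch of bumping that concurrency at runtime through the node config API; the host, credentials, node name, the value 4, and the assumption of the default ratio_views channel are all placeholders, and the same change can be made in local.ini:

```python
import requests

# Placeholders: local node, admin:admin credentials, concurrency of 4,
# and the default ratio_views smoosh channel.
BASE = "http://admin:admin@127.0.0.1:5984"

# Equivalent to setting "concurrency = 4" under [smoosh.ratio_views] in
# local.ini, but applied at runtime without a restart.
r = requests.put(
    f"{BASE}/_node/_local/_config/smoosh.ratio_views/concurrency",
    json="4",
)
print(r.json())  # the config API returns the previous value
```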
One strategy could be to periodically poll the https://docs.couchdb.org/en/stable/api/ddoc/common.html#db-design-design-doc-info endpoint and wait until the index has finished building before querying it, to avoid piling up too many client requests if the index is large.
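Roughly, that polling could look like the sketch below; the host, credentials, and db/ddoc names are placeholders, and the field names follow the design-doc info response linked above:

```python
import time
import requests

# Placeholders -- adjust for your cluster.
BASE = "http://admin:admin@127.0.0.1:5984"
DB, DDOC = "dbname", "erlangstatsstats"

def wait_for_index(poll_seconds=30):
    """Poll GET /{db}/_design/{ddoc}/_info until the indexer is idle."""
    while True:
        vi = requests.get(f"{BASE}/{DB}/_design/{DDOC}/_info").json()["view_index"]
        if not vi["updater_running"] and vi["waiting_clients"] == 0:
            return
        print("still building, clients waiting:", vi["waiting_clients"])
        time.sleep(poll_seconds)

wait_for_index()
# Only start sending regular view queries once this returns.
```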
Using a larger q (resharding) could also help parallelize index building if you have the computation and disk throughput resources.
Yeah Nick, in our case unfortunately this was a live production server, so we had no trivial means of blocking users from attempting to access the view. Worth noting: none of these clients were actually waiting; all view requests to this view use stable=false&update=lazy.
https://docs.couchdb.org/en/stable/best-practices/views.html#deploying-a-view-change-in-a-live-environment
Hi, actually I can't see any outstanding lines in debug mode in the log. There are just no logs from yesterday, the process is unable to be recognized, and top shows 5.0. It's not consuming too much memory.
Do you know how to flush debug output from Erlang?
-
I'll second @rnewson's proposal to try an old-ddoc/new-ddoc strategy to deploy new views.
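A sketch of that strategy over the HTTP API, following the "deploying a view change in a live environment" document linked earlier; the host, credentials, and design-doc names are placeholders:

```python
import time
import requests

# Placeholders -- adjust for your cluster.
BASE = "http://admin:admin@127.0.0.1:5984"
DB = "dbname"
OLD, NEW = "_design/erlangstatsstats", "_design/erlangstatsstats-new"

# 1. Publish the changed views under a *new* design doc name so the old
#    index keeps serving traffic while the new one builds in the background.
new_doc = requests.get(f"{BASE}/{DB}/{OLD}").json()
new_doc.pop("_id", None)
new_doc.pop("_rev", None)
# ...edit new_doc["views"] here with the changed map/reduce functions...
requests.put(f"{BASE}/{DB}/{NEW}", json=new_doc)

# 2. Wait until the new index has finished building before cutting over
#    (in practice also compare its update_seq against the database's).
while True:
    vi = requests.get(f"{BASE}/{DB}/{NEW}/_info").json()["view_index"]
    if not vi["updater_running"]:
        break
    time.sleep(30)

# 3. Point clients at the new design doc, or copy it back over the old name;
#    the rebuilt index is reused because the view signature matches.
```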
-
For clients, you could use `stable=false&update=false` and let ken (the index auto-builder) build the indexes for you in the background. Monitor progress with `_active_tasks`.
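From the client side that could look roughly like the sketch below (placeholder host, credentials, and names); with update=false the query itself never triggers or waits on an index build:

```python
import requests

# Placeholders -- adjust for your cluster.
BASE = "http://admin:admin@127.0.0.1:5984"
DB, DDOC, VIEW = "dbname", "erlangstatsstats", "by_key"

# Query without triggering or waiting on an index build; results may be
# stale until ken has finished building the index in the background.
rows = requests.get(
    f"{BASE}/{DB}/_design/{DDOC}/_view/{VIEW}",
    params={"stable": "false", "update": "false", "limit": 10},
).json()

# Watch the background build via _active_tasks ("indexer" task type).
for task in requests.get(f"{BASE}/_active_tasks").json():
    if task.get("type") == "indexer":
        print(task.get("design_document"), task.get("progress"), "%")
```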
-
There is an undocumented `[smoosh.ignore]` setting (`$shard = true` entries) to allow the auto-compactor to ignore specific shards. For example:
[smoosh.ignore]
shards/e0000000-ffffffff/dbname.1660859921 = true
-
@fr2lancer if you're asking about debug logging for compaction/auto-compaction, see https://github.com/apache/couchdb/issues/4815#issuecomment-1791518288. It's a bit tricky to set up, but it should work.
-
In your version of CouchDB, 3.2.2, we had a bug calculating the slack and ratio values that ended up triggering the auto-compactor too often. That was fixed in 3.3.0 (https://github.com/apache/couchdb/pull/4264), so consider upgrading to 3.3.3 if possible; you might find some of the compactions no longer trigger as often.