Continuous replication jobs missing from _active_tasks after a changes_reader_died,{timeout,ibrowse_stream_cleanup} message appears in the log
Some of the replication jobs did not restart after a changes_reader_died,{timeout,ibrowse_stream_cleanup} message.
Description
Jul 24 14:36:52 xxxx-xxxx-ro-us-west1-x couchdb[826059]:
ChangesReader process died with reason: {changes_reader_died,{timeout,ibrowse_stream_cleanup}}
Replication 0c59025774c3cc12f61c49f2e2c02c5d+continuous (https://xxxx-xxxx-master-x.pr-xxxx-xxxx.str.xxxxxxx.com/xxxx_xxxx-300/ -> https://xxxx-xxxxx-ro-us-west1-x.pr-xxxxx-xxxxxx.str.xxxxx.com/xxxx_xxxx-300/) failed: {changes_reader_died,{timeout,ibrowse_stream_cleanup}}
When the ChangesReader process died on xxxx_xxxx-300, the _scheduler/jobs entry did not crash and restart (the last crash was on 2025-07-23T15:22:33):
{ "database": "_replicator", "id": "0c59025774c3cc12f61c49f2e2c02c5d+continuous", "pid": "<0.1761.7093>", "source": "https://xxxx-xxxx-master-x.pr-xxxxx-xxxxx.str.xxxx.com/xxxx_xxxx-300/", "target": "https://xxxx-xxxxx-ro-us-west1-x.pr-xxxx-xxxxxx.str.xxxxx.com/xxxx_xxxx-300/", "user": null, "doc_id": "xxxxx_xxxx_replication_300", "info": { "revisions_checked": 3888719, "missing_revisions_found": 421066, "docs_read": 421064, "docs_written": 421064, "changes_pending": 0, "doc_write_failures": 0, "bulk_get_docs": 421064, "bulk_get_attempts": 421064, "checkpointed_source_seq": "22113348-g1AAAAJveJyl0EEOgjAQBdBGSFx4Fgi1ILKSQ3CBdqgWUoopuNYzuPI2eiVPgBMwbpvUzUwyk_x5GU0ICVVQkxz6C6halM0ouwg4KBl1fBiljbIY4rONhrG3OOsNt6DixuDKcK0xYMWJOEzT1KpAkGpz6nC25ls41jvmn-xQMadKlFjF9QfT7xkmISlkvvMPd8CoE2ZCrOSGDW3PBccfM46mlGc08T_w99cW3GvBfT93r2Yc2xcspbn_gfYDW0XUDA", "source_seq": "22113348-g1AAAAJveJyl0EEOgjAQBdBGSFx4Fgi1ILKSQ3CBdqgWUoopuNYzuPI2eiVPgBMwbpvUzUwyk_x5GU0ICVVQkxz6C6halM0ouwg4KBl1fBiljbIY4rONhrG3OOsNt6DixuDKcK0xYMWJOEzT1KpAkGpz6nC25ls41jvmn-xQMadKlFjF9QfT7xkmISlkvvMPd8CoE2ZCrOSGDW3PBccfM46mlGc08T_w99cW3GvBfT93r2Yc2xcspbn_gfYDW0XUDA", "through_seq": "22113348-g1AAAAJveJyl0EEOgjAQBdBGSFx4Fgi1ILKSQ3CBdqgWUoopuNYzuPI2eiVPgBMwbpvUzUwyk_x5GU0ICVVQkxz6C6halM0ouwg4KBl1fBiljbIY4rONhrG3OOsNt6DixuDKcK0xYMWJOEzT1KpAkGpz6nC25ls41jvmn-xQMadKlFjF9QfT7xkmISlkvvMPd8CoE2ZCrOSGDW3PBccfM46mlGc08T_w99cW3GvBfT93r2Yc2xcspbn_gfYDW0XUDA" }, "history": [ { "timestamp": "2025-07-23T15:22:33Z", "type": "started" }, { "timestamp": "2025-07-23T15:22:33Z", "type": "crashed", "reason": "{changes_reader_died,{timeout,ibrowse_stream_cleanup}}" },
Steps to Reproduce
Expected Behaviour
The replication job should appear in _active_tasks, whether in a 'running' or 'pending' state. The current issue is that once the job stops, it never restarts. It looks like CouchDB only checks that an entry exists in _scheduler/docs (and _scheduler/jobs) and assumes the corresponding _active_tasks entry will be there.
We verified that new documents added to the source never appear in the target database, even after waiting a few days.
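For what it's worth, one way to confirm the target is falling behind is to compare the database info documents (GET /{db}) on source and target. This is only a minimal sketch; the hosts and credentials below are placeholders, not the real endpoints from this report:

```python
# Sketch: compare source and target database info to spot replication lag.
# SOURCE, TARGET and AUTH are placeholders for this example.
import requests

SOURCE = "https://source.example.com/db-300"
TARGET = "https://target.example.com/db-300"
AUTH = ("admin", "password")

src = requests.get(SOURCE, auth=AUTH, timeout=30).json()
tgt = requests.get(TARGET, auth=AUTH, timeout=30).json()

print("source doc_count:", src["doc_count"], "update_seq:", src["update_seq"])
print("target doc_count:", tgt["doc_count"], "update_seq:", tgt["update_seq"])
if src["doc_count"] > tgt["doc_count"]:
    print("target is behind by", src["doc_count"] - tgt["doc_count"], "documents")
```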
As per the CouchDB documentation: "Changed in version 2.1.0: Because of how the scheduling replicator works, continuous replication jobs could be periodically stopped and then started later. When they are not running they will not appear in the _active_tasks endpoint"
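To list the jobs that the scheduler still knows about but that have no _active_tasks entry, something like the sketch below can be used (placeholder node URL and credentials). Note that jobs legitimately sitting in a 'pending' state will also show up in this output; the problem described here is that the affected jobs never leave that set.

```python
# Sketch: list replication jobs known to _scheduler/jobs that have no
# corresponding entry in _active_tasks. NODE and AUTH are placeholders.
import requests

NODE = "https://couchdb-node.example.com:5984"
AUTH = ("admin", "password")

jobs = requests.get(f"{NODE}/_scheduler/jobs", auth=AUTH, timeout=30).json()["jobs"]
tasks = requests.get(f"{NODE}/_active_tasks", auth=AUTH, timeout=30).json()

# doc_ids of replications that are actually running right now
active_doc_ids = {t.get("doc_id") for t in tasks if t.get("type") == "replication"}

for job in jobs:
    if job.get("doc_id") not in active_doc_ids:
        print("not in _active_tasks:", job["id"], job.get("doc_id"))
```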
Note: Sometimes, when changes_reader_died,{timeout,ibrowse_stream_cleanup} happens for a database, the _scheduler/jobs entry does crash and restart, and everything returns to normal.
[NOTE]: To restart replication for the missing databases, we have to bounce CouchDB on that node. But the same issue happens on other databases on other nodes after a few weeks.
Is there any other way we can restart replication without bouncing the node?
Note: we tried updating the failed replication's user ID and password with invalid entries, expecting that this would crash the _scheduler/jobs entry for that replication, which would then restart once the correct user ID and password were put back. But it did not crash the replication (for the replications missing from _active_tasks). This forced us to bounce CouchDB on that node to restart replication.
Your Environment
- CouchDB version:
"couchdb":"Welcome","version":"3.4.3","git_sha":"f1a47e66","uuid":"67fc0abd32xxx0c38f75cc627b77411d9f","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"],"vendor":{"name":"The Apache Software Foundation"}}
- Operating system and version:
NAME="Ubuntu" VERSION_ID="22.04" VERSION="22.04.5 LTS (Jammy Jellyfish)" VERSION_CODENAME=jammy ID=ubuntu
Additional Context
Note: At Jul 24 14:36:52, a few databases stopped replicating due to missing entries in _active_tasks. At Jul 24 15:46:06, a few more databases stopped replicating due to missing entries in _active_tasks, in addition to those already missing.
The number of failed replications increases the longer we wait before bouncing the node. All failures happen on the same node in the cluster.
Replication jobs will not appear in _active_tasks if they crash too often and are penalized; that endpoint only shows jobs which are actively running. The more back-to-back failed starts a job has, the further it may get backed off, up to 8 hours total.
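A quick way to see how often a given job has been crashing is to read its history from _scheduler/jobs. A sketch follows; the node URL and credentials are placeholders, and the job id is taken from the example output earlier in this report:

```python
# Sketch: inspect one replication job's crash history via _scheduler/jobs.
# NODE and AUTH are placeholders; JOB_ID comes from the example above.
import requests

NODE = "https://couchdb-node.example.com:5984"
AUTH = ("admin", "password")
JOB_ID = "0c59025774c3cc12f61c49f2e2c02c5d+continuous"

job = requests.get(f"{NODE}/_scheduler/jobs/{JOB_ID}", auth=AUTH, timeout=30).json()

crashes = [e for e in job.get("history", []) if e["type"] == "crashed"]
print("crash events kept in history:", len(crashes))
for e in crashes[:5]:
    print(e["timestamp"], e.get("reason"))
```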
Note: we tried updating the failed replication user ID and password with invalid entries
Try deleting the document and re-adding it; otherwise a failing auth will trigger repeated failed restarts, which in turn trigger exponential backoff. If that doesn't work, you can reduce the effective maximum backoff time by lowering the maximum number of history entries; for example, setting [replicator] max_history = 8 instead of the default 20 will limit the maximum penalty period. The history entries can be seen in the _scheduler/jobs endpoint, as you already noticed.
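A hedged sketch of both suggestions, with placeholder host, credentials and document id; the config change uses the standard /_node/{node-name}/_config API (_local addresses the node you are connected to), and whether deleting/re-adding the doc clears this particular stuck state is exactly what this issue is about:

```python
# Sketch: (1) delete and re-create a stuck replication document, and
# (2) lower [replicator] max_history on the affected node.
# NODE, AUTH and DOC_ID are placeholders for this example.
import requests

NODE = "https://couchdb-node.example.com:5984"
AUTH = ("admin", "password")
DOC_ID = "xxxxx_xxxx_replication_300"

# 1) Fetch the replication doc, delete it, then re-create it without the
#    bookkeeping fields CouchDB adds (_rev, _replication_* fields).
doc = requests.get(f"{NODE}/_replicator/{DOC_ID}", auth=AUTH, timeout=30).json()
requests.delete(f"{NODE}/_replicator/{DOC_ID}", params={"rev": doc["_rev"]},
                auth=AUTH, timeout=30).raise_for_status()
clean = {k: v for k, v in doc.items() if not k.startswith("_") or k == "_id"}
requests.put(f"{NODE}/_replicator/{DOC_ID}", json=clean,
             auth=AUTH, timeout=30).raise_for_status()

# 2) Limit the penalty window by lowering max_history from the default 20 to 8.
#    The config API expects the value as a JSON string, hence json="8".
requests.put(f"{NODE}/_node/_local/_config/replicator/max_history",
             json="8", auth=AUTH, timeout=30).raise_for_status()
```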
We recently fixed a changes_reader "stream cleanup" issue in https://github.com/apache/couchdb/pull/5555, but I don't think it's in a release yet. If you have the ability to build CouchDB from source, you could give that a try.