`clients_requesting_changes` gauge not decrementing when clients disconnect in CouchDB 3.5

Open AmitPhulera opened this issue 4 weeks ago • 10 comments

Description

The couchdb_httpd_clients_requesting_changes gauge metric is not properly decremented when clients disconnect from the _changes feed. The metric continues to show the cumulative count of connections rather than the current active connection count.

Interestingly, the abandoned_streaming_requests counter IS being incremented when disconnections are detected, indicating that the disconnect detection logic is working, but the main gauge is not being updated.

Steps to Reproduce

Setup

  1. Start CouchDB 3.5 using Docker:

    docker run -d --name couchdb -p 5984:5984 \
      -e COUCHDB_USER=admin -e COUCHDB_PASSWORD=password \
      apache/couchdb:3.5
    
  2. Create a test database:

    curl -X PUT -u admin:password http://localhost:5984/testdb
    

Reproduce the issue

  1. Check initial stats:

    curl -s -u admin:password "http://localhost:5984/_node/_local/_stats/couchdb/httpd/clients_requesting_changes"
    # Expected: {"value":0,"type":"counter","desc":"number of clients for continuous _changes"}
    
  2. Start 4 continuous changes feed connections (in separate terminals or background processes):

    # Terminal 1
    curl -u admin:password "http://localhost:5984/testdb/_changes?feed=continuous&heartbeat=10000"
    
    # Terminal 2
    curl -u admin:password "http://localhost:5984/testdb/_changes?feed=continuous&heartbeat=10000"
    
    # Terminal 3
    curl -u admin:password "http://localhost:5984/testdb/_changes?feed=continuous&heartbeat=10000"
    
    # Terminal 4
    curl -u admin:password "http://localhost:5984/testdb/_changes?feed=continuous&heartbeat=10000"
    
  3. Verify connections are counted:

    curl -s -u admin:password "http://localhost:5984/_node/_local/_stats/couchdb/httpd/clients_requesting_changes"
    # Expected: {"value":4,...}
    
  4. Stop all 4 curl processes (Ctrl+C on each terminal)

  5. Check stats again:

    curl -s -u admin:password "http://localhost:5984/_node/_local/_stats/couchdb/httpd/clients_requesting_changes"
    # ACTUAL: {"value":4,...} - still shows 4!
    # EXPECTED: {"value":0,...}
    
  6. Check abandoned_streaming_requests:

    curl -s -u admin:password "http://localhost:5984/_node/_local/_stats/couchdb/httpd/abandoned_streaming_requests"
    # Shows: {"value":4,...} - correctly detected 4 abandonments
    
  7. Start 4 new connections again and stop them:

    # Start 4 connections, then stop them with Ctrl+C
    
  8. Check stats:

    curl -s -u admin:password "http://localhost:5984/_node/_local/_stats/couchdb/httpd/clients_requesting_changes"
    # ACTUAL: {"value":8,...} - doubled!
    # EXPECTED: {"value":0,...}
    
    curl -s -u admin:password "http://localhost:5984/_node/_local/_stats/couchdb/httpd/abandoned_streaming_requests"  
    # Shows: {"value":8,...} - correctly shows 8 total abandonments
    
    

The Prometheus endpoint correctly types this as a gauge, but the value doesn't decrease:

curl -s -u admin:password "http://localhost:5984/_node/_local/_prometheus" | grep clients_requesting_changes
# TYPE couchdb_httpd_clients_requesting_changes gauge
# couchdb_httpd_clients_requesting_changes 8
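
The manual steps above could be scripted roughly as follows. This is only a sketch: it assumes the host, credentials and database name from this report, `stat_value`, `read_stat` and `repro` are helper names made up for illustration, and the `SETTLE_SECS` wait is a guess at how long disconnect detection needs before the abandonment is noticed.

```shell
#!/bin/sh
# Sketch only: URL, credentials and database name come from the report above.
COUCH_URL="${COUCH_URL:-http://localhost:5984}"
COUCH_AUTH="${COUCH_AUTH:-admin:password}"

# Extract the numeric "value" field from a _stats JSON document on stdin.
stat_value() {
    sed -n 's/.*"value":[[:space:]]*\(-\{0,1\}[0-9][0-9]*\).*/\1/p'
}

# Read one httpd stat, e.g. clients_requesting_changes.
read_stat() {
    curl -s -u "$COUCH_AUTH" \
        "$COUCH_URL/_node/_local/_stats/couchdb/httpd/$1" | stat_value
}

# Open N continuous feeds, sample the gauge, kill the clients, sample again.
repro() {
    n="${1:-4}"; db="${2:-testdb}"; pids=""
    i=0
    while [ "$i" -lt "$n" ]; do
        curl -s -u "$COUCH_AUTH" \
            "$COUCH_URL/$db/_changes?feed=continuous&heartbeat=10000" >/dev/null &
        pids="$pids $!"
        i=$((i + 1))
    done
    sleep 2
    echo "while connected:  $(read_stat clients_requesting_changes)"
    # $pids is left unquoted on purpose so each PID becomes its own argument.
    kill $pids
    # Assumed settle time; disconnect detection runs periodically.
    sleep "${SETTLE_SECS:-60}"
    echo "after disconnect: $(read_stat clients_requesting_changes)"
    echo "abandoned:        $(read_stat abandoned_streaming_requests)"
}
```

Calling `repro 4 testdb` against the Docker container from the setup section should show whether the gauge returns to its starting value once the clients are gone.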

Expected Behavior

The clients_requesting_changes metric should:

  1. Increment when a client starts a continuous/longpoll _changes feed
  2. Decrement when the client disconnects (either normally or via TCP close)
  3. Always reflect the current number of active changes feed connections

Actual Behavior

  • The metric increments correctly when connections start
  • The metric does NOT decrement when connections end
  • The abandoned_streaming_requests counter correctly tracks disconnections
  • This causes the gauge to act like a monotonic counter, accumulating over time

Your Environment

  • CouchDB version used: 3.5.1
  • Browser name and version: Brave 1.84.139
  • Operating system and version: macOS (Darwin 24.6.0) / Docker
  • Installation method: Docker (apache/couchdb:3.5 image)

Additional Context

This issue was discovered while upgrading from CouchDB 3.3.1 to 3.5.1. The metric worked correctly in 3.3.1.

The issue affects monitoring and alerting systems that rely on this metric to track the number of active changes feed consumers.

AmitPhulera avatar Dec 02 '25 15:12 AmitPhulera

I can repro this with Docker (specifically CouchDB inside Docker and requests from outside). If I run dev/run and make client requests from the same machine, the value is decremented. The abandoned_streaming_requests counter is bumped by the recent-ish chttpd_util functions that detect disconnected clients.

rnewson avatar Dec 02 '25 15:12 rnewson

Try tweaking these config values:

[chttpd]
disconnect_check_msec = 30000
disconnect_check_jitter_msec = 15000

For instance, set disconnect_check_msec = 300000 and see if it makes a difference. This could mean that previously these client requests were left lingering in a half-closed state for a while and the backend kept processing changes.
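
For a container where editing the ini file is awkward, the same values could probably be changed through the node-local config HTTP API instead. A minimal sketch, assuming the credentials from the report and assuming the setting is picked up without a restart; `chttpd_config_url` and `set_chttpd_option` are helper names made up here:

```shell
#!/bin/sh
# Sketch only: URL and credentials are the ones used in the report.
COUCH_URL="${COUCH_URL:-http://localhost:5984}"
COUCH_AUTH="${COUCH_AUTH:-admin:password}"

# Build the node-local config URL for a key in the [chttpd] section.
chttpd_config_url() {
    printf '%s/_node/_local/_config/chttpd/%s\n' "$COUCH_URL" "$1"
}

# Set a [chttpd] option at runtime; config values are sent as JSON strings.
set_chttpd_option() {
    curl -s -X PUT -u "$COUCH_AUTH" "$(chttpd_config_url "$1")" -d "\"$2\""
}
```

For example, `set_chttpd_option disconnect_check_msec 300000` would apply the value suggested above.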

nickva avatar Dec 02 '25 16:12 nickva

(From Slack chat.) We believe this is a bug. The client disconnects impolitely. CouchDB eventually detects this, closes its side and kills off any worker processes (and increments the abandoned metric). However, the only place in the code that would decrement the requesting_changes metric will not run, as the process is killed before that line is executed (it's in a try/after, but that only works if the process gets that far).

We should rejig this so that we decrement the requesting_changes counter in the disconnect detection functions in chttpd_util and remove the 'after' clause to avoid double decrementing. It's probably a bit trickier than this to get perfect (e.g., the case where the client disconnects cleanly; a fabric:end_changes() call happens, or a timeout for those without heartbeat=X).

rnewson avatar Dec 02 '25 18:12 rnewson

I played a little bit in stop_client_process_if_disconnected, but I'm not sure if it's the right place ...

stop_client_process_if_disconnected(Pid, ClientReq) ->
    case mochiweb_request:is_closed(ClientReq) of
        true ->
            exit(Pid, {shutdown, client_disconnected}),
            couch_stats:increment_counter([couchdb, httpd, abandoned_streaming_requests]),
+           couch_stats:decrement_counter([couchdb, httpd, clients_requesting_changes]),
            ok;
        false ->
            ok;
        undefined ->
            % Treat unsupported OS-es (ex. Windows) as `not closed`
            % so we default to the previous behavior.
            ok
    end.

big-r81 avatar Dec 02 '25 20:12 big-r81

@big-r81 thanks! I think we'd only want to decrement it if the request was a changes request, otherwise we'd end up with a negative value soon enough if we have other streaming requests (_views, _all_docs, etc.).

nickva avatar Dec 02 '25 20:12 nickva

and also only if we didn't decrement it in the 'after' clause for the same reason

rnewson avatar Dec 02 '25 21:12 rnewson

we might have to flip it around so we can easily find the live processes handling _changes responses (to count them) rather than an inc/dec approach.

rnewson avatar Dec 02 '25 21:12 rnewson

e.g., have some process monitor them all and we ask for the monitor count for _stats. when the processes exit (for any reason) the count will go down.

rnewson avatar Dec 02 '25 21:12 rnewson

or start all of the changes coordinators under a new simple_one_for_one supervisor and just count the children for the metric

rnewson avatar Dec 02 '25 21:12 rnewson

Yeah something like a periodically updated metrics gauge of streaming connection counts. We have a dedicated monitor process for each one of those...

Another idea is to abandon the inc/dec pattern and make them all increment so the current number of clients = clients_connected - (clients_successfully_disconnected + clients_abandoned). The metrics caller would have to do that math.
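
On the caller side that math would be a one-liner. A rough sketch, with the caveat that the three counters are hypothetical (none of them exist in CouchDB today) and `active_changes_clients` is just an illustrative helper name:

```shell
#!/bin/sh
# Hypothetical counters from the idea above, passed in as plain numbers:
#   $1 = clients_connected
#   $2 = clients_successfully_disconnected
#   $3 = clients_abandoned
active_changes_clients() {
    connected="$1"; disconnected="$2"; abandoned="$3"
    echo $(( connected - (disconnected + abandoned) ))
}
```

With 8 connects, 3 clean disconnects and 4 abandonments, `active_changes_clients 8 3 4` prints 1. Since all three inputs only ever increase, a killed process can no longer leave the derived value stuck too high, provided the abandonment itself is counted.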

nickva avatar Dec 03 '25 00:12 nickva