
Increasing memory usage with debug log level and noisy logs from reduce functions

Open dianabarsan opened this issue 9 months ago • 1 comment

Description

When the log level is set to debug, there is at least one view with a reduce function whose emissions are "lengthy" (many characters, for example UUIDs), and there is a constant inflow of documents, CouchDB gradually increases RAM usage until it runs out of memory.
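For illustration, here is a minimal Python sketch (using the `requests` package) of the kind of view involved: a map function that emits a lengthy, UUID-heavy key together with a built-in `_stats` reduce. The database name, field names, and credentials are assumptions, not taken from the gist.

import requests  # assumes the `requests` package is available

COUCH = "http://admin:password@localhost:5984"  # assumed local instance and credentials
DB = "memtest"                                  # assumed database name

# Hypothetical design document: the map function emits a "lengthy" key made of
# several UUID-sized strings, and the view also has a built-in reduce (_stats),
# matching the combination described in this report.
ddoc = {
    "_id": "_design/memtest",
    "views": {
        "by_uuid": {
            "map": "function (doc) { emit([doc._id, doc.parent_uuid, doc.contact_uuid], 1); }",
            "reduce": "_stats",
        }
    },
}

requests.put(f"{COUCH}/{DB}")  # create the database (returns 412 if it already exists)
requests.put(f"{COUCH}/{DB}/{ddoc['_id']}", json=ddoc).raise_for_status()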

Steps to Reproduce

This can be replicated on the latest version of CouchDB (I just launched a local instance in a Docker container) in single-node mode.

In this gist https://gist.github.com/dianabarsan/408ed8b08bb653a6d16c04ba2126556b, there is a script that:

  • sets log level to debug
  • creates a database with one ddoc with one reduce function
  • pushes 1 million documents that would be indexed
  • triggers view indexes

Running this script against a CouchDB instance in a Docker container (a compose.yml is also provided in the gist) should trigger gradually increasing RAM usage.
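The following is not the gist script itself, only a rough Python sketch of the steps it describes, continuing from the design document sketched in the Description above; all names, endpoints chosen for batching, and credentials are assumptions.

import uuid
import requests  # assumes the `requests` package is available

COUCH = "http://admin:password@localhost:5984"  # assumed local instance and credentials
DB = "memtest"                                  # assumed database name from the sketch above

# 1. Set the runtime log level to debug through the config API.
requests.put(f"{COUCH}/_node/_local/_config/log/level", json="debug").raise_for_status()

# 2. Push ~1 million documents in batches via _bulk_docs; each document carries
#    a few UUID-sized fields so the view emissions are "lengthy".
BATCH, TOTAL = 1000, 1_000_000
for _ in range(TOTAL // BATCH):
    docs = [
        {
            "_id": uuid.uuid4().hex,
            "parent_uuid": uuid.uuid4().hex,
            "contact_uuid": uuid.uuid4().hex,
        }
        for _ in range(BATCH)
    ]
    requests.post(f"{COUCH}/{DB}/_bulk_docs", json={"docs": docs}).raise_for_status()

# 3. Trigger view indexing by querying the view (the request blocks until the index is built).
requests.get(
    f"{COUCH}/{DB}/_design/memtest/_view/by_uuid", params={"limit": 0}
).raise_for_status()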

Expected Behaviour

Memory should be released, even when the log level is debug.

Your Environment

Production server: Docker image of CouchDB v3.3.3, launched in a Kubernetes cluster hosted on AWS.

Local test server: Docker image of CouchDB 3.4.2, launched in a Docker container on Manjaro Linux.

Additional Context

In a three-node CouchDB cluster, we observed one node gradually increasing RAM usage compared to the other two, up to 200 GB vs 3 GB. The k8s node would eventually run out of RAM and be expelled. Our deployment receives a fairly constant flow of new documents.

The only difference between the cluster nodes was that the runaway RAM node had the log level set to debug.

Upon further investigation, it turned out that having reduce functions (in our case _stats) for views that have "lengthy" emissions was triggering the high RAM usage, presumably due to accumulating large debug logs.

After reverting the log level to info, this behavior disappeared and RAM usage became consistent within the CouchDb cluster.
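For reference, the log level can be changed at runtime through the config API. A minimal sketch, assuming a local node and admin credentials; the same setting can also be made persistent in each node's local.ini under the [log] section.

import requests  # assumes the `requests` package is available

COUCH = "http://admin:password@localhost:5984"  # assumed node address and credentials

# Revert the runtime log level to info on this node; in a cluster, repeat per node.
requests.put(f"{COUCH}/_node/_local/_config/log/level", json="info").raise_for_status()

# Confirm the current level.
print(requests.get(f"{COUCH}/_node/_local/_config/log/level").json())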

For my local deployment, I have noticed that RAM does eventually get released gradually AFTER there are no more writes. In our production environment this was never the case.

dianabarsan avatar Mar 12 '25 23:03 dianabarsan

Thanks for your report @dianabarsan. I wonder if it's because the downstream logging service or the pod's standard error logger could not cope with the large amount of logs being generated and they were backing up in memory somehow.

With 3.3.3 vs 3.4.2, it could be an Erlang VM memory issue that got fixed. In general, it's not recommended to run in production with debug logging enabled persistently.

If you can catch it happening again and have remsh access, there are a few helper functions to pinpoint a particular process that may be consuming memory:

couch_debug:resource_hoggers(couch_debug:memory_info(processes()), heap_size).
|                        id                        |     heap_size
|               code_server[<0.51.0>]              |       318187
|             erlang:apply/2[<0.701.0>]            |       75113
...
couch_debug:resource_hoggers(couch_debug:memory_info(processes()), binary).
|                        id                        |       binary
|       supervisor_bridge:user_sup/1[<0.71.0>]     |       123704
|               code_server[<0.51.0>]              |       112364
|   couch_epi_functions_gen_chttpd_auth[<0.154.0>] |       97656
couch_debug:resource_hoggers(couch_debug:memory_info(processes()), memory).
|                        id                        |       memory
|               code_server[<0.51.0>]              |      6665104
|             erlang:apply/2[<0.701.0>]            |      2551960
|             erlang:apply/2[<0.5439.0>]           |      2547712
|                 config[<0.146.0>]                |      1115032
|            couch_log_server[<0.180.0>]           |       431880
couch_debug:resource_hoggers(couch_debug:memory_info(processes()), message_queue_len).
|                        id                        | message_queue_len
|                   init[<0.0.0>]                  |         0
|             erts_code_purger[<0.1.0>]            |         0

So if you manage to run those, we could see where exactly the memory is backing up. I am guessing it's the log writer process...

nickva avatar Mar 21 '25 15:03 nickva