Critical Memory Consumption in Production Database Environment

Open prensgold opened this issue 7 months ago • 7 comments

We encountered a critical issue with pgcluu in our production PostgreSQL Patroni cluster environment. When collecting logs for a support case, pgcluu caused Perl to consume 100% of memory within 8 seconds, resulting in an "Out of Memory" error and impacting our leader database server.

Environment Details:

  • PostgreSQL version: 15.2
  • pgcluu version: 4.0
  • OS: RedHat 8.10
  • Server specs: 12 CPU (on VM), 128GB RAM
  • Database size: 200GB
  • Patroni cluster configuration: 3-node cluster
  • pgcluu collected log volume: 17GB

Steps to Reproduce:

  1. Initiated pgcluu log collection on leader node with command: pgcluu -o /u01/pgcluu /u01/pgcluu_collectd/ -b '2025-03-05 08:30:17' -e '2025-03-08 08:30:17'
  2. Within 8 seconds, memory usage spiked to 100%
  3. System crashed with OOM error
  4. Required emergency switchover to standby node

Additional Information:

  • The same pgcluu configuration worked successfully in our test environment with smaller data volume
  • No memory limits were set for pgcluu process
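Because no memory limit was in place, one generic safeguard is to cap the address space of the report process so that it fails on its own instead of exhausting the host. The sketch below relies on the shell's `ulimit -v`; the helper name `run_capped` and the 8 GB figure are illustrative assumptions, not pgcluu recommendations, and the cap does not make the report itself need less memory.

```shell
#!/bin/sh
# run_capped LIMIT_KB COMMAND [ARGS...]
# Runs COMMAND in a subshell whose virtual memory is capped at LIMIT_KB
# kilobytes. The cap applies only to that subshell and its children.
run_capped() {
    limit_kb=$1
    shift
    (
        ulimit -v "$limit_kb" || exit 125   # the cap could not be applied
        exec "$@"
    )
}

# Example (hypothetical 8 GB cap, paths taken from the report above):
# run_capped 8388608 pgcluu -o /u01/pgcluu /u01/pgcluu_collectd/
```

With a cap like this, a runaway Perl process should abort with its own out-of-memory error while the rest of the server keeps running.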

Questions:

  1. Is this a known issue with large databases?
  2. Are there configuration parameters we should adjust to limit memory consumption?
  3. Are there workarounds recommended for collecting logs on large production databases?
  4. Could this be a bug in the current version?

We're willing to provide additional information or logs as needed. This issue is high priority for us as it impacts our ability to collect necessary logs for support cases.

2025-04-25_16h31_26.png

prensgold avatar Apr 25 '25 13:04 prensgold

@prensgold If you plan to run pgcluu_collectd permanently, especially on a PostgreSQL cluster with high activity, you must use incremental mode with --rotate-daily or, better, --rotate-hourly.

When you use pgcluu to generate the report, use the --retention option to set the number of days to retain. Older statistics will be cleaned up to preserve disk space.
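As a rough illustration of how the two options fit together, the sketch below only prints a collector command and a report command; it does not run them. The helper name, the PID file path, and the 30-day retention value are assumptions, and the directories are the ones from the original report.

```shell
#!/bin/sh
# print_incremental_commands [STATDIR] [REPORTDIR] [RETENTION_DAYS]
# Prints (does not execute) a collector command using hourly rotation
# and a report command using a retention period.
# The PID file path and the 30-day default are assumptions.
print_incremental_commands() {
    statdir=${1:-/u01/pgcluu_collectd}
    reportdir=${2:-/u01/pgcluu}
    retention_days=${3:-30}
    echo "pgcluu_collectd --daemonize --rotate-hourly --pid-file /tmp/pgcluu.pid $statdir"
    echo "pgcluu --retention $retention_days -o $reportdir $statdir"
}

print_incremental_commands
```

Check the printed commands against `pgcluu --help` and `pgcluu_collectd --help` for the installed version before running them.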

darold avatar Apr 26 '25 20:04 darold

@darold When the service is started with the default unit file (pgcluu_collectd.service), doesn't it already rotate daily?

prensgold avatar Apr 27 '25 13:04 prensgold

Yes: ExecStart=/usr/local/bin/pgcluu_collectd --daemonize --rotate-daily --pid-file $PIDFILE $STATDIR

Use --rotate-hourly instead; I will change the default.

darold avatar Apr 28 '25 08:04 darold

Thank you for your response to our issue report.

We have an additional question regarding the rotate-daily parameter. We understand this parameter should be enabled by default in pgcluu, which should split the collected data on a daily basis to manage log volume. However, we couldn't observe this behavior in our log collection directory during our implementation.

Is the rotate-daily functionality working as expected in version 4.0? Should we be looking for the daily log splits in a specific location other than the main log collection directory? We're wondering if perhaps we're checking the wrong location or if this parameter might not be functioning correctly in our environment.

We are sharing screenshots of the pgcluu version and the log collection directory for your reference. This information may help you better understand the issue.

This could potentially explain the memory consumption issue we encountered, if logs weren't being properly rotated during our collection process.

Image

prensgold avatar Apr 28 '25 12:04 prensgold

Hi @prensgold, yes, it is fully working. Can you please run the following commands manually and post the output?

mkdir /tmp/pgcluudata/
/usr/bin/pgcluu_collectd -i 10 --rotate-hourly --pid-file /tmp/pgcluu.pid /tmp/pgcluudata/

Then press Ctrl+C after 20 seconds.

Post any error reported by the command and the output of ls /tmp/pgcluudata/

darold avatar Apr 29 '25 04:04 darold

The command executed without any errors and ran successfully for 120 seconds before I stopped it with Ctrl+C. When I checked the contents of /tmp/pgcluudata/, I could see the hourly rotation working correctly as expected.

Image

However, my original issue is that I don't see this same rotation behavior in our default installation directory. When we run pgcluu with its default settings (which should include daily rotation according to the documentation), we don't observe the logs being split into daily directories.

Could there be a configuration issue with our default setup that's preventing the rotate-daily functionality from working properly? Or perhaps we're looking in the wrong location for the rotated logs in the default installation?
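One quick way to compare the default collection directory with the /tmp/pgcluudata/ test is to count files and subdirectories. The sketch below assumes that rotation shows up as subdirectories under the statistics directory, as described above; the helper name `summarize_statdir` is hypothetical.

```shell
#!/bin/sh
# summarize_statdir DIR
# Prints how many regular files sit directly in DIR and how many
# subdirectories exist at any depth below it.
summarize_statdir() {
    dir=$1
    top_files=$(find "$dir" -maxdepth 1 -type f | wc -l)
    subdirs=$(find "$dir" -mindepth 1 -type d | wc -l)
    echo "top_level_files=$((top_files)) subdirectories=$((subdirs))"
}

# Example: compare the two locations mentioned in this thread.
# summarize_statdir /u01/pgcluu_collectd
# summarize_statdir /tmp/pgcluudata
```

Many top-level files and no subdirectories in the default location, next to a populated tree in the test directory, would suggest that the service is not running with a rotation option.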

This rotation issue might be related to our memory consumption problem, as without proper rotation, pgcluu might be trying to process all 17GB of logs at once instead of working with smaller, rotated segments.

prensgold avatar May 02 '25 12:05 prensgold

When you start pgcluu_collectd using the service file, what does ps auwx | grep pgcluu report? Maybe the --rotate-hourly option is lost for some reason?
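A small helper can make that check less error-prone by classifying each command line by its rotation flag. This is only a sketch; the function name `rotation_mode` is hypothetical, and it looks only for the two flags named in this thread.

```shell
#!/bin/sh
# rotation_mode "COMMAND LINE"
# Prints hourly, daily, or none depending on which rotation flag
# appears as a separate word in the given command line string.
rotation_mode() {
    case " $1 " in
        *" --rotate-hourly "*) echo hourly ;;
        *" --rotate-daily "*)  echo daily ;;
        *)                     echo none ;;
    esac
}

# Example: inspect every running pgcluu_collectd process.
# The [p] in the pattern keeps grep from matching itself.
# ps -eo args | grep '[p]gcluu_collectd' | while IFS= read -r cmdline; do
#     echo "$(rotation_mode "$cmdline"): $cmdline"
# done
```

If the running service prints `none`, the rotation option is not reaching the process, and the unit file or its environment is the next place to look.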

darold avatar May 02 '25 13:05 darold