Critical Memory Consumption in Production Database Environment
We encountered a critical issue with pgcluu in our production PostgreSQL Patroni cluster environment. When collecting logs for a support case, pgcluu caused Perl to consume 100% of memory within 8 seconds, resulting in an "Out of Memory" error and impacting our leader database server.
Environment Details:
- PostgreSQL version: 15.2
- pgcluu version: 4.0
- OS: RedHat 8.10
- Server specs: 12 CPU (on VM), 128GB RAM
- Database size: 200GB
- Patroni cluster configuration: 3-node cluster
- pgcluu collected log volume: 17GB
Steps to Reproduce:
- Initiated pgcluu log collection on leader node with command: pgcluu -o /u01/pgcluu /u01/pgcluu_collectd/ -b '2025-03-05 08:30:17' -e '2025-03-08 08:30:17'
- Within 8 seconds, memory usage spiked to 100%
- System crashed with OOM error
- Required emergency switchover to standby node
Additional Information:
- The same pgcluu configuration worked successfully in our test environment with smaller data volume
- No memory limits were set for pgcluu process
Questions:
- Is this a known issue with large databases?
- Are there configuration parameters we should adjust to limit memory consumption?
- Are there workarounds recommended for collecting logs on large production databases?
- Could this be a bug in the current version?
We're willing to provide additional information or logs as needed. This issue is high priority for us as it impacts our ability to collect necessary logs for support cases.
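On the question of limiting memory consumption: one generic safeguard, independent of pgcluu itself, is to run the report generation under a shell resource limit so that a runaway process fails on allocation instead of exhausting the host. The following is a minimal sketch, not a pgcluu feature. `run_capped` is a hypothetical helper name, and the 8 GiB figure in the example is an arbitrary assumption that would need tuning.

```shell
#!/usr/bin/env bash
# Run a command with a cap on its virtual memory (in KiB), so a runaway
# process hits an allocation failure instead of exhausting the host.
# run_capped is a hypothetical helper, not part of pgcluu.
run_capped() {
    local limit_kb="$1"
    shift
    # The subshell keeps the ulimit from affecting the calling shell.
    (
        ulimit -v "$limit_kb" || exit 125
        "$@"
    )
}

# Example (not executed here): cap report generation at about 8 GiB,
# reusing the command from the steps to reproduce.
# run_capped $((8 * 1024 * 1024)) \
#     pgcluu -o /u01/pgcluu /u01/pgcluu_collectd/ \
#     -b '2025-03-05 08:30:17' -e '2025-03-08 08:30:17'
```

With such a cap the report would most likely still fail on 17GB of unrotated data, but the failure would be confined to the pgcluu process rather than the leader node.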
@prensgold when you plan to run pgcluu_collectd permanently, especially on a PostgreSQL cluster with high activity, you must use incremental mode with --rotate-daily or, better, --rotate-hourly.
When you use pgcluu to generate the report, use the --retention option to set the number of days to retain. Older stats will be cleaned up to preserve disk space.
@darold Doesn't the default service unit (pgcluu_collectd.service) already rotate daily when the service is started?
Yes: ExecStart=/usr/local/bin/pgcluu_collectd --daemonize --rotate-daily --pid-file $PIDFILE $STATDIR
Use --rotate-hourly instead, I will change the default.
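Until the packaged default changes, one way to switch an existing installation to hourly rotation is a systemd drop-in that replaces ExecStart. This is a sketch under assumptions: the unit is named pgcluu_collectd.service, the binary path matches the ExecStart line quoted above, and $PIDFILE and $STATDIR are already defined by the unit's environment. `print_override` is a hypothetical helper name.

```shell
# Print a systemd drop-in that replaces the unit's ExecStart.
# print_override is a hypothetical helper name.
print_override() {
    cat <<'EOF'
[Service]
# An empty ExecStart= clears the value inherited from the packaged unit.
ExecStart=
ExecStart=/usr/local/bin/pgcluu_collectd --daemonize --rotate-hourly --pid-file $PIDFILE $STATDIR
EOF
}

# To apply (as root; not executed here):
# mkdir -p /etc/systemd/system/pgcluu_collectd.service.d
# print_override > /etc/systemd/system/pgcluu_collectd.service.d/override.conf
# systemctl daemon-reload && systemctl restart pgcluu_collectd
```

A drop-in is used here rather than editing the packaged unit so that the change survives package upgrades.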
Thank you for your response to our issue report.
We have an additional question regarding the rotate-daily parameter. We understand this parameter should be enabled by default in pgcluu, which should split the collected data on a daily basis to manage log volume. However, we couldn't observe this behavior in our log collection directory during our implementation.
Is the rotate-daily functionality working as expected in version 4.0? Should we be looking for the daily log splits in a specific location other than the main log collection directory? We're wondering if perhaps we're checking the wrong location or if this parameter might not be functioning correctly in our environment.
We are sharing screenshots of the pgcluu version and the log collection directory for your reference. This information may help you better understand the issue.
This could potentially explain the memory consumption issue we encountered, if logs weren't being properly rotated during our collection process.
Hi @prensgold, yes, it is fully working. Can you please run the following commands manually and post the output?
mkdir /tmp/pgcluudata/
/usr/bin/pgcluu_collectd -i 10 --rotate-hourly --pid-file /tmp/pgluu.pid /tmp/pgcluudata/
Then press Ctrl+C after 20 seconds.
Post any error reported by the command, along with the output of ls /tmp/pgcluudata/
The command executed without any errors and ran successfully for 120 seconds before I stopped it with Ctrl+C. When I checked the contents of /tmp/pgcluudata/, I could see the hourly rotation working correctly as expected.
However, my original issue is that I don't see this same rotation behavior in our default installation directory. When we run pgcluu with its default settings (which should include daily rotation according to the documentation), we don't observe the logs being split into daily directories.
Could there be a configuration issue with our default setup that's preventing the rotate-daily functionality from working properly? Or perhaps we're looking in the wrong location for the rotated logs in the default installation?
This rotation issue might be related to our memory consumption problem, as without proper rotation, pgcluu might be trying to process all 17GB of logs at once instead of working with smaller, rotated segments.
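To compare the default collection directory with the manual test run, it can help to list the subdirectories each one contains. The sketch below only lists directories; it does not assume a particular rotation layout, since the exact directory naming is not stated in this thread. `list_rotation_dirs` is a hypothetical helper name.

```shell
# List subdirectories under a collection directory, up to four levels deep,
# so that rotation directories (if any) become visible.
# list_rotation_dirs is a hypothetical helper name.
list_rotation_dirs() {
    local statdir="$1"
    find "$statdir" -mindepth 1 -maxdepth 4 -type d | sort
}

# Example (not executed here), comparing the two directories from this thread:
# list_rotation_dirs /tmp/pgcluudata/
# list_rotation_dirs /u01/pgcluu_collectd/
# du -sh /u01/pgcluu_collectd/
```

If the test directory shows nested subdirectories and the default directory shows none, the collector behind the default directory is probably running without a rotation option.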
When you start pgcluu_collectd using the service file, what does ps auwx | grep pgcluu report? Maybe the --rotate-hourly option is lost for some reason?
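The check above can be scripted so the answer is unambiguous. This is a minimal sketch: `has_rotation_flag` is a hypothetical helper that only looks for the two rotation options named in this thread, and the commented example assumes `pgrep` from procps-ng is available and that the running process still shows its original command line.

```shell
# Return 0 if a command line contains one of the rotation options
# mentioned in this thread. has_rotation_flag is a hypothetical helper.
has_rotation_flag() {
    case " $1 " in
        *" --rotate-daily "*|*" --rotate-hourly "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not executed here): inspect every running pgcluu_collectd.
# pgrep -af pgcluu_collectd | while read -r pid cmdline; do
#     if has_rotation_flag "$cmdline"; then
#         echo "PID $pid: rotation option present"
#     else
#         echo "PID $pid: no rotation option found"
#     fi
# done
# systemctl cat pgcluu_collectd.service   # shows the unit file and any drop-ins
```

Matching on whole, space-delimited options means a misspelled flag is reported as missing rather than silently accepted.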