patroni icon indicating copy to clipboard operation
patroni copied to clipboard

Generated pg_basebackup command should be logged

Open caseyandgina opened this issue 1 year ago • 6 comments

What happened?

When Patroni initializes a node, the pg_basebackup command it generates is not logged, and it's challenging to try to reproduce from the configuration. I ran into an issue recently where there was a lingering hidden file in a custom WAL directory when I was trying to reinitialize, and it took me hours to figure out why. This is not the first time I've run into this - it's tripped me up at least a half dozen times over the past few years. It should not be this hard to determine what is going wrong, and logging the generated command would allow me to quickly copy/paste and try to run it in the foreground. Perhaps including logging of the pg_basebackup error output would be in order as well, at least when log level is set to DEBUG.

How can we reproduce it (as minimally and precisely as possible)?

Reinitialize a node and watch the Patroni log, at either INFO or DEBUG log level.

What did you expect to happen?

Something more than this:

ERROR: Error when fetching backup: pg_basebackup exited with code=1 ERROR: failed to bootstrap from leader '...'

Patroni/PostgreSQL/DCS version

  • Patroni version: 3.2.1
  • PostgreSQL version: 14.10
  • DCS (and its version): Not relevant

Patroni configuration file

---
scope: 14-truth
namespace: /db/
name: prime-br2-pg-truth-0

bootstrap:
  dcs:
    loop_wait: 3
    maximum_lag_on_failover: 1048576
    retry_timeout: 10
    ttl: 30
    postgresql:
      parameters:
        archive_command: envdir /etc/wal-g/14-truth /usr/local/bin/wal-g wal-push "%p"
        archive_mode: true
        archive_timeout: 5min
        autovacuum_vacuum_cost_delay: 10ms
        autovacuum_vacuum_scale_factor: 0.2
        autovacuum_vacuum_threshold: 10000
        checkpoint_completion_target: 0.9
        checkpoint_timeout: 30min
        cluster_name: truth
        default_statistics_target: 100
        effective_cache_size: 252GB
        effective_io_concurrency: 200
        hot_standby_feedback: true
        huge_pages: try
        idle_in_transaction_session_timeout: 15min
        log_autovacuum_min_duration: 0
        log_checkpoints: true
        log_connections: true
        log_destination: stderr
        log_directory: /var/log/postgresql/14/truth
        log_disconnections: true
        log_duration: false
        log_error_verbosity: DEFAULT
        log_file_mode: '0640'
        log_filename: '%Y-%m-%d_%H.log'
        log_line_prefix: '%m [%p]: %qdb=%d,user=%u,app=%a,client=%h %v '
        log_lock_waits: true
        log_min_duration_statement: 250ms
        log_min_messages: WARNING
        log_recovery_conflict_waits: true
        log_replication_commands: false
        log_rotation_age: 1h
        log_rotation_size: 0
        log_statement: none
        log_temp_files: 0
        log_timezone: Etc/UTC
        logging_collector: true
        maintenance_work_mem: 2GB
        max_connections: 2500
        max_logical_replication_workers: 24
        max_parallel_maintenance_workers: 4
        max_parallel_workers: 64
        max_parallel_workers_per_gather: 4
        max_replication_slots: 64
        max_stack_depth: 7MB
        max_sync_workers_per_subscription: 24
        max_wal_senders: 16
        max_wal_size: 120GB
        min_wal_size: 70GB
        max_worker_processes: 64
        pg_stat_statements.track: all
        random_page_cost: 1.1
        restore_command: envdir /etc/wal-g/14-truth /usr/local/bin/wal-g wal-fetch "%f" "%p"
        shared_buffers: 126GB
        shared_preload_libraries: pg_stat_statements,pg_cron
        stats_temp_directory: /run/postgresql/14-truth.pg_stat_tmp
        superuser_reserved_connections: 30
        temp_file_limit: 4GB
        timezone: Etc/UTC
        track_activities: true
        track_activity_query_size: 4096
        track_functions: all
        track_io_timing: true
        unix_socket_directories: /run/postgresql
        wal_buffers: 64MB
        wal_keep_size: 60GB
        wal_level: logical
        work_mem: 64MB
      pg_hba:
        REDACTED
      pg_ident:
        REDACTED
  initdb:
    - allow-group-access
    - data-checksums
    - encoding: UTF8
    - locale: en_US.UTF8
    - waldir: /var/lib/postgresql/wal/14/truth
  users:
    REDACTED

etcd3:
  hosts:
    - prime-br2-etcd-0:2379
    - prime-br2-etcd-1:2379
    - prime-br2-etcd-2:2379

log:
  level: INFO
  dir: /var/log/patroni/14-truth
  file_num: 4
  file_size: 16777216

postgresql:
  authentication:
    replication:
      username: replication
      password: 'REDACTED'
    rewind:
      username: postgres
      password: 'REDACTED'
    superuser:
      username: postgres
      password: 'REDACTED'
  basebackup:
    - checkpoint: fast
    - waldir: /var/lib/postgresql/wal/14/truth
  bin_dir: /usr/lib/postgresql/14/bin
  connect_address: prime-br2-pg-truth-0:5432
  create_replica_methods:
    - basebackup
  data_dir: /var/lib/postgresql/14/truth
  listen: '*:5432'
  pgpass: /var/lib/postgresql/.pgpass.14-truth
  use_pg_rewind: true
  use_unix_socket: true

restapi:
  listen: '*:8008'
  connect_address: prime-br2-pg-truth-0:8008

patronictl show-config

Not relevant

Patroni log files

ERROR: Error when fetching backup: pg_basebackup exited with code=1
ERROR: failed to bootstrap from leader '...'

PostgreSQL log files

N/A, it never gets that far.

Have you tried to use GitHub issue search?

  • [X] Yes

Anything else we need to know?

No response

caseyandgina avatar Jan 22 '24 17:01 caseyandgina

When Patroni initializes a node, the pg_basebackup command it generates is not logged, and it's challenging to try to reproduce from the configuration

Well, just logging a command to be executed isn't a problem, but it will not be a simple copy&paste for you, mainly because the password is never supplied directly.

I ran into an issue recently where there was a lingering hidden file in a custom WAL directory when I was trying to reinitialize, and it took me hours to figure out why. This is not the first time I've run into this - it's tripped me up at least a half dozen times over the past few years. It should not be this hard to determine what is going wrong

This is more or less related to your logging setup. I bet you configured Patroni to write logs directly to a files. In this case logs of any other tool or script still go to the stderr and end up somewhere else (syslog?). And you haven't checked it this place.

Perhaps including logging of the pg_basebackup error output would be in order as well, at least when log level is set to DEBUG.

Regarding capturing pg_basebackup logs - I am strongly against it, because for example it will break the --verbose mode.

CyberDem0n avatar Jan 23 '24 07:01 CyberDem0n

I have no desire for the password to be logged, and would hope that Patroni would not put it into the call itself as then it could be exposed via a process list. Does Patroni set the PGPASSWORD environment variable for this? It's complicated to try to piece together the full pg_basebackup command otherwise though, from some parameters that are explicitly defined, others which are defaulted, and at least the hostname being based on the current cluster state. I'd love a command to copy/paste and try running as the postgres system user, which would prompt for the password unless I first set PGPASSWORD.

You are correct in that I've configured Patroni to log to discrete files. I did not think to check syslog, but looking back now, I do not see anything related logged there at all. As far as the syslog configuration, I'm using a default Ubuntu 20.04 configuration.

Lastly, part of the problem is that there's a bug in the Debian packaging wrapper that is used when running pg_basebackup without an explicit path to the binary directory for the specific PostgreSQL version. If that path is included with the logged command executed, that would solve that confusion as well. I'll also get a bug report filed with the Debian package maintainers about that. FYI, the wrapper prints out an inaccurate and misleading error:

$ pg_basebackup --version
Error: /var/lib/postgresql/14/main is not accessible; please fix the directory permissions (/var/lib/postgresql/14/ should be world readable)

The actual binary from the Postgres packages does not throw that error for --version, and furthermore does not state that the parent directory needs to be world-readable, which is not actually true.

caseyandgina avatar Jan 23 '24 17:01 caseyandgina

I do not see anything related logged there at all.

That's very suspicious. Whatever pg_basebackup writes to stdout/stderr should end up in one or another file.

Lastly, part of the problem is that there's a bug in the Debian packaging wrapper that is used when running pg_basebackup without an explicit path to the binary directory for the specific PostgreSQL version. If that path is included with the logged command executed

It's very arguable about bug in Debian wrapper scripts. It seems that you neither set postgresql.bin_dir nor made sure that postgres binaries are available in $PATH [1]. Such Debain wrapper scripts are available only for client binaries like psql and pg_basebackup, and not for postgres, pg_ctl, pg_rewind and so on. Even if pg_basebackup succeed everything would fail because Patroni can't find the postgres binary.

If that path is included with the logged command executed, that would solve that confusion as well

That's the thing. If path isn't specified in the Patroni config, the binary is executed from the default $PATH. What makes things even worse - you didn't show the entire Patroni configuration file. Lack of postgresql.bin_dir is an error that is very easy to spot.

[1] https://patroni.readthedocs.io/en/latest/yaml_configuration.html#postgresql

bin_dir: (optional) Path to PostgreSQL binaries (pg_ctl, initdb, pg_controldata, pg_basebackup, postgres, pg_isready, pg_rewind). If not provided or is an empty string, PATH environment variable will be used to find the executables.

CyberDem0n avatar Jan 24 '24 07:01 CyberDem0n

I do set postgresql.bindir, but I did not think to include that when attempting to replicate manually at first. Another good reason for having the entire generated command logged. I can update this ticket tomorrow with my entire configuration, sans passwords.

caseyandgina avatar Jan 24 '24 08:01 caseyandgina

@caseyandgina I don't mind adding a debug log for that. Please open a PR.

CyberDem0n avatar Jan 26 '24 15:01 CyberDem0n

@CyberDem0n I updated the ticket to contain a mostly-complete Patroni YML, with sensitive information redacted. I'm not a developer and am not sure how to make a PR, unfortunately.

caseyandgina avatar Jan 26 '24 21:01 caseyandgina