postgres-operator icon indicating copy to clipboard operation
postgres-operator copied to clipboard

PGUpgrade failing

Open todeb opened this issue 1 year ago • 6 comments

Please ensure you do the following when reporting a bug:

  • [ ] Provide a concise description of what the bug is.
  • [ ] Provide information about your environment.
  • [ ] Provide clear steps to reproduce the bug.
  • [ ] Attach applicable logs. Please do not attach screenshots showing logs unless you are unable to copy and paste the log data.
  • [ ] Ensure any code / output examples are properly formatted for legibility.

Note that some logs needed to troubleshoot may be found in the /pgdata/<CLUSTERNAME>/pg_log directory on your Postgres instance.

An incomplete bug report can lead to delays in resolving the issue or the closing of a ticket, so please be as detailed as possible.

If you are looking for general support, please view the support page for where you can ask questions.

Thanks for reporting the issue, we're looking forward to helping you!

Overview

Pgupgrade fail when upgrading from postgres 13 to 16. Images used: crunchy-postgres:ubi8-13.8-1 crunchy-postgres:ubi8-16.3-1

Environment

Please provide the following details:

  • Platform: (Kubernetes, OpenShift, Rancher, GKE, EKS, AKS etc.) Kubernetes
  • Platform Version: (e.g. 1.20.3, 4.7.0) v1.25.16
  • PGO Image Tag: (e.g. ubi8-5.x.y-0) ubi8-5.6.0-0
  • Postgres Version (e.g. 15) 13 -> 16
  • Storage: (e.g. hostpath, nfs, or the name of your storage class) nfs

Steps to Reproduce

  1. create a cluster with: crunchy-postgres:ubi8-13.8-1
  2. follow doc for upgrading: https://access.crunchydata.com/documentation/postgres-operator/latest/guides/major-postgres-version-upgrade
  3. pgupgrade failed log:
Performing PostgreSQL upgrade from version "13" to "16" ...

Step 1: Making new pgdata directory...

Step 2: Initializing new pgdata directory...

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are enabled.

fixing permissions on existing directory /pgdata/pg16 ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
sh: line 1:    29 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    31 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=50 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    33 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=40 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    35 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=30 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    37 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
selecting default max_connections ... 20
sh: line 1:    39 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=16384 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    41 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=8192 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    43 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=4096 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    45 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=3584 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    47 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=3072 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    49 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=2560 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    51 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=2048 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    53 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=1536 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    55 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=1000 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    57 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=900 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    59 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=800 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    61 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=700 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    63 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=600 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    65 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=500 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    67 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=400 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    69 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=300 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    71 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=200 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    73 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=100 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
sh: line 1:    75 Bus error               (core dumped) "/usr/pgsql-16/bin/postgres" --check -F -c log_checkpoints=false -c max_connections=20 -c shared_buffers=50 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
selecting default shared_buffers ... 400kB
selecting default time zone ... UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok


Success. You can now start the database server using:

    /usr/pgsql-16/bin/pg_ctl -D /pgdata/pg16 -l logfile start

initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.

Step 3: Setting the expected permissions on the old pgdata directory...

Step 4: Copying shared_preload_libraries setting to new postgresql.conf file...

upgrade: line 18: /usr/pgsql-13/bin/postgres: No such file or directory
Step 5: Running pg_upgrade check...


check for "/usr/pgsql-13/bin" failed: No such file or directory
Failure, exiting

real    0m0.039s
user    0m0.004s
sys     0m0.008s

New postgres 16 cannot be started log:

k logs pgsql-ngt-test-pgha1-54bb-0 -n database-services
Defaulted container "database" out of: database, replication-cert-copy, postgres-startup (init), nss-wrapper-init (init)
2024-09-19 10:52:08,477 INFO: No PostgreSQL configuration items changed, nothing to reload.
2024-09-19 10:52:08,629 WARNING: Postgresql is not running.
2024-09-19 10:52:08,629 INFO: Lock owner: None; I am pgsql-ngt-test-pgha1-54bb-0
2024-09-19 10:52:08,637 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071
  Database system identifier: 7416304783840694348
  Database cluster state: shut down
  pg_control last modified: Thu Sep 19 10:50:25 2024
  Latest checkpoint location: 0/186DFB0
  Latest checkpoint's REDO location: 0/186DFB0
  Latest checkpoint's REDO WAL file: 000000010000000000000001
  Latest checkpoint's TimeLineID: 1
  Latest checkpoint's PrevTimeLineID: 1
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:730
  Latest checkpoint's NextOID: 13545
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 722
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Thu Sep 19 10:50:25 2024
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: off
  max_connections setting: 20
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: 2a28ce363859f2cc621fe10e6b20e7038c6578ba4329e42f3d081523c1b6cd12

2024-09-19 10:52:08,640 INFO: Lock owner: None; I am pgsql-ngt-test-pgha1-54bb-0
2024-09-19 10:52:08,803 INFO: starting as a secondary
2024-09-19 10:52:09,889 INFO: postmaster pid=105
/tmp/postgres:5432 - no response
2024-09-19 10:52:09.920 UTC [105] LOG:  redirecting log output to logging collector process
2024-09-19 10:52:09.920 UTC [105] HINT:  Future log output will appear in directory "log".
2024-09-19 10:52:10,906 ERROR: postmaster is not running
2024-09-19 10:52:18,514 INFO: Lock owner: None; I am pgsql-ngt-test-pgha1-54bb-0
2024-09-19 10:52:18,641 INFO: failed to start postgres
2024-09-19 10:52:28,515 WARNING: Postgresql is not running.
2024-09-19 10:52:28,515 INFO: Lock owner: None; I am pgsql-ngt-test-pgha1-54bb-0
2024-09-19 10:52:28,525 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071
  Database system identifier: 7416304783840694348
  Database cluster state: shut down
  pg_control last modified: Thu Sep 19 10:50:25 2024
  Latest checkpoint location: 0/186DFB0
  Latest checkpoint's REDO location: 0/186DFB0
  Latest checkpoint's REDO WAL file: 000000010000000000000001
  Latest checkpoint's TimeLineID: 1
  Latest checkpoint's PrevTimeLineID: 1
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:730
  Latest checkpoint's NextOID: 13545
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 722
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Thu Sep 19 10:50:25 2024
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: off
  max_connections setting: 20
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: 2a28ce363859f2cc621fe10e6b20e7038c6578ba4329e42f3d081523c1b6cd12

2024-09-19 10:52:28,526 INFO: Lock owner: None; I am pgsql-ngt-test-pgha1-54bb-0
2024-09-19 10:52:28,527 INFO: starting as a secondary
2024-09-19 10:52:29,565 INFO: postmaster pid=176
2024-09-19 10:52:29.573 UTC [176] LOG:  redirecting log output to logging collector process
2024-09-19 10:52:29.573 UTC [176] HINT:  Future log output will appear in directory "log".
/tmp/postgres:5432 - no response
2024-09-19 10:52:30,585 ERROR: postmaster is not running
2024-09-19 10:52:38,514 INFO: Lock owner: None; I am pgsql-ngt-test-pgha1-54bb-0
2024-09-19 10:52:38,515 INFO: failed to start postgres

When running with PG13 image it also refuses now to start, but was healthy before upgrade:

Defaulted container "database" out of: database, replication-cert-copy, postgres-startup (init), nss-wrapper-init (init)
2024-09-19 10:54:23,103 INFO: No PostgreSQL configuration items changed, nothing to reload.
2024-09-19 10:54:23,118 CRITICAL: system ID mismatch, node pgsql-ngt-test-pgha1-54bb-0 belongs to a different cluster: 7416304783840694348 != 7174435655098040412

REPRO

above

EXPECTED

pgupgrade complete without issues

ACTUAL

pgupgrade failed, and postgres cluster not starting at all with new version and with the old one.

Logs

Above

Additional Information

na

todeb avatar Sep 19 '24 11:09 todeb

for some additional info database with postgresql13 was initialized on host without hugepages, now it is running on host with hugepages enabled. When i scheduled the pgupgrade on host without hugepages there are no bus errors anymore although still upgrade fail.

Performing PostgreSQL upgrade from version "13" to "16" ...

Step 1: Making new pgdata directory...

Step 2: Initializing new pgdata directory...

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are enabled.

fixing permissions on existing directory /pgdata/pg16 ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
syncing data to disk ... ok


Success. You can now start the database server using:

    /usr/pgsql-16/bin/pg_ctl -D /pgdata/pg16 -l logfile start


Step 3: Setting the expected permissions on the old pgdata directory...

Step 4: Copying shared_preload_libraries setting to new postgresql.conf file...

upgrade: line 18: /usr/pgsql-13/bin/postgres: No such file or directory
Step 5: Running pg_upgrade check...


check for "/usr/pgsql-13/bin" failed: No such file or directory
Failure, exiting

real    0m0.070s
user    0m0.000s
sys     0m0.009s

todeb avatar Sep 19 '24 13:09 todeb

even running clean cluster on host without hugepages and trying pgupgrade on it also running pgupgrade on non hugapes host. It just fail. Same log as above.

todeb avatar Sep 19 '24 13:09 todeb

i was able to perform uprade using crunchy-upgrade:ubi8-5.5.0-0 but on 5.6.0 and 5.6.1 upgrade fails.

So I think it is still something that should be addressed. Also these bus errors when running upgrade on host with hugepages.

todeb avatar Sep 19 '24 13:09 todeb

did you tried a step by step upgrade? 13 > 14 > 15 > 16

I just tried to upgrade 15 to 16, this worked fine with PGO 5.6.1

MSandro avatar Oct 09 '24 05:10 MSandro

I tried to 15 and it failed. Pgsql 14 is not avaiable any more in repo. As i do not have the image for it i cannot test upgrade. 15->16 for sure will work. So that is not really helpful answer for that case.

fara-tode avatar Oct 09 '24 07:10 fara-tode

@todeb Sorry to hear you hit this issue. From the logs you provided, I think the reason you are having this issue is due to the 5.6.0 upgrade image not having the required Postgres 13 binary whereas the 5.5.0 image does includes it. You can read more about this in our developer program container versions documentation. For your current scenario, I would therefore recommend using the 5.5.0 image for this upgrade, then moving to the newer images in the future. I hope this helps!

tjmoore4 avatar Oct 18 '24 20:10 tjmoore4

It is what i did. So if that is not issue but expected behavior that you cannot upgrade from older postgresql version having newer pgo and you need downgrade pgo first you can then close it.

todeb avatar Oct 22 '24 07:10 todeb