postgresql_cluster icon indicating copy to clipboard operation
postgresql_cluster copied to clipboard

few thoughts about upgrade role

Open glushakov opened this issue 1 year ago • 2 comments

Hi. Thanks for your work. I based your role on a major upgrade of postgres and would like to share some of the cases I encountered.

  1. Patroni precheck https://github.com/vitabaks/postgresql_cluster/blob/master/automation/pg_upgrade.yml#L19 doesn't consider schema to get access to Patroni API (http/https)

  2. Tablespaces.

  • when we have a custom tablespace, then during the execution of the schema precheck creating new TS directory. Then running pg_upgrade precheck, and fail occurs with an error that the directory with the TS is already in use. I chose the solution not to dump the creation of the TS, with the key --no-tablespaces. Otherwise we need to clean up the directories with the TS after the precheck scheme, but I don't like deleting data on production :)
  • I added an additional check that the TS is not inside PGDATA. Because upgrading with hardlinks and then deleting the old directory will be fatal
  1. In case of scheme precheck fail https://github.com/vitabaks/postgresql_cluster/blob/master/automation/roles/upgrade/tasks/schema_compatibility.yml#L109 task with reinit not started https://github.com/vitabaks/postgresql_cluster/blob/master/automation/roles/upgrade/tasks/schema_compatibility.yml#L121 and new pgdata is not cleared and postgres is not stopped, which makes it impossible to restart.

  2. Parameters

  • parameters from old postgresql.base.conf are not transferred in any way. For example, some of our clusters was converted from standalone/replication to patroni and store part of the configuration in postgresql.base.conf
  • DCS cache file removing during https://github.com/vitabaks/postgresql_cluster/blob/master/automation/roles/upgrade/tasks/update_config.yml#L11 I did not understand the reason why you decided to do this, but deleting the file and deleting information from the DCS (patronictl remove) deprives us of the entire dynamic configuration, including important parameters for us - logical replication slots
  • in the initdb https://github.com/vitabaks/postgresql_cluster/blob/master/automation/roles/upgrade/tasks/initdb.yml#L91 not considering available of custom wal_segment_size (for example we are using 64MB)

glushakov avatar Oct 07 '24 14:10 glushakov

Thanks for the feedback @glushakov

If possible, attach more doers, error logs, how to reproduce, an example of the fix code that you mentioned. This will help to implement improvements in a shorter time.

Also feel free to offer PR especially for small fixes.

vitabaks avatar Oct 07 '24 16:10 vitabaks

For some cases there is no code yet, because they are still in the design stage (for example, transferring parameters from postgresql.base.conf).

Errors, for example for existing tablespace error:


$ mkdir ts/my_tblspc
$ psql
postgres=# create tablespace qwe location '/var/lib/pgsql/ts/my_tblspc';

<..cut ..>

TASK [Print tablespace location] ***************************************************************************************************************************************************************************************************************************************************
skipping: [alma-db2]
ok: [alma-db1] => {
    "pg_tablespace_location.stdout_lines": [
        "/var/lib/pgsql/ts/my_tblspc"
    ]
}

<..cut ..>


TASK [Result of checking the compatibility of the scheme - success] ****************************************************************************************************************************************************************************************************************
ok: [alma-db1] => {
    "msg": "The database schema are compatible with PostgreSQL 16"
}

<..cut..>

TASK [Print the result of the pg_upgrade check] ************************************************************************************************************************************************************************************************************************************
fatal: [alma-db1]: FAILED! => {
    "failed_when_result": true,
    "pg_upgrade_check_result.stdout_lines": [
        "Performing Consistency Checks on Old Live Server",
        "------------------------------------------------",
        "Checking cluster versions                                     ok",
        "Checking database user is the install user                    ok",
        "Checking database connection settings                         ok",
        "Checking for prepared transactions                            ok",
        "Checking for system-defined composite types in user tables    ok",
        "Checking for reg* data types in user tables                   ok",
        "Checking for contrib/isn with bigint-passing mismatch         ok",
        "Checking for incompatible \"aclitem\" data type in user tables  ok",
        "Checking for presence of required libraries                   ok",
        "Checking database user is the install user                    ok",
        "Checking for prepared transactions                            ok",
        "Checking for new cluster tablespace directories               ",
        "new cluster tablespace directory already exists: \"/var/lib/pgsql/ts/my_tblspc/PG_16_202307071\"",
        "Failure, exiting"
    ]
}

The "schema precheck" process created a tablespace in the /var/lib/pgsql/ts/my_tblspc/PG_16_202307071 directory, but then did not delete it. Then the pg_upgrade --check process detect that this directory already contained elements from the TS and failed.

As i said earlier, my decision is not to use tablespace in pg_dumpall command for schema checks (add --no-tablespaces)

glushakov avatar Oct 08 '24 13:10 glushakov