Detect debian/alpine image mismatch
From README:
The default official Docker PostgreSQL image is Debian Linux based, and upgrading from that to one of our Alpine Linux based images doesn't always work out well.
Would it be possible to detect a situation where a Debian-based PostgreSQL is being upgraded with an Alpine-based pgautoupgrade, or vice-versa, and refuse to run in that case?
Context: I recently received a report from a user bitten by the debian/alpine mismatch. I can add a warning message in my project's documentation, but I think a better fix would be for pgautoupgrade to refuse to run (or at least ask the user to confirm) when it can detect potential breakage ahead.
That's a good idea, but personally I have no idea how we'd do the detection as I don't think the needed info is recorded anywhere. 😬
@wuast94 @andyundso @spwoodcock Any idea how we could potentially detect this?
maybe, not sure.
so Postgres has a table named pg_collation:
postgres=# SELECT * FROM pg_collation;
oid | collname | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | colliculocale | collicurules | collversion
-------+------------------------+---------------+-----------+--------------+---------------------+--------------+-------------+------------+------------------+--------------+-------------
100 | default | 11 | 10 | d | t | -1 | | | | |
950 | C | 11 | 10 | c | t | -1 | C | C | | |
951 | POSIX | 11 | 10 | c | t | -1 | POSIX | POSIX | | |
962 | ucs_basic | 11 | 10 | c | t | 6 | C | C | | |
963 | unicode | 11 | 10 | i | t | -1 | | | und | | 153.120
12344 | C.utf8 | 11 | 10 | c | t | 6 | C.utf8 | C.utf8 | | |
12345 | en_US.utf8 | 11 | 10 | c | t | 6 | en_US.utf8 | en_US.utf8 | | | 2.36
12346 | en_US | 11 | 10 | c | t | 6 | en_US.utf8 | en_US.utf8 | | | 2.36
that list is incredible long, but you can see the version 2.36 for en_US, which corresponds to the libc version in my container:
root@73592e642f29:/# dpkg -l | grep libc
ii libc-bin 2.36-9+deb12u3 amd64 GNU C Library: Binaries
ii libc-l10n 2.36-9+deb12u3 all GNU C Library: localization files
a plain Alpine container in my tests does not have an en_US.utf8.
but there is a query documented on DBA exchange which can detect objects with collations that need an upgrade. need to check further what this query reports when changing the base OS.
Hmmm, I wonder if the problem itself could be fixed by rebuilding the objects (indexes, etc) then running the ALTER COLLATION name REFRESH VERSION command, as per the note here?
https://www.postgresql.org/docs/16//sql-altercollation.html#SQL-ALTERCOLLATION-NOTES
The notes on that page also say:
For the database default collation, there is an analogous command
ALTER DATABASE ... REFRESH COLLATION VERSION.
and:
The following query can be used to identify all collations in the current database that need to be refreshed and the objects that depend on them:
SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS "Collation",
pg_describe_object(classid, objid, objsubid) AS "Object"
FROM pg_depend d JOIN pg_collation c
ON refclassid = 'pg_collation'::regclass AND refobjid = c.oid
WHERE c.collversion <> pg_collation_actual_version(c.oid)
ORDER BY 1, 2;
If we can get this working reliably, then we might not have to be so concerned about the Alpine vs Debian base image mismatch. :smile:
Preamble
This seems like a very complex problem!
Interesting read from research: http://rhaas.blogspot.com/2021/11/collation-stability.html
Also useful info from the official postgres container images readme:
which suggests we might have more luck with this on Postgres > v15, especially if based on alpine.
But the blog above suggests that postgres maintainers have abandoned approaches to determine this automatically, as it's probably too flaky and variable between systems and versions 😅
At least one positive for us is that we only have to consider Linux Debian/Alpine for containers, so it could be achievable. (plus in theory each container image should have pinned versions of the collations we can work with)
Approach A
As suggested, we could do the upgrade regardless, but then just try to fix the collation issues.
Could work!
Approach B
1. Determine Collation Provider
From the pg_collation table and LC_COLLATE & LC_CTYPE.
SELECT
(SELECT setting FROM pg_settings WHERE name = 'lc_collate') AS lc_collate,
(SELECT setting FROM pg_settings WHERE name = 'lc_ctype') AS lc_ctype,
(SELECT DISTINCT collprovider FROM pg_collation WHERE collname = 'en_US.UTF-8' OR collname = 'C') AS collprovider;
collprovider |
lc_collate Output |
Likely OS/Library |
|---|---|---|
c |
en_US.UTF-8 |
GLIBC (Debian) |
c |
C.UTF-8 |
MUSL (Alpine) |
i |
en-US-u-co-icu |
ICU-based |
b |
C or POSIX |
Builtin (C collation) |
I assume on systems that aren't using
en_US.UTF-8we can probably substitute this from theLANGenv variable or similar? But the official container images do default toen_US.UTF-8unless explicitly set otherwise.
2. Determine Current C Lib
- If
collprovider=i, the upgrade should likely not have issues and we can continue. - If
collprovider=c, we need to determine if the system C lib is compatible:
GLIBC: if ldd --version 2>/dev/null | grep -q 'GLIBC'; then echo 1; else echo 0; fi
MUSL: if ldd /bin/sh 2>&1 | grep -q 'musl'; then echo 1; else echo 0; fi
3. Decide Upgrade Strategy
- If ICU-based, the upgrade should likely not have issues, so continue (as above).
- If mismatch between collation c-lib and system c-lib, either abort, or use a pg_dump / restore approach?
- If a match, continue with the upgrade.
@justinclift I tried that; that SELECT returned nothing and I was in the pickle of what to do, as theoretically rebuilding indexes should fix it. I went as far as PgSQL code and it seems like it may be a "bug" in the DB engine itself. I'm saying this in quotation marks as you generally shouldn't use data directory between musl and glibc, it's one of these you did what?!-situations. Because of this I'm hesitant to bother upstream.
PgSQL walks in circles in this case: doing REINDEX SYSTEM <dbName> + REINDEX DATABASE <dbName> on both postgres and application database seems to be doing what is supposed to. However, even after that ALTER DATABASE ... REFRESH COLLATION VERSION refuses to run with invalid collation version change.
An article in PgSQL wiki is even more explicit about dangers of locale data changes and chaos that it can unleash; to summarize:
-
(...) it is essential to the correct operation of a database that the locale definitions do not change incompatibly during the lifetime of a database.
-
(...) PostgreSQL currently has no way to detect an incompatible glibc update.
- Surprise locale "update" can corrupt replicas, affect partitions, and break other things (see "What is affected" section)
The rabbit hole goes even deeper. The collation differences appear to be a major danger in replication scenarios, and even glibc version changes can mess things a lot.
Given all that, it appears the problem is a bit bigger than just indexes and using binary data between different C runtimes can create a ticking-bomb disaster.
Edit
...as @spwoodcock posted at the exact same time :) I think the only safe approach is dump + restore, as it appears that no binary data is trustworthy between runtimes. I also found a PgSQL wiki page about collations, that links to a long mailing list thread where they tried to tackle a similar challange. There was even a patch it seems in v14 that was reverted. This to me suggests the problem is much more complex to deal with, especially when partitioning is involved, and the invalid collation version change is just a red herring where the engine refuses to potentially make the problem even worse.
Cool. :smile:
Does it seem like we could use pieces of @spwoodcock's "Approach B" to determine if the underlying image type has changed, and if it has then we run a warning+abort routine instead of attempting to fix the situation ourselves?
As a data point, when I upgraded my desktop system from Debian 12 (Bookworm) to Debian 13 (Trixie) today, it performed an upgrade of the PG database to PG 17.
While doing so, it displayed this warning message for each of the databases present:
WARNING: database "postgres" has a collation version mismatch
DETAIL: The database was created using collation version 2.36, but the operating system provides version 2.41.
HINT: Rebuild all objects in this database that use the default collation and run ALTER DATABASE postgres REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.
Before seeing that I wasn't sure whether the collation version is stored anywhere, but from seeing that message it clearly is.