Handling of sudden crash could be improved

Open dwyart opened this issue 2 months ago • 1 comments

Running in a Kubernetes environment, we had the pgautoupgrade initContainer fail during the migration, so the container died after the filesystem had already been modified. It was then restarted automatically, but:

if [ -f "$UPGRADE_LOCK_FILE" ]; then
  echo "Upgrade lock file already exists, indicating an incomplete previous upgrade. Exiting."
  exit 1
fi

if ! _is_sourced; then
	_main "$@"
fi

did not work, because mv -v "${PGDATA}"/* "${OLD}" also moves the lock file into old/, so the check no longer finds it.

	if [ -s "$PGDATA/PG_VERSION" ]; then
		DATABASE_ALREADY_EXISTS='true'
	fi

This check also doesn't work, because mv has moved PG_VERSION into old/ as well, so we end up calling docker_init_database_dir, which is not what we want for a "migration only" scenario.

Nov 13 '25 12:11 dwyart
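
For anyone who wants to reproduce the state described above without a real cluster, here is a minimal sketch in a throwaway directory. It assumes that OLD is "$PGDATA/old" and that the lock file lives directly inside PGDATA; the function name simulate_interrupted_upgrade and the lock file name are hypothetical stand-ins, not taken from the pgautoupgrade script.

```shell
# Sketch only: simulates the directory state after a crash mid-upgrade.
# The lock file name and the layout (OLD inside PGDATA) are assumptions.
simulate_interrupted_upgrade() {
  pgdata=$1
  old="$pgdata/old"
  lock="$pgdata/upgrade_in_progress.lock"   # hypothetical name

  # Fake a minimal data directory with a lock file in place.
  mkdir -p "$pgdata/base"
  echo 15 > "$pgdata/PG_VERSION"
  touch "$lock"

  # Same pattern as in the issue: the glob matches the lock file and
  # PG_VERSION too. mv cannot move "old" into itself and reports that,
  # but it still moves every other entry.
  mkdir "$old"
  mv "$pgdata"/* "$old" 2>/dev/null || true

  # The two checks a restarted container would run.
  if [ -f "$lock" ]; then
    echo "lock check: incomplete upgrade detected"
  else
    echo "lock check: nothing detected, lock file is now under old/"
  fi
  if [ -s "$pgdata/PG_VERSION" ]; then
    echo "PG_VERSION check: existing database found"
  else
    echo "PG_VERSION check: no database found, init would run"
  fi
}

demo_dir="$(mktemp -d)"
simulate_interrupted_upgrade "$demo_dir/pgdata"
```

In this simulation both checks take the "nothing found" branch, which matches the behaviour reported above. Note that the glob does not match dotfiles, so a lock file whose name starts with a dot would not be moved by this mv.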

I generally agree that pgautoupgrade could restore the state of the data directory when it exits. If I understand correctly, pg_upgrade leaves the existing data untouched.

I am not sure I will have the capacity to implement this soon, because I think it is not trivial and needs proper testing. EXIT needs to be trapped somewhere, the "move back" operation needs to run, and the exit code needs to be re-raised so that Kubernetes registers the failure. There is also the special case of Postgres v18, where we cannot simply move things back to $PGDATA because the data could also be mounted at /var/lib/postgresql. Finally, there is a potential race condition with our healthcheck: once we move the data back, the lock file is gone, so the healthcheck queries the Postgres database, which is not there, and the container might be terminated before our cleanup has had the chance to run.

Nov 20 '25 08:11 andyundso
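
To make the "trap EXIT, move back, re-raise" idea concrete, here is a rough bash sketch. It is not the pgautoupgrade implementation: guarded_upgrade and restore_on_failure are hypothetical names, the upgrade step is passed in as an arbitrary command standing in for the real pg_upgrade invocation, and OLD is assumed to be "$PGDATA/old". It handles neither the Postgres v18 mount layout nor the healthcheck race mentioned above.

```shell
# Sketch only: hypothetical helper, not part of pgautoupgrade.
# Usage: guarded_upgrade <pgdata-dir> <upgrade-command> [args...]
guarded_upgrade() {
  local pgdata=$1
  shift
  (
    old="$pgdata/old"   # assumed layout

    restore_on_failure() {
      status=$?         # exit code of whatever ended this subshell
      trap - EXIT       # make sure the handler cannot run twice
      if [ "$status" -ne 0 ] && [ -d "$old" ]; then
        echo "upgrade failed with status $status, restoring $pgdata" >&2
        for entry in "$old"/*; do
          [ -e "$entry" ] || continue   # glob matched nothing
          mv "$entry" "$pgdata"/
        done
        rmdir "$old" || true
      fi
      exit "$status"    # re-raise so the caller sees the original failure
    }
    trap restore_on_failure EXIT

    mkdir "$old" || exit 1
    # "old" cannot be moved into itself; mv reports that and still
    # moves the other entries.
    mv "$pgdata"/* "$old" 2>/dev/null || true

    "$@" || exit $?     # stand-in for the real upgrade step
    exit 0
  )
}
```

A call such as guarded_upgrade "$PGDATA" some_upgrade_command returns the command's exit code, and on a non-zero code the contents of old/ are moved back first. An EXIT trap does not run when the process receives SIGKILL (for example on an OOM kill or a node failure), so a restart-time check for leftover state in old/ would still be needed for those cases.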