Migration error when upgrading marquez from 0.42 -> 0.44
When upgrading our marquez instance, I ran into this set of errors - any help would be appreciated!
Output from the marquez api when spinning up a 0.44 image:
INFO [2024-01-29 19:00:57,000] org.flywaydb.core.internal.command.DbMigrate: Current version of schema "public": 62
INFO [2024-01-29 19:00:57,020] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "63 - alter tables add on cascade delete"
INFO [2024-01-29 19:09:40,848] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "64 - drop job contexts"
INFO [2024-01-29 19:09:40,882] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "65 - alter dataset facets to change lineage event type as nullable"
INFO [2024-01-29 19:09:40,896] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "66.1 - alter job facets"
INFO [2024-01-29 19:09:40,928] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "66.2 - alter lineage events add event type"
INFO [2024-01-29 19:09:40,942] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "66.3 - BackfillJobFacetsWithJobVersion" [non-transactional]
INFO [2024-01-29 19:09:41,040] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "67.1 - job versions io mapping add job reference"
INFO [2024-01-29 19:10:06,954] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "67.2 - Back fill job_uuid and is_current_job_version in job_versions_io_mapping table" [non-transactional]
ERROR [2024-01-29 19:11:38,810] org.flywaydb.core.internal.command.DbMigrate: Migration of schema "public" to version "67.2 - Back fill job_uuid and is_current_job_version in job_versions_io_mapping table" [non-transactional] failed! Please restore backups and roll back database and code!
INFO [2024-01-29 19:11:38,814] marquez.db.DbMigration: Repairing failed database migration...
INFO [2024-01-29 19:11:38,815] org.flywaydb.core.internal.license.VersionPrinter: Flyway Community Edition 8.5.13 by Redgate
INFO [2024-01-29 19:11:38,815] org.flywaydb.core.internal.license.VersionPrinter: See what's new here: https://flywaydb.org/documentation/learnmore/releaseNotes#8.5.13
INFO [2024-01-29 19:11:38,815] org.flywaydb.core.internal.license.VersionPrinter:
INFO [2024-01-29 19:11:38,859] org.flywaydb.core.internal.command.DbRepair: Successfully repaired schema history table "public"."flyway_schema_history" (execution time 00:00.038s).
INFO [2024-01-29 19:11:38,859] marquez.db.DbMigration: Successfully repaired database.
INFO [2024-01-29 19:11:38,859] marquez.MarquezApp: Stopping app...
This appears to be the error in postgres:
2024-01-29 21:56:19 UTC:10.100.150.112(53888):postgres@marquez:[1174]:ERROR: duplicate key value violates unique constraint "job_versions_io_mapping_mapping_pkey"
2024-01-29 21:56:19 UTC:10.100.150.112(53888):postgres@marquez:[1174]:DETAIL: Key (job_version_uuid, dataset_uuid, io_type, job_uuid)=(b25fa767-7be2-4458-9af5-8dc6a461f76a, f1cd01aa-7185-495e-ae18-e0aecb22c924, INPUT, e4ffa634-d786-4013-9faf-51dcd60859a0) already exists.
Thanks for opening your first issue in the Marquez project! Please be sure to follow the issue template!
Thanks for wanted to try the latest and greatest of Marquez, but sorry you ran into this migration issue. We will get back to you later this week after we've had a chance to look into this further.
That's interesting, as migration should only add new column to existing primary key, which shall be unique.
Could you check rows within job_versions_io_mapping table with job_version_uuid =b25fa767-7be2-4458-9af5-8dc6a461f76a and dataset_uuid = f1cd01aa-7185-495e-ae18-e0aecb22c924 ?
Thats for taking a look! Here is what I've found:
SELECT * FROM public.job_versions_io_mapping as t
WHERE t.job_version_uuid = 'b25fa767-7be2-4458-9af5-8dc6a461f76a'
AND t.dataset_uuid = 'f1cd01aa-7185-495e-ae18-e0aecb22c924'
Yields 3 rows
| job_version_uuid | dataset_uuid | io_type | job_uuid | job_symlink_target_uuid | is_current_job_version | made_current_at |
|---|---|---|---|---|---|---|
| b25fa767-7be2-4458-9af5-8dc6a461f76a | f1cd01aa-7185-495e-ae18-e0aecb22c924 | INPUT | null | null | false | null |
| b25fa767-7be2-4458-9af5-8dc6a461f76a | f1cd01aa-7185-495e-ae18-e0aecb22c924 | INPUT | null | null | false | null |
| b25fa767-7be2-4458-9af5-8dc6a461f76a | f1cd01aa-7185-495e-ae18-e0aecb22c924 | INPUT | null | null | false | null |
Another thing to note is that this database definitely might have some bogus data in it. This deployment is pretty old and we haven't yet run the data cleanup job on it - any guidance on how to trace down offending versions? It is certainly possible old test events or malformed events made its way in in previous versions and just have been sitting around.
Okay looking at this more closely - it seems like the job_uuid is null, which almost assuredly is not expected
it's something that I not assumed within the migration. @wslulciuc Is this a possible outcome of some retention mechanism which cleaned job table but did not clean job_versions nor job_version_mapping. The question is: is it OK for a user to have such rows in DB.
The obvious solution would be to run DELETE FROM job_versions_io_mapping WHERE job_uuid IS NULL, but it's better to know how come is that possible?
It is possible that the migration went awry once and wasn't completely cleaned up - does this migration roll back if it fails?
Also - it looks like every row in job_versions_io_mapping has job_uuid = NULL
In 67.1 there is this snippet:
-- To add job_uuid to the unique constraint, we first drop the primary key, then recreate it; note given that job_version_uuid can be NULL, we need to check that job_version_uuid != NULL before inserting (duplicate columns otherwise)
ALTER TABLE job_versions_io_mapping DROP CONSTRAINT job_versions_io_mapping_pkey;
ALTER TABLE job_versions_io_mapping ALTER COLUMN job_version_uuid DROP NOT NULL;
so maybe something went wrong here?