marquez icon indicating copy to clipboard operation
marquez copied to clipboard

Migration error when upgrading marquez from 0.42 -> 0.44

Open mattwparas opened this issue 1 year ago • 10 comments

When upgrading our marquez instance, I ran into this set of errors - any help would be appreciated!

Output from the marquez api when spinning up a 0.44 image:

INFO  [2024-01-29 19:00:57,000] org.flywaydb.core.internal.command.DbMigrate: Current version of schema "public": 62
INFO  [2024-01-29 19:00:57,020] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "63 - alter tables add on cascade delete"
INFO  [2024-01-29 19:09:40,848] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "64 - drop job contexts"
INFO  [2024-01-29 19:09:40,882] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "65 - alter dataset facets to change lineage event type as nullable"
INFO  [2024-01-29 19:09:40,896] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "66.1 - alter job facets"
INFO  [2024-01-29 19:09:40,928] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "66.2 - alter lineage events add event type"
INFO  [2024-01-29 19:09:40,942] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "66.3 - BackfillJobFacetsWithJobVersion" [non-transactional]
INFO  [2024-01-29 19:09:41,040] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "67.1 - job versions io mapping add job reference"
INFO  [2024-01-29 19:10:06,954] org.flywaydb.core.internal.command.DbMigrate: Migrating schema "public" to version "67.2 - Back fill job_uuid and is_current_job_version in job_versions_io_mapping table" [non-transactional]
ERROR [2024-01-29 19:11:38,810] org.flywaydb.core.internal.command.DbMigrate: Migration of schema "public" to version "67.2 - Back fill job_uuid and is_current_job_version in job_versions_io_mapping table" [non-transactional] failed! Please restore backups and roll back database and code!
INFO  [2024-01-29 19:11:38,814] marquez.db.DbMigration: Repairing failed database migration...
INFO  [2024-01-29 19:11:38,815] org.flywaydb.core.internal.license.VersionPrinter: Flyway Community Edition 8.5.13 by Redgate
INFO  [2024-01-29 19:11:38,815] org.flywaydb.core.internal.license.VersionPrinter: See what's new here: https://flywaydb.org/documentation/learnmore/releaseNotes#8.5.13
INFO  [2024-01-29 19:11:38,815] org.flywaydb.core.internal.license.VersionPrinter:
INFO  [2024-01-29 19:11:38,859] org.flywaydb.core.internal.command.DbRepair: Successfully repaired schema history table "public"."flyway_schema_history" (execution time 00:00.038s).
INFO  [2024-01-29 19:11:38,859] marquez.db.DbMigration: Successfully repaired database.
INFO  [2024-01-29 19:11:38,859] marquez.MarquezApp: Stopping app...

This appears to be the error in postgres:

2024-01-29 21:56:19 UTC:10.100.150.112(53888):postgres@marquez:[1174]:ERROR: duplicate key value violates unique constraint "job_versions_io_mapping_mapping_pkey"
2024-01-29 21:56:19 UTC:10.100.150.112(53888):postgres@marquez:[1174]:DETAIL: Key (job_version_uuid, dataset_uuid, io_type, job_uuid)=(b25fa767-7be2-4458-9af5-8dc6a461f76a, f1cd01aa-7185-495e-ae18-e0aecb22c924, INPUT, e4ffa634-d786-4013-9faf-51dcd60859a0) already exists.

mattwparas avatar Jan 29 '24 22:01 mattwparas

Thanks for opening your first issue in the Marquez project! Please be sure to follow the issue template!

boring-cyborg[bot] avatar Jan 29 '24 22:01 boring-cyborg[bot]

Thanks for wanted to try the latest and greatest of Marquez, but sorry you ran into this migration issue. We will get back to you later this week after we've had a chance to look into this further.

wslulciuc avatar Jan 30 '24 10:01 wslulciuc

That's interesting, as migration should only add new column to existing primary key, which shall be unique. Could you check rows within job_versions_io_mapping table with job_version_uuid =b25fa767-7be2-4458-9af5-8dc6a461f76a and dataset_uuid = f1cd01aa-7185-495e-ae18-e0aecb22c924 ?

pawel-big-lebowski avatar Jan 30 '24 12:01 pawel-big-lebowski

Thats for taking a look! Here is what I've found:

SELECT * FROM public.job_versions_io_mapping as t
         WHERE t.job_version_uuid = 'b25fa767-7be2-4458-9af5-8dc6a461f76a'
           AND t.dataset_uuid = 'f1cd01aa-7185-495e-ae18-e0aecb22c924'

Yields 3 rows

job_version_uuid dataset_uuid io_type job_uuid job_symlink_target_uuid is_current_job_version made_current_at
b25fa767-7be2-4458-9af5-8dc6a461f76a f1cd01aa-7185-495e-ae18-e0aecb22c924 INPUT null null false null
b25fa767-7be2-4458-9af5-8dc6a461f76a f1cd01aa-7185-495e-ae18-e0aecb22c924 INPUT null null false null
b25fa767-7be2-4458-9af5-8dc6a461f76a f1cd01aa-7185-495e-ae18-e0aecb22c924 INPUT null null false null

mattwparas avatar Jan 30 '24 21:01 mattwparas

Another thing to note is that this database definitely might have some bogus data in it. This deployment is pretty old and we haven't yet run the data cleanup job on it - any guidance on how to trace down offending versions? It is certainly possible old test events or malformed events made its way in in previous versions and just have been sitting around.

mattwparas avatar Jan 30 '24 21:01 mattwparas

Okay looking at this more closely - it seems like the job_uuid is null, which almost assuredly is not expected

mattwparas avatar Jan 30 '24 22:01 mattwparas

it's something that I not assumed within the migration. @wslulciuc Is this a possible outcome of some retention mechanism which cleaned job table but did not clean job_versions nor job_version_mapping. The question is: is it OK for a user to have such rows in DB.

The obvious solution would be to run DELETE FROM job_versions_io_mapping WHERE job_uuid IS NULL, but it's better to know how come is that possible?

pawel-big-lebowski avatar Jan 31 '24 09:01 pawel-big-lebowski

It is possible that the migration went awry once and wasn't completely cleaned up - does this migration roll back if it fails?

mattwparas avatar Jan 31 '24 15:01 mattwparas

Also - it looks like every row in job_versions_io_mapping has job_uuid = NULL

mattwparas avatar Jan 31 '24 17:01 mattwparas

In 67.1 there is this snippet:

-- To add job_uuid to the unique constraint, we first drop the primary key, then recreate it; note given that job_version_uuid can be NULL, we need to check that job_version_uuid != NULL before inserting (duplicate columns otherwise)
ALTER TABLE job_versions_io_mapping DROP CONSTRAINT job_versions_io_mapping_pkey;
ALTER TABLE job_versions_io_mapping ALTER COLUMN job_version_uuid DROP NOT NULL;

so maybe something went wrong here?

mattwparas avatar Jan 31 '24 17:01 mattwparas