amundsen

Make BigQuery Watermark Extractor support Relational Metadata store

Open sahithi03 opened this issue 4 years ago • 4 comments

With a relational DB as the metadata store, the BigQuery metadata extractor job has to run before the watermark extractor job, because the table_watermark table has a foreign key dependency on the table_metadata table. However, if new tables are added to the Google Cloud project configured for the extractors while the BigQuery metadata extractor job is running, the watermark extractor might extract watermarks for tables that do not yet exist in the table_metadata table, which leads to a FK constraint violation.
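The failure mode can be reproduced in miniature. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for MySQL, and the two-column schema (`rk`, `table_rk`) is a simplified assumption, not the actual Amundsen schema.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL metadata store (assumption).
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Simplified, hypothetical versions of the two tables.
conn.execute("CREATE TABLE table_metadata (rk TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE table_watermark ("
    " rk TEXT PRIMARY KEY,"
    " table_rk TEXT NOT NULL REFERENCES table_metadata(rk))"
)

# The metadata job has already loaded 't_old'.
conn.execute("INSERT INTO table_metadata (rk) VALUES ('t_old')")
# A watermark for a known table loads fine.
conn.execute("INSERT INTO table_watermark VALUES ('w_old', 't_old')")

# 't_new' was created while the metadata job was running, so it is missing
# from table_metadata and its watermark violates the foreign key.
try:
    conn.execute("INSERT INTO table_watermark VALUES ('w_new', 't_new')")
    violated = False
except sqlite3.IntegrityError:
    violated = True

watermark_rows = conn.execute("SELECT COUNT(*) FROM table_watermark").fetchone()[0]
```

Only the watermark whose parent row already exists is stored; the second insert is rejected by the constraint.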

Expected Behavior or Use Case

The watermark extractor should only extract watermark metadata for tables created before the execution time of the BigQuery metadata extractor, so that it never fetches tables that would cause a FK constraint violation.

Service or Ingestion ETL

Databuilder Extractors

Possible Implementation

One possible solution is to add a config option, cutoff-time, which can be set to the actual execution time of the BigQuery metadata task. If it is not configured, cutoff-time defaults to the current time. The watermark extractor can then check whether the table creation time (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#Table) is earlier than the cutoff-time and extract watermark metadata only for those tables.
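A minimal sketch of that check, assuming the extractor iterates over dictionaries shaped like the BigQuery Table resource, where `creationTime` is a string holding milliseconds since the epoch. The function names and the way the cutoff is passed in are hypothetical and not part of the existing databuilder extractor.

```python
import time
from typing import Any, Dict, Iterable, Iterator, Optional


def resolve_cutoff_ms(configured_cutoff_ms: Optional[int] = None) -> int:
    """Return the configured cutoff, or the current time if none is set."""
    if configured_cutoff_ms is not None:
        return configured_cutoff_ms
    return int(time.time() * 1000)


def tables_before_cutoff(
    tables: Iterable[Dict[str, Any]], cutoff_ms: int
) -> Iterator[Dict[str, Any]]:
    """Yield only tables whose creationTime is strictly before the cutoff."""
    for table in tables:
        creation_time = table.get("creationTime")
        if creation_time is None:
            # Assumption: without a creation time we cannot prove the table
            # existed before the cutoff, so it is skipped.
            continue
        if int(creation_time) < cutoff_ms:
            yield table


# Usage with made-up table entries; the cutoff is the metadata job's start time.
sample_tables = [
    {"tableReference": {"tableId": "old_table"}, "creationTime": "1000"},
    {"tableReference": {"tableId": "new_table"}, "creationTime": "3000"},
]
kept = [
    t["tableReference"]["tableId"]
    for t in tables_before_cutoff(sample_tables, resolve_cutoff_ms(2000))
]
```

With the default (current time) cutoff, every table that already exists passes the check, so the behavior for graph DB users would be unchanged.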

Example Screenshots (if appropriate):

Context

We are using BigQuery as the DataWarehouse and MySQL as the metadata store.

sahithi03 avatar May 10 '21 23:05 sahithi03

cc @crazy-2020

feng-tao avatar May 11 '21 06:05 feng-tao

Thanks, I do not have any concerns. The cutoff-time would work for both graph DB and MySQL users, and its default value works for the main graph DB users.

xuan616 avatar May 11 '21 22:05 xuan616

Feel free to create a PR to fix it! Thanks.

feng-tao avatar May 14 '21 04:05 feng-tao

Makes sense. FYI, for graph DB, if the table only appears after the extractor has run, the watermark node will be a stale node, which won't affect the UI. But I can see how it could be an issue in the RDBMS case.

feng-tao avatar May 19 '21 04:05 feng-tao