amundsen
Make BigQuery Watermark Extractor support Relational Metadata store
When a relational DB is the metadata store, the BigQuery metadata extractor job has to run before the watermark extractor job in order to satisfy the foreign key dependency between the table_metadata and table_watermark tables. However, if new tables are added to the configured Google Cloud project while the BigQuery metadata extractor job is running, the watermark extractor might extract watermarks for tables that do not yet exist in the table_metadata table, which leads to a FK constraint violation.
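To illustrate the failure mode, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for MySQL. The schema is heavily simplified and the column names (`rk`, `table_rk`), the example keys, and the `insert_watermark` helper are assumptions made for illustration, not the actual Amundsen schema.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL metadata store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

# Simplified, hypothetical versions of the two tables.
conn.execute("CREATE TABLE table_metadata (rk TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE table_watermark ("
    " rk TEXT PRIMARY KEY,"
    " table_rk TEXT NOT NULL REFERENCES table_metadata(rk))"
)

# The metadata extractor only saw old_table during its run.
conn.execute(
    "INSERT INTO table_metadata (rk) VALUES (?)",
    ("bq://project/dataset/old_table",),
)


def insert_watermark(table_rk: str) -> bool:
    """Try to insert a watermark row; return False on an FK violation."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO table_watermark (rk, table_rk) VALUES (?, ?)",
                (table_rk + "/high_watermark", table_rk),
            )
        return True
    except sqlite3.IntegrityError:
        return False


old_ok = insert_watermark("bq://project/dataset/old_table")
# new_table was created after the metadata extractor listed the tables.
new_ok = insert_watermark("bq://project/dataset/new_table")
```

Here the second insert is rejected because its parent row is missing from table_metadata, which is the same class of error the watermark job would hit.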
Expected Behavior or Use Case
The watermark extractor should only extract watermarks for tables created before the execution time of the BigQuery metadata extractor, so that it never fetches tables that could lead to a FK constraint violation.
Service or Ingestion ETL
Databuilder Extractors
Possible Implementation
One possible solution is to add a config option, cutoff-time, which can be set to the actual execution time of the BigQuery metadata task. If it is not configured, the cutoff-time defaults to the current time. The watermark extractor can then check whether the table creation time (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#Table) is earlier than the cutoff-time and extract watermark metadata only for those tables.
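A rough sketch of the cutoff check, assuming each table is a dict shaped like the BigQuery REST `Table` resource, where `creationTime` is a string holding milliseconds since the epoch. The function name `tables_created_before` and the parameter `cutoff_time_ms` are hypothetical and not part of databuilder.

```python
import time
from typing import Any, Dict, Iterable, List, Optional


def tables_created_before(
    tables: Iterable[Dict[str, Any]],
    cutoff_time_ms: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Keep only tables whose creationTime is earlier than the cutoff.

    cutoff_time_ms is hypothetical: epoch milliseconds, ideally the
    execution time of the metadata extractor task. When it is not
    given, it falls back to the current time.
    """
    if cutoff_time_ms is None:
        cutoff_time_ms = int(time.time() * 1000)
    # creationTime arrives as a string in the REST response, so cast it.
    return [t for t in tables if int(t["creationTime"]) < cutoff_time_ms]


# Made-up sample data for illustration only.
tables = [
    {"tableReference": {"tableId": "old_table"}, "creationTime": "1600000000000"},
    {"tableReference": {"tableId": "new_table"}, "creationTime": "1700000000000"},
]
kept = tables_created_before(tables, cutoff_time_ms=1650000000000)
kept_ids = [t["tableReference"]["tableId"] for t in kept]
```

With the cutoff set between the two creation times, only `old_table` is kept, so the watermark extractor would skip the table that the metadata extractor never saw.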
Example Screenshots (if appropriate):
Context
We are using BigQuery as the data warehouse and MySQL as the metadata store.
cc @crazy-2020
Thanks, I do not have any concerns. The cutoff-time would work for both graph DB and MySQL users, and its default value works for the main graph DB users.
Feel free to create a PR to fix it! Thanks.
Makes sense. FYI, for graph DB, if the table only appears after the extractor has run, the watermark node will be a stale node, which won't affect the UI. But I can see how it could be an issue in the RDBMS case.