amundsen
amundsen copied to clipboard
Make BigQuery Usage Extractor Support Relational Metadata Store
Currently usage extractor needs a couple of changes in order to use it with a relational metadata store:
-
GCP console allows running queries using tables from a project different from the one the extractor is used for; e.g We can query tables of Project A from the console of Project B. This will result in "referenced tables" containing tables from a different project. If we use Amundsen with a specific env(google cloud project), the backend relational metadata store will not store tables from a different project leading to a FK constraint violation between table_usage and table_metadata tables.
-
The bigquery usage extractor job will run after bigquery metadata extractor job because of the foreign key dependency between table_usage and table_metadata tables, however if new tables are created and queried while BQ Metadata extractor job is still running, the logging API response might return a few referenced tables that don’t exist yet in table_metadata tables leading to a FK constraint violation.
-
Whenever we run queries on tables that have table decorators the referenced table name has the decorator included in it. e.g
SELECT COUNT(*) FROM [PROJECT_ID:DATASET.TABLE@-3600000]the referenced tables returned by this query will be TABLE@- 3600000, since this is not originally stored in table_metadata table, this will again result in FK constraint violation -
Similarly when date sharded tables are queried (Table_20210503), the "referenced tables" contains the table_name along with the suffix (Table_20210503) but since the metadata extractor does not extract metadata of each sharded table, this will lead to a FK constraint violation.
-
When temporary/ anonymous tables are queried their usage count is also recorded.
-
When we have metadata queries like
select * from my-project.my-dataset.INFORMATION_SCHEMA.TABLES;orselect * from my-project.my-dataset.__TABLES__the "referenced tables" contains INFORMATION_SCHEMA or__TABLES__as the table referred by the query job.
Expected Behavior or Use Case
BigQuery usage extractor should only extract usage for tables present in the table_metadata table.
Service or Ingestion ETL
Databuilder extractors
Possible Implementation
Possible solutions for each of the issues mentioned above:
- Adding a boolean config (with default value False) that dictates if we want to count usage for tables present in other projects as well or limit the usage counts to tables present in the google cloud project configured with the extractor. 'projectID' in "referenced tables" can be used for making this check.
- Adding a Config cutoff-time which can be set to the execution-time of the bigquery metadata task. If not configured, the default for the cutoff-time config can be current time. Filter timestamp of the logging entries -- only fetch entries that have a timestamp less than the config cutoff time.
- Remove table decorators from table Names before recording the usage counts.
- For date sharded tables, count usage for the table_prefix since only the table_prefix is extracted by the bigquery metadata extractor.
- Do not count usage for query jobs that reference anonymous tables/datasets. Anonymous datasets start with '_'
- Filter out queries that contain INFORMATION_SCHEMA or
__TABLES__in them.
Example Screenshots (if appropriate):
Context
We are using BigQuery as the DataWarehouse and MySQL as the metadata store.
cc @crazy-2020
I think the change would be helpful to filter out unneeded items before publishing and make the usage statistic more accurate. Currently, bigquery_usage_extractor would fetch more tables/items than bigquery_metadata_extractor does, and those additional tables/items are never retrieved from search_extractor or metadata service. For the implementation of filtering out non-table items from logging API response, see if others have any concerns or input.
Feel free to create to pr to fix it! thanks