amundsen icon indicating copy to clipboard operation
amundsen copied to clipboard

Adds ability to extract metadata from Google BigQuery using SQLAlchemy

Open avelasquezhe84 opened this issue 3 years ago • 5 comments
trafficstars

feat – Adds ability to extract metadata from Google BigQuery using SQLAlchemy

Summary of Changes

Based on the Metadata Extractor for Snowflake, I created a metadata extractor for Google BigQuery, which queries the INFORMATION_SCHEMA instead of looping through all the projects and tables in the database.

  1. Addition of the CREDS_PATH parameter to the databuilder/databuilder/extractor/sql_alchemy_extractor.py file
  2. Addition of required libraries to the databuilder/setup.py file
  3. Creation of the databuilder/databuilder/extractor/bigquery_sql_alchemy_metadata_extractor.py file
  4. Creation of the databuilder/example/scripts/sample_bigquery_sql_alchemy_data_loader.py implementation file.

Tests

Implemented this feature successfully in a client's project.

Documentation

Similar to the Snowflake Metadata Extractor, in order to extract metadata from BigQuery using this method, the following parameters need to be provided by modifying the databuilder/example/scripts/sample_bigquery_sql_alchemy_data_loader.py file:

  • BIGQUERY_PROJECT_KEY = 'YourBigqueryProjectId'
  • BIGQUERY_TABLE_SCHEMA_KEY = 'YourBigqueryTableSchemaName'
  • BIGQUERY_CONNECTION_STRING = f'bigquery://{BIGQUERY_PROJECT_KEY}'
  • BIGQUERY_CREDENTIALS_PATH = ''

Refer this doc: https://github.com/googleapis/python-bigquery-sqlalchemy for supported connection parameters and configurations

CheckList

  • [X] PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • [X] PR includes a summary of changes.
  • [X] PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • [X] In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does

avelasquezhe84 avatar Feb 28 '22 21:02 avelasquezhe84

Congratulations on your first Pull Request and welcome to Amundsen community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/amundsen-io/amundsen/blob/main/CONTRIBUTING.md)

boring-cyborg[bot] avatar Feb 28 '22 21:02 boring-cyborg[bot]

cc @xuan616

feng-tao avatar Feb 28 '22 21:02 feng-tao

That's correct

On Mon, Feb 28, 2022, 20:06 Xuan @.***> wrote:

@.**** commented on this pull request.

In databuilder/databuilder/extractor/bigquery_sql_alchemy_metadata_extractor.py https://github.com/amundsen-io/amundsen/pull/1740#discussion_r816371448:

  •    self._database = conf.get_string(BigQueryMetadataExtractor.DATABASE_KEY)
    
  •    self._schema = conf.get_string(BigQueryMetadataExtractor.DATABASE_KEY)
    
  •    self._bigquery_project = conf.get_string(BigQueryMetadataExtractor.BIGQUERY_PROJECT_KEY)
    
  •    self._bigquery_table_schema = conf.get_string(BigQueryMetadataExtractor.BIGQUERY_TABLE_SCHEMA_KEY)
    
  •    self._bigquery_schema = conf.get_string(BigQueryMetadataExtractor.BIGQUERY_SCHEMA_KEY)
    
  •    self.sql_stmt = BigQueryMetadataExtractor.SQL_STATEMENT.format(
    
  •        where_clause_suffix=conf.get_string(BigQueryMetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY),
    
  •        project=self._bigquery_project,
    
  •        table_schema=self._bigquery_table_schema,
    
  •        schema=self._bigquery_schema
    
  •    )
    
  •    LOGGER.info('SQL for bigquery metadata: %s', self.sql_stmt)
    
  •    self._alchemy_extractor = sql_alchemy_extractor.from_surrounding_config(conf, self.sql_stmt)
    

does it support only keyfile path for now

— Reply to this email directly, view it on GitHub https://github.com/amundsen-io/amundsen/pull/1740#pullrequestreview-895754086, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEA2ZBKFBFPGF2MLEYFPBI3U5QLX7ANCNFSM5PSGUIKA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you authored the thread.Message ID: @.***>

avelasquezhe84 avatar Mar 01 '22 01:03 avelasquezhe84

LGTM. Hi @avelasquezhe84 Can I know what is the difference from the current bigquery_metadata_extrator in terms of the results extracted when we're using sqlalchemy/information_schema, like the filter for sharded tables. Thanks

BTW, it would be better if there are unit tests for this new extractor. :smiley:

xuan616 avatar Mar 01 '22 01:03 xuan616

+1 on the unit test

feng-tao avatar Mar 02 '22 05:03 feng-tao