
feat(Metabase): collection inheritance and integration improvements with BigQuery

Open PatrickfBraz opened this issue 2 years ago • 3 comments

Context and Motivation

Metabase ingestion is functional and covers most cases. However, for the use case of the company I work for, it wasn't working properly: extraction did not behave as expected when the engine was BigQuery, and it was impossible to build data lineage between the datasets created by the BigQuery ingestion and those reported by the Metabase ingestion. The objective of this PR is therefore to improve ingestion when the data source is BigQuery and to add new functionality that creates Containers based on Metabase collections.

Relevant remarks

  • The platform was created using the official Metabase logo found at https://www.metabase.com/images/logo.svg
  • The previous ingestion code did not consider cases where a card is sourced from one or more other cards. The method that extracts the lineage now does a recursive search to find the source tables.
  • The previous code did not consider that, in the case of BigQuery, the paths to the tables are given by <project_id>.<dataset_id>.<table_id>, and therefore it was not possible to build the lineage. Now the project_id is included when constructing the dataset URNs.
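
The last two remarks can be illustrated with a minimal sketch. This is not the code from this PR: the `cards` dictionary is a hypothetical in-memory stand-in for what the Metabase API returns, the function names are made up for illustration, and the `card__<id>` notation for a card that reads from another card is assumed here. The URN is built as a plain string instead of calling a DataHub helper.

```python
from typing import Dict, List, Optional, Set

# Prefix assumed to mark "this source is another card" (e.g. "card__12").
CARD_PREFIX = "card__"


def resolve_source_tables(
    card_id: int,
    cards: Dict[int, List[str]],
    _seen: Optional[Set[int]] = None,
) -> Set[str]:
    """Recursively collect the physical tables a card ultimately reads from.

    `cards` is a hypothetical mapping of card id -> list of sources, where a
    source is either "<project_id>.<dataset_id>.<table_id>" or "card__<id>".
    """
    seen = set() if _seen is None else _seen
    if card_id in seen:
        # Already visited (shared upstream card or a cycle): nothing new to add.
        return set()
    seen.add(card_id)

    tables: Set[str] = set()
    for source in cards.get(card_id, []):
        if source.startswith(CARD_PREFIX):
            upstream_id = int(source[len(CARD_PREFIX):])
            tables |= resolve_source_tables(upstream_id, cards, seen)
        else:
            tables.add(source)
    return tables


def bigquery_dataset_urn(table_path: str, env: str = "PROD") -> str:
    """Build a dataset URN that keeps the project_id in the dataset name."""
    project_id, dataset_id, table_id = table_path.split(".")
    name = f"{project_id}.{dataset_id}.{table_id}"
    return f"urn:li:dataset:(urn:li:dataPlatform:bigquery,{name},{env})"
```

With cards `{1: ["proj.sales.orders"], 2: ["card__1", "proj.sales.customers"]}`, resolving card 2 yields both tables, and each one maps to a URN whose name starts with the project id, which is what lets it line up with the datasets produced by the BigQuery ingestion.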

Improvements

Some images to demonstrate the results

  • It is now possible to create a Container based on Collections. [image]
  • The container inherits the information from the collection. [image]
  • Lineage can now be established with BigQuery tables inserted by the ingestion provided by DataHub itself (backward compatibility). [image]

Checklist

  • [x] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • [ ] Links to related issues (if applicable)
  • [ ] Tests for the changes have been added/updated (if applicable)
  • [x] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • [ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

PatrickfBraz avatar Apr 14 '22 18:04 PatrickfBraz

@Mohd-gslab Can you take a look at this one, please?

maggiehays avatar Apr 14 '22 21:04 maggiehays

Unit Test Results (build & test)

  98 files +2, 98 suites +2, duration 25m 51s (+7m 56s)
  718 tests +29: 659 passed +43, 59 skipped -8, 0 failed -6

Results for commit fba79204. ± Comparison against base commit 61dc6e87.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Apr 14 '22 21:04 github-actions[bot]

Unit Test Results (metadata ingestion)

  4 files -1, 4 suites -1, duration 58m 53s (+31s)
  436 tests +6: 436 passed +6, 0 skipped ±0, 0 failed ±0
  1 694 runs -381: 1 645 passed -364, 49 skipped -17, 0 failed ±0

Results for commit fba79204. ± Comparison against base commit 61dc6e87.

This pull request removes 1 and adds 7 tests. Note that renamed tests count towards both.
tests.integration.feast.test_feast ‑ test_feast_ingest
tests.integration.feast-legacy.test_feast ‑ test_feast_ingest
tests.integration.feast.test_feast_repository ‑ test_feast_repository_ingest
tests.unit.test_bq_get_partition_range ‑ test_get_partition_range_from_partition_id
tests.unit.test_pipeline.TestPipeline ‑ test_configure_without_sink
tests.unit.test_snowflake_source ‑ test_account_id_is_added_when_host_port_is_present
tests.unit.test_snowflake_source ‑ test_snowflake_source_throws_error_on_account_id_missing
tests.unit.test_utilities ‑ test_with_keyword_data

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Apr 14 '22 21:04 github-actions[bot]

@PatrickfBraz Why did you close?

oristides avatar Sep 12 '22 14:09 oristides

I closed this pull request for the following reasons:

  1. The metadata ingestion framework has had several updates since then. This code was implemented for version 0.8.37 and an older version of Metabase.
  2. I'm working on a more significant improvement, tested in my company's deployment.

In our use case, the data source for Metabase is BigQuery, and the current version doesn't deal well with it. Because of that, we are planning to open a pull request with our improvements. I may open another pull request in the coming months.

PatrickfBraz avatar Sep 12 '22 14:09 PatrickfBraz