feat(Metabase): collection inheritance and integration improvements with BigQuery
Context and Motivation
Metabase ingestion is functional and covers most use cases. However, for the company I work for it was not working properly: when the underlying engine was BigQuery, information extraction did not behave as expected, and it was impossible to build data lineage between the datasets produced by the BigQuery ingestion and the information provided by the Metabase ingestion. The objective of this PR is therefore to improve the ingestion when the data source is BigQuery and to add new functionality to create Containers based on Metabase Collections.
Relevant remarks
- In registering the platform, the official Metabase logo found at https://www.metabase.com/images/logo.svg was used.
- The previous ingestion code did not handle cases where a card is sourced from one or more other cards. The lineage-extraction method now performs a recursive search to find the underlying source tables (see the first sketch after this list).
- The previous code did not account for the fact that, for BigQuery, table paths are given as <project_id>.<dataset_id>.<table_id>, so lineage could not be built. The project_id is now included when constructing the dataset URNs (see the second sketch after this list).
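A minimal sketch of the recursive card-to-card resolution described above, assuming a simplified card payload where `source-table` is either a table id or a `card__<id>` reference to another card; all names and structures here are illustrative, not the PR's actual implementation:

```python
from typing import Dict, List, Optional, Set

def find_source_tables(
    card: Dict,
    cards_by_id: Dict[int, Dict],
    seen: Optional[Set[int]] = None,
) -> List[int]:
    """Recursively resolve the table ids a card ultimately reads from."""
    seen = seen if seen is not None else set()
    if card["id"] in seen:
        # Guard against circular card references.
        return []
    seen.add(card["id"])

    source = card["dataset_query"]["query"]["source-table"]
    if isinstance(source, str) and source.startswith("card__"):
        # The card is built on top of another card: recurse into it.
        parent = cards_by_id[int(source.removeprefix("card__"))]
        return find_source_tables(parent, cards_by_id, seen)
    # A plain table id: this is a real source table.
    return [source]
```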
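And a sketch of how the project_id can be folded into the BigQuery dataset URN so it matches what DataHub's own BigQuery ingestion produces, using the `make_dataset_urn` helper from `datahub.emitter.mce_builder`; the project, dataset, and table names below are made up:

```python
from datahub.emitter.mce_builder import make_dataset_urn

# Hypothetical values illustrating the <project_id>.<dataset_id>.<table_id> path.
project_id = "my-gcp-project"
dataset_id = "analytics"
table_id = "orders"

# Including the project_id makes the Metabase-derived URN line up with the URN
# produced by the BigQuery ingestion, so lineage between the two can be built.
urn = make_dataset_urn(
    platform="bigquery",
    name=f"{project_id}.{dataset_id}.{table_id}",
    env="PROD",
)
print(urn)
# urn:li:dataset:(urn:li:dataPlatform:bigquery,my-gcp-project.analytics.orders,PROD)
```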
Improvements
Some images to demonstrate the results:
- It is now possible to create a Container based on a Collection.
- The Container inherits the information from the Collection (a sketch follows this list).
- Lineage can now be established with BigQuery tables ingested by DataHub's own BigQuery source (backward compatibility).
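A minimal, hypothetical sketch of what "inheriting" Collection metadata into a Container could look like with the DataHub Python emitter; the collection payload, URN scheme, and server address are made up, and constructor arguments may differ between DataHub versions (this is not the PR's actual code):

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ContainerPropertiesClass

# Hypothetical Metabase collection payload (e.g. from /api/collection/:id).
collection = {"id": 42, "name": "Finance dashboards", "description": "KPIs for finance"}

# Illustrative deterministic container URN derived from the collection id.
container_urn = f"urn:li:container:metabase-collection-{collection['id']}"

# The container "inherits" the collection's name and description.
mcp = MetadataChangeProposalWrapper(
    entityUrn=container_urn,
    aspect=ContainerPropertiesClass(
        name=collection["name"],
        description=collection.get("description"),
    ),
)

DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp)
```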
Checklist
- [x] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
- [ ] Links to related issues (if applicable)
- [ ] Tests for the changes have been added/updated (if applicable)
- [x] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
- [ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub
@Mohd-gslab Can you take a look at this one, please?
Unit Test Results (build & test)
98 files (+2) | 98 suites (+2) | 25m 51s :stopwatch: (+7m 56s)
718 tests (+29) | 659 :heavy_check_mark: (+43) | 59 :zzz: (-8) | 0 :x: (-6)
Results for commit fba79204. ± Comparison against base commit 61dc6e87.
Unit Test Results (metadata ingestion)
4 files (-1) | 4 suites (-1) | 58m 53s :stopwatch: (+31s)
436 tests (+6) | 436 :heavy_check_mark: (+6) | 0 :zzz: (±0) | 0 :x: (±0)
1 694 runs (-381) | 1 645 :heavy_check_mark: (-364) | 49 :zzz: (-17) | 0 :x: (±0)
Results for commit fba79204. ± Comparison against base commit 61dc6e87.
This pull request removes 1 and adds 7 tests. Note that renamed tests count towards both.
Removed:
tests.integration.feast.test_feast ‑ test_feast_ingest

Added:
tests.integration.feast-legacy.test_feast ‑ test_feast_ingest
tests.integration.feast.test_feast_repository ‑ test_feast_repository_ingest
tests.unit.test_bq_get_partition_range ‑ test_get_partition_range_from_partition_id
tests.unit.test_pipeline.TestPipeline ‑ test_configure_without_sink
tests.unit.test_snowflake_source ‑ test_account_id_is_added_when_host_port_is_present
tests.unit.test_snowflake_source ‑ test_snowflake_source_throws_error_on_account_id_missing
tests.unit.test_utilities ‑ test_with_keyword_data
@PatrickfBraz Why did you close?
I closed this pull request for the following reasons:
- The metadata ingestion framework has had several updates; this code was written for version 0.8.37 and an older version of Metabase.
- I'm working on a more significant improvement that has been tested in my company's deployment.
In our use case, the data source for Metabase is BigQuery, and the current version doesn't handle it well. Because of that, we are planning to open a pull request with our improvements, probably in the coming months.