datahub icon indicating copy to clipboard operation
datahub copied to clipboard

feat(ingest): powerbi # powerbi odbc dataset should be ingested

Open looppi opened this issue 3 years ago • 1 comments

When there's ODBC data sources in PowerBI, the current ingestion skips them, because the PowerBI REST API won't provide any tables for the data source. In order to get at least the table names for the data source, additional REST call to executeQueries API has to be made.

This PR adds functionality to get the data source schema and additionally ingests datasets that are not directly used in dashboards or reports.

Tests are changed to match the newly introduced functionality.

Open questions:

  • How to handle database and schema name when the we only have connectionString present for data source?
  • ~~Is it really necessary to store data sources both as dict and instances of DataSource dataclass?~~ There was an upstream change which made the implementation clear.

Checklist

  • [x] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • [x] Links to related issues (if applicable)
  • [x] Tests for the changes have been added/updated (if applicable)
  • [x] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • [x] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

looppi avatar Nov 15 '22 08:11 looppi

I made a decision here that datasets which are not directly used in any report or dashboard, are ingested to Datahub. In the previous implementation those datasets would have been left out. Would it be better if such action would be behind a feature flag? For our use case it was must have requirement, but I'm not sure anymore whether that is the case with general use case.

looppi avatar Nov 24 '22 09:11 looppi

Unit Test Results (metadata ingestion)

       8 files         8 suites   59m 17s :stopwatch:    767 tests    765 :heavy_check_mark: 2 :zzz: 0 :x: 1 536 runs  1 531 :heavy_check_mark: 5 :zzz: 0 :x:

Results for commit 7152bc7b.

github-actions[bot] avatar Dec 05 '22 13:12 github-actions[bot]

Unit Test Results (build & test)

621 tests   617 :heavy_check_mark:  15m 47s :stopwatch: 157 suites      4 :zzz: 157 files        0 :x:

Results for commit 7152bc7b.

github-actions[bot] avatar Dec 05 '22 13:12 github-actions[bot]

Checking in - where are we on this?

jjoyce0510 avatar Jan 24 '23 03:01 jjoyce0510

I slightly changed course. As I finally realized that our tenant didn't have all the metadata switches enabled, I've now implemented the schema parsing from the scan result. Initially I had good progress, but I'm still trying to fix a case where in PowerBI, a dataset has some kind of folder in it and the columns are inside the folder. I will be dropping the DAX calls and other now unnecessary stuff asap and start to work towards finishing this code.

looppi avatar Jan 24 '23 07:01 looppi

Waiting on another pass from @Mohd-gslab.

Thanks for the updates @looppi !

jjoyce0510 avatar Jan 26 '23 23:01 jjoyce0510

@mohdsiddique I fixed couple of issues in the code, one being Python 3.7 compatibility issue. I think this should be ready for review now. Thank you!

looppi avatar Feb 10 '23 13:02 looppi

@looppi would it be accurate to say that https://github.com/datahub-project/datahub/pull/7519 is a superset of the changes here? I'm planning on merging that one soon cc @aezomz, but wanted to make sure that satisfies your requirements as well.

hsheth2 avatar Apr 06 '23 01:04 hsheth2

@hsheth2 Yes, it's correct, the #7519 includes an implementation of the PowerBI schema handling too.

~~I have one question though: Is there a test case in #7519 for testing the outcome of the column/measure ingestion? I couldn't find a change in the golden files related to this feature. In this PR #6440, all of the golden files changed so that the schemaMetadataevent is visible in the golden files.~~

EDIT: I stand corrected. There was couple of golden files with schema metadata, so all good! :+1:

looppi avatar Apr 06 '23 11:04 looppi

@looppi thanks for the confirmation

I'm closing this PR in favor of https://github.com/datahub-project/datahub/pull/7519

hsheth2 avatar Apr 21 '23 05:04 hsheth2