
fix(ingestion): Fetch Upstreams From Columns

Open egemenberk opened this issue 1 year ago • 2 comments

This PR fixes the following two things:

  • The Tableau metadata GraphQL API returns an empty list for upstreamTables on embedded datasources, even though the upstreamColumns field contains the information. This PR populates upstream table information from the upstreamColumns field in the Tableau response.
  • The Tableau metadata GraphQL API returns malformed SQL queries, which causes failures when fetching upstream lineage from CustomSQLs. Fixing this also lets embedded data sources connect to the CustomSQLs, which are (generally) connected to upstreams from other platforms, completing the full lineage.
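
To illustrate the first point, here is a minimal sketch of the fallback, not the actual code in this PR. It assumes the datasource payload is a plain dict and that each upstreamColumns entry carries a nested `table` object with an `id`. Those shapes and the helper name `upstream_tables_from_columns` are hypothetical.

```python
from typing import Any, Dict, List


def upstream_tables_from_columns(datasource: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return upstream tables, falling back to tables referenced by upstreamColumns.

    Assumed (hypothetical) payload shape:
      {"upstreamTables": [...], "upstreamColumns": [{"name": ..., "table": {"id": ...}}]}
    """
    tables = datasource.get("upstreamTables") or []
    if tables:
        # The API already returned table-level lineage; use it as-is.
        return list(tables)

    # Otherwise collect the distinct tables that the upstream columns point to,
    # preserving first-seen order and skipping columns without a table.
    seen: Dict[str, Dict[str, Any]] = {}
    for column in datasource.get("upstreamColumns") or []:
        table = (column or {}).get("table") or {}
        table_id = table.get("id")
        if table_id and table_id not in seen:
            seen[table_id] = table
    return list(seen.values())
```

Deduplicating by table id matters here because several upstream columns usually belong to the same table.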

Checklist

  • [x] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • [ ] Links to related issues (if applicable)
  • [ ] Tests for the changes have been added/updated (if applicable)
  • [ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • [ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

egemenberk avatar Feb 16 '24 17:02 egemenberk

@egemenberk some query cleaning logic is also being added in this PR https://github.com/datahub-project/datahub/pull/9838 - that one also removes parameter names and things to make SQL parsing work. Does it make sense to unify these two query cleaning implementations?

hsheth2 avatar Mar 18 '24 23:03 hsheth2

> @egemenberk some query cleaning logic is also being added in this PR #9838 - that one also removes parameter names and things to make SQL parsing work. Does it make sense to unify these two query cleaning implementations?

Hi @hsheth2, I've taken a quick look at the PR you mentioned and it seems to fix the query, so I can remove the clean_query() method call from my implementation. The main focus of my PR is fetching upstream lineage from upstreamColumns when the upstreamTables field is empty in the Tableau response. The clean_query addition was a side fix made while working on that task, so I can drop it and rely on the #9838 implementation. Thanks for the information 👍

egemenberk avatar Mar 19 '24 08:03 egemenberk