feat(tableau): ability to force extraction of table/column level lineage from SQL queries
Related issues
Everything was fixed in a backward-compatible way; this should not break anything.
- https://github.com/datahub-project/datahub/issues/9841
- https://github.com/datahub-project/datahub/issues/9842
- https://github.com/datahub-project/datahub/issues/9843
- https://github.com/datahub-project/datahub/issues/9844
- https://github.com/datahub-project/datahub/issues/9845
- https://github.com/datahub-project/datahub/issues/9846
- https://github.com/datahub-project/datahub/issues/9847
- https://github.com/datahub-project/datahub/issues/9880
- https://github.com/datahub-project/datahub/issues/9881
Features
force_extraction_of_lineage_from_custom_sql_queries: true | false
disable_schema_awarenes_during_parsing_of_sql_queries: true | false
Details
force_extraction_of_lineage_from_custom_sql_queries: true | false
If enabled, the tableau source plugin will not rely on Tableau Custom SQL metadata; instead, it will always parse table- and column-level lineage out of SQL queries (regardless of whether they are supported by Tableau). The feature was implemented because, in our case, the quality of Tableau's Custom SQL metadata (even for supported SQL queries) was worse than what the sqlglot-based parser produces. It allowed us to finally ingest our huge Tableau projects and be happy enough with the quality of the resulting lineage.
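For illustration, table-level extraction with sqlglot looks roughly like this (a minimal sketch with made-up query and table names, not DataHub's actual parser code):

```python
import sqlglot
from sqlglot import exp

# A made-up Custom SQL query; Tableau's own metadata for such queries is
# often incomplete, while sqlglot can recover the upstream tables directly.
query = """
SELECT a.x, b.y
FROM db1.sch1.t1 AS a
JOIN db2.sch2.t2 AS b ON a.id = b.id
"""
statement = sqlglot.parse_one(query, dialect="tsql")

# Collect the fully qualified name of every table referenced by the query.
upstreams = {exp.table_name(t) for t in statement.find_all(exp.Table)}
print(upstreams)  # {'db1.sch1.t1', 'db2.sch2.t2'}
```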
disable_schema_awarenes_during_parsing_of_sql_queries: true | false
If enabled, the SQL parser will not look up previously ingested table schemas from DataHub during parsing. This is helpful when the Tableau projects are old, some of them are abandoned, and their Custom SQL queries are out of sync with the real table schemas; in that case the SQL parser fails silently when trying to resolve the columns referenced by the SQL queries. This feature makes it possible to ingest such a Tableau swamp and get at least table-level lineage and mostly column-level lineage (when columns are prefixed with aliases and `*` is not used).
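As a rough illustration of why alias prefixes matter (a hypothetical query, not DataHub code): without schema info, a column can only be attributed to its source table when the query itself says which table it comes from:

```python
import sqlglot
from sqlglot import exp

# With several joined tables and no schema info, only alias-prefixed
# columns can be mapped to a source table; bare columns or `SELECT *`
# would be ambiguous.
sql = """
SELECT o.id, o.total, c.name
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.id
"""
for col in sqlglot.parse_one(sql).find_all(exp.Column):
    print(col.table, "->", col.name)  # e.g. "o -> id", "c -> name"
```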
How to use
```yml
source:
  type: tableau
  config:
    ...
    force_extraction_of_lineage_from_custom_sql_queries: true
    disable_schema_awarenes_during_parsing_of_sql_queries: true
    ingest_tables_external: true
    ...
```
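With the recipe saved locally (say, as recipe.yml - a hypothetical filename), ingestion runs as usual via the CLI: `datahub ingest -c recipe.yml`.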
Also, the PR contains some bug fixes for issues we found while ingesting metadata from Tableau.
Checklist
- [x] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
- [x] Links to related issues (if applicable)
- [x] Tests for the changes have been added/updated (if applicable)
- [ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
- [ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub
@alexs-101 thanks for the PR! Before diving into the code, I wanted to make sure I understood the motivation behind each feature. Btw - it makes it a lot easier to review if things are split up into multiple PRs
Things that make sense to me
- `force_extraction_of_lineage_from_custom_sql_queries` - agree that tableau's lineage is pretty bad, and we should prefer our own to theirs. Likely will want to flip the default on this in the future
- The salesforce dialect hack - seems reasonable
Things I'd like to discuss
- The TSQLv2 dialect - ideally we contribute this back to sqlglot, or if they're not willing to accept it, we can merge it into the acryl-sqlglot fork
- `disable_schema_awarenes_during_parsing_of_sql_queries` and the `schema_aware` flag - I didn't realize that having custom SQL queries digress from the underlying table schemas was a common thing.
  - My goal in building the initial parser was that it would be permissive enough that it could handle missing columns gracefully. Do you have some specific examples where a certain query + schema info would fail, but just the query with no schema info generates results that are more correct? If so, maybe we can fix those instances, and avoid needing to disable schema awareness across the board
- Also, you can disable schema-awareness by not passing a graph instance to the `SchemaResolver` - that feels like a cleaner approach
- Adding `in_tables_schemas` - it seems like this can be fully derived from the rest of the SqlParsingResult? The logic might be better suited to live in a helper method, and not part of the core SqlParsingResult
One other thing - it seems like the t-sql changes only make sense when QUOTED_IDENTIFIER is OFF (as per https://stackoverflow.com/questions/62917351/what-is-the-syntax-of-a-string-literal-in-t-sql) - so I'm not sure it makes sense to change the default. It might make sense to make this configurable in some way
@hsheth2 big thanks for the quick feedback! I will answer (or fix) everything tomorrow morning with a fresh mind.
@hsheth2
> Adding `in_tables_schemas` - it seems like this can be fully derived from the rest of the SqlParsingResult? The logic might be better suited to live in a helper method, and not part of the core SqlParsingResult
Done https://github.com/datahub-project/datahub/pull/9838/commits/2f0eb02119b25286913027823e1263b5a520b826
> The TSQLv2 dialect - ideally we contribute this back to sqlglot, or if they're not willing to accept it, we can merge it into the acryl-sqlglot fork. One other thing - it seems like the t-sql changes only make sense when QUOTED_IDENTIFIER is OFF (as per https://stackoverflow.com/questions/62917351/what-is-the-syntax-of-a-string-literal-in-t-sql) - so I'm not sure it makes sense to change the default. It might make sense to make this configurable in some way
I removed my workaround in https://github.com/datahub-project/datahub/pull/9838/commits/7abe4bc38fece570d8da6d13089203e4c7a0ff72. I wanted to contribute it to SQLGlot, but found that support for strings as aliases was recently added to SQLGlot: https://github.com/tobymao/sqlglot/commit/d31ae0decb46678851744356c7b113f8c1c3e8c9
Could DataHub start using that recent version? It's critical for us, because in our case people were actively using " and ' to quote identifiers in T-SQL queries in Tableau.
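For context, this is the shape of query we need to handle (a made-up example; assuming a sqlglot version that includes the commit linked above, this now parses under the tsql dialect):

```python
import sqlglot

# A single-quoted string used as a column alias - common in our Tableau
# Custom SQL, and only recently accepted by sqlglot's T-SQL dialect.
sql = "SELECT amount AS 'Total Amount' FROM dbo.orders"
parsed = sqlglot.parse_one(sql, dialect="tsql")
print(parsed.sql(dialect="tsql"))
```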
> `disable_schema_awarenes_during_parsing_of_sql_queries` and the `schema_aware` flag - I didn't realize that having custom SQL queries digress from the underlying table schemas was a common thing. My goal in building the initial parser was that it would be permissive enough that it could handle missing columns gracefully. Do you have some specific examples where a certain query + schema info would fail, but just the query with no schema info generates results that are more correct? If so, maybe we can fix those instances, and avoid needing to disable schema awareness across the board
I tried to reproduce it with no luck. I just remember that it was failing in the depths of SQLGlot's optimize() function when it gets a schema that is out of sync with the query (a query referencing columns that are not present in the schema), and it was not failing if we didn't pass any schema at all:
```python
try:
    ...
    statement = sqlglot.optimizer.optimizer.optimize(
        statement,
        dialect=dialect,
        schema=sqlglot_db_schema,
        qualify_columns=True,
        validate_qualify_columns=False,
        identify=True,
        # sqlglot calls the db -> schema -> table hierarchy "catalog", "db", "table".
        catalog=default_db,
        db=default_schema,
        rules=RULES_BEFORE_TYPE_ANNOTATION,
    )
except (sqlglot.errors.OptimizeError, ValueError) as e:
    raise SqlUnderstandingError(
        f"sqlglot failed to map columns to their source tables; likely missing/outdated table schema info: {e}"
    ) from e
```
Judging by the message of the new exception that the code throws, I was not the only one who hit this behaviour:

> sqlglot failed to map columns to their source tables; likely missing/outdated table schema info
It seems like one of SQLGlot's optimization rules is buggy. It might even have been fixed already in recent versions. So maybe the DataHub team could update the fork of SQLGlot?
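A minimal sketch of the failure mode described above (hypothetical table/column names; the exact exception and message depend on the sqlglot version):

```python
import sqlglot
from sqlglot.optimizer import optimize

# The query references a column that the (stale) schema no longer has.
sql = "SELECT id, missing_col FROM t"
stale_schema = {"t": {"id": "int"}}

try:
    optimize(sqlglot.parse_one(sql), schema=stale_schema)
except (sqlglot.errors.OptimizeError, ValueError) as e:
    print(f"fails with an out-of-sync schema: {e}")

# Without any schema, the same query optimizes fine: the column is simply
# attributed to the only table in scope instead of being validated.
print(optimize(sqlglot.parse_one(sql)).sql())
```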
For now, I followed your recommendation:
> you can also disable schema-awareness by not passing a graph instance to the SchemaResolver - that feels like a cleaner approach
https://github.com/datahub-project/datahub/pull/9838/commits/076ce37f5d8e64c653eb3db09a8b76760e132ddc
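For reference, a hedged sketch of what that looks like (module paths and exact signatures have moved between DataHub versions, so treat this as an approximation; the platform and table names are made up):

```python
# In recent DataHub versions these live under datahub.sql_parsing;
# older versions had them in datahub.utilities.sqlglot_lineage.
from datahub.sql_parsing.schema_resolver import SchemaResolver
from datahub.sql_parsing.sqlglot_lineage import sqlglot_lineage

# No `graph` argument -> no DataHub lookups of previously ingested
# schemas, i.e. parsing is effectively schema-unaware.
schema_resolver = SchemaResolver(platform="mssql", env="PROD", graph=None)

result = sqlglot_lineage(
    "SELECT o.id, o.total FROM dbo.orders AS o",
    schema_resolver=schema_resolver,
    default_db="sales_db",  # made-up database name
)
print(result.in_tables, result.out_tables)
```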
@hsheth2 I'm looking into the failed tests.
@hsheth2
Judging by the latest test results:
- https://github.com/datahub-project/datahub/actions/runs/7938517997
- https://github.com/datahub-project/datahub/actions/runs/7938517995
These tests failed:
- test_simple_lineage_multiple_paths - seems unrelated to my changes; after updating from upstream/master, it looks like someone has already commented it out (disabled it)
- test_create_list_get_ingestion_execution_request - seems unrelated to my changes (let me know if I'm wrong)
- test_tableau_cll_ingest - I updated the golden files, and now it passes successfully: https://github.com/datahub-project/datahub/pull/9838/commits/f1a05d9d9230dacea3436fefe62b7db0dc239557
Also, I updated my branch from the latest upstream/master. Could you please kick off the tests again?
@hsheth2 resolved conflicts
@hsheth2 I will fix everything tomorrow morning.
tbh the fact that we have `ingest_tables_external` at all feels pretty strange to me - I guess it makes sense for CSVs and things, but for snowflake/other warehouse tables, we really should just rely on those other sources. But I guess it's alright for now, especially as it's disabled by default
It's a very helpful feature for those who can only ingest Tableau and don't have access to other metadata sources. Could we keep it?
> especially as it's disabled by default
yep
@hsheth2 finished; could you start the tests again, please?
@hsheth2
Seems like this test failed:
`test_create_list_get_ingestion_execution_request`

```
E - urn:li:dataHubExecutionRequest:219530d5-775d-4a38-9c93-d393cc8239a4
E + urn:li:dataHubExecutionRequest:af738e1d-4870-4519-af1a-fc88c1d57d6b
```
I checked; it cannot be due to my changes.
@hsheth2 I pulled in the latest changes from master; could you please start the tests again?
@hsheth2 I checked the failed tests, and all of them appear to have failed NOT due to changes introduced by this PR:
```
FAILED tests/integration/s3/test_s3.py::test_data_lake_s3_ingest[multiple_files.json] - Failed: Metadata files differ (use `pytest --update-golden-files` to update):
Urn changed, urn:li:dataset:(urn:li:dataPlatform:s3,test-platform-instance.my-test-bucket/folder_a/folder_aa/folder_aaa/NPS.7.1.package_data_NPS.6.1_ARCN_Lakes_ChemistryData_v1_csv.csv,DEV):
<schemaMetadata> changed:
Item aspect['fields'][20]['type']['type']['com.linkedin.schema.DateType'] added to dictionary.
Item aspect['fields'][20]['type']['type']['com.linkedin.schema.StringType'] removed from dictionary.
Value of aspect['fields'][20]['nativeDataType'] changed from "string" to "date".
FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[multiple_files.json] - Failed: Metadata files differ (use `pytest --update-golden-files` to update):
Urn changed, urn:li:dataset:(urn:li:dataPlatform:file,test-platform-instance.tests/integration/s3/test_data/local_system/folder_a/folder_aa/folder_aaa/NPS.7.1.package_data_NPS.6.1_ARCN_Lakes_ChemistryData_v1_csv.csv,DEV):
<schemaMetadata> changed:
Item aspect['fields'][20]['type']['type']['com.linkedin.schema.DateType'] added to dictionary.
Item aspect['fields'][20]['type']['type']['com.linkedin.schema.StringType'] removed from dictionary.
Value of aspect['fields'][20]['nativeDataType'] changed from "string" to "date".
```
I will fix it if you say so. :)
> @hsheth2 I checked the failed tests, and all of them appear to have failed NOT due to changes introduced by this PR
> ........
> I will fix it if you say so. :)
+1 They fail in my PR #9874 as well.
@alexs-101 @egemenberk I agree those failures look unrelated, and I believe they have been fixed on the master branch. I just synced this branch with master, and hopefully that unblocks CI.
Found a bug: https://github.com/datahub-project/datahub/issues/10357 - the force_extraction_of_lineage_from_custom_sql_queries property is not leveraged and an error is thrown.