feat(tableau): ability to force extraction of table/column level lineage from SQL queries
Related issues
Everything was fixed in a backward-compatible way; this should not break anything.
- https://github.com/datahub-project/datahub/issues/9841
- https://github.com/datahub-project/datahub/issues/9842
- https://github.com/datahub-project/datahub/issues/9843
- https://github.com/datahub-project/datahub/issues/9844
- https://github.com/datahub-project/datahub/issues/9845
- https://github.com/datahub-project/datahub/issues/9846
- https://github.com/datahub-project/datahub/issues/9847
- https://github.com/datahub-project/datahub/issues/9880
- https://github.com/datahub-project/datahub/issues/9881
Features
force_extraction_of_lineage_from_custom_sql_queries: true | false
disable_schema_awarenes_during_parsing_of_sql_queries: true | false
Details
force_extraction_of_lineage_from_custom_sql_queries: true | false
If enabled, the tableau source plugin will not rely on Tableau Custom SQL metadata; instead, it will always parse table- and column-level lineage out of SQL queries (regardless of whether they are supported by Tableau). The feature was implemented because, in our case, the quality of Tableau's Custom SQL metadata (even for supported SQL queries) was worse than what the sqlglot-based parser produces. It allowed us to finally ingest our huge Tableau projects and be happy enough with the quality of the resulting lineage.
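For illustration, table-level extraction with sqlglot looks roughly like this (a minimal sketch with made-up query and table names, not DataHub's actual parser code):

```python
import sqlglot
from sqlglot import exp

# A made-up Custom SQL query; Tableau's own metadata for such queries is
# often incomplete, while sqlglot can recover the upstream tables directly.
query = """
SELECT a.x, b.y
FROM db1.sch1.t1 AS a
JOIN db2.sch2.t2 AS b ON a.id = b.id
"""
statement = sqlglot.parse_one(query, dialect="tsql")

# Collect the fully qualified name of every table referenced by the query.
upstreams = {exp.table_name(t) for t in statement.find_all(exp.Table)}
print(upstreams)  # {'db1.sch1.t1', 'db2.sch2.t2'}
```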
disable_schema_awarenes_during_parsing_of_sql_queries: true | false
If enabled, the SQL parser will not look up previously ingested table schemas from DataHub during parsing. This is helpful when the Tableau projects are old, some of them are abandoned, and their Custom SQL queries are out of sync with the real table schemas; in that case the SQL parser fails silently when trying to resolve the columns referenced by the SQL queries. This feature makes it possible to ingest such a Tableau swamp and get at least table-level lineage and mostly column-level lineage (when columns are prefixed with aliases and `*` is not used).
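As a rough illustration of why alias prefixes matter (a hypothetical query, not DataHub code): without schema info, a column can only be attributed to its source table when the query itself says which table it comes from:

```python
import sqlglot
from sqlglot import exp

# With several joined tables and no schema info, only alias-prefixed
# columns can be mapped to a source table; bare columns or `SELECT *`
# would be ambiguous.
sql = """
SELECT o.id, o.total, c.name
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.id
"""
for col in sqlglot.parse_one(sql).find_all(exp.Column):
    print(col.table, "->", col.name)  # e.g. "o -> id", "c -> name"
```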
How to use
```yml
source:
  type: tableau
  config:
    ...
    force_extraction_of_lineage_from_custom_sql_queries: true
    disable_schema_awarenes_during_parsing_of_sql_queries: true
    ingest_tables_external: true
    ...
```
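With the recipe saved locally (say, as recipe.yml - a hypothetical filename), ingestion runs as usual via the CLI: `datahub ingest -c recipe.yml`.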
Also, the PR contains some bug fixes for issues we found while ingesting metadata from Tableau.
Checklist
- [x] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
- [x] Links to related issues (if applicable)
- [x] Tests for the changes have been added/updated (if applicable)
- [ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
- [ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub
@alexs-101 thanks for the PR! Before diving into the code, I wanted to make sure I understood the motivation behind each feature. Btw - it makes it a lot easier to review if things are split up into multiple PRs
Things that make sense to me
- `force_extraction_of_lineage_from_custom_sql_queries` - agree that tableau's lineage is pretty bad, and we should prefer our own to theirs. Likely will want to flip the default on this in the future
- The salesforce dialect hack - seems reasonable
Things I'd like to discuss
- The TSQLv2 dialect - ideally we contribute this back to sqlglot, or if they're not willing to accept it, we can merge it into the acryl-sqlglot fork
- `disable_schema_awarenes_during_parsing_of_sql_queries` and the `schema_aware` flag - I didn't realize that having custom SQL queries digress from the underlying table schemas was a common thing.
  - My goal in building the initial parser was that it would be permissive enough that it could handle missing columns gracefully. Do you have some specific examples where a certain query + schema info would fail, but just the query with no schema info generates results that are more correct? If so, maybe we can fix those instances, and avoid needing to disable schema awareness across the board
- Also, you can disable schema-awareness by not passing a graph instance to the `SchemaResolver` - that feels like a cleaner approach
- Adding `in_tables_schemas` - it seems like this can be fully derived from the rest of the SqlParsingResult? The logic might be better suited to live in a helper method, and not part of the core SqlParsingResult
One other thing - it seems like the t-sql changes only make sense when QUOTED_IDENTIFIER is OFF (as per https://stackoverflow.com/questions/62917351/what-is-the-syntax-of-a-string-literal-in-t-sql) - so I'm not sure it makes sense to change the default. It might make sense to make this configurable in some way
@hsheth2 big thanks for the quick feedback! I will answer (or fix) everything tomorrow morning with a fresh mind.
@hsheth2
> Adding `in_tables_schemas` - it seems like this can be fully derived from the rest of the SqlParsingResult? The logic might be better suited to live in a helper method, and not part of the core SqlParsingResult
Done https://github.com/datahub-project/datahub/pull/9838/commits/2f0eb02119b25286913027823e1263b5a520b826
> The TSQLv2 dialect - ideally we contribute this back to sqlglot, or if they're not willing to accept it, we can merge it into the acryl-sqlglot fork. One other thing - it seems like the t-sql changes only make sense when QUOTED_IDENTIFIER is OFF (as per https://stackoverflow.com/questions/62917351/what-is-the-syntax-of-a-string-literal-in-t-sql) - so I'm not sure it makes sense to change the default. It might make sense to make this configurable in some way
I removed my workaround in https://github.com/datahub-project/datahub/pull/9838/commits/7abe4bc38fece570d8da6d13089203e4c7a0ff72. I wanted to contribute it to SQLGlot, but found that support for strings as aliases was recently added to SQLGlot: https://github.com/tobymao/sqlglot/commit/d31ae0decb46678851744356c7b113f8c1c3e8c9
Could DataHub start using that recent version? It's critical for us, because in our case people were actively using " and ' to quote identifiers in T-SQL queries in Tableau.
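For context, this is the shape of query we need to handle (a made-up example; assuming a sqlglot version that includes the commit linked above, this now parses under the tsql dialect):

```python
import sqlglot

# A single-quoted string used as a column alias - common in our Tableau
# Custom SQL, and only recently accepted by sqlglot's T-SQL dialect.
sql = "SELECT amount AS 'Total Amount' FROM dbo.orders"
parsed = sqlglot.parse_one(sql, dialect="tsql")
print(parsed.sql(dialect="tsql"))
```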
> `disable_schema_awarenes_during_parsing_of_sql_queries` and the `schema_aware` flag - I didn't realize that having custom SQL queries digress from the underlying table schemas was a common thing. My goal in building the initial parser was that it would be permissive enough that it could handle missing columns gracefully. Do you have some specific examples where a certain query + schema info would fail, but just the query with no schema info generates results that are more correct? If so, maybe we can fix those instances, and avoid needing to disable schema awareness across the board
I tried to reproduce it with no luck. I just remember that it was failing in the depths of SQLGlot's optimize() function when it gets a schema that is out of sync with the query (a query referencing columns that are not present in the schema), and it was not failing if we didn't pass any schema at all:
```python
try:
    ...
    statement = sqlglot.optimizer.optimizer.optimize(
        statement,
        dialect=dialect,
        schema=sqlglot_db_schema,
        qualify_columns=True,
        validate_qualify_columns=False,
        identify=True,
        # sqlglot calls the db -> schema -> table hierarchy "catalog", "db", "table".
        catalog=default_db,
        db=default_schema,
        rules=RULES_BEFORE_TYPE_ANNOTATION,
    )
except (sqlglot.errors.OptimizeError, ValueError) as e:
    raise SqlUnderstandingError(
        f"sqlglot failed to map columns to their source tables; likely missing/outdated table schema info: {e}"
    ) from e
```
Judging by the message of the new exception that the code throws, I was not the only one who hit this behaviour:

> sqlglot failed to map columns to their source tables; likely missing/outdated table schema info
It seems like one of SQLGlot's optimization rules is buggy. It might even have been fixed already in recent versions. So maybe the DataHub team could update the fork of SQLGlot?
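A minimal sketch of the failure mode described above (hypothetical table/column names; the exact exception and message depend on the sqlglot version):

```python
import sqlglot
from sqlglot.optimizer import optimize

# The query references a column that the (stale) schema no longer has.
sql = "SELECT id, missing_col FROM t"
stale_schema = {"t": {"id": "int"}}

try:
    optimize(sqlglot.parse_one(sql), schema=stale_schema)
except (sqlglot.errors.OptimizeError, ValueError) as e:
    print(f"fails with an out-of-sync schema: {e}")

# Without any schema, the same query optimizes fine: the column is simply
# attributed to the only table in scope instead of being validated.
print(optimize(sqlglot.parse_one(sql)).sql())
```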
For now, I followed your recommendation:
> you can also disable schema-awareness by not passing a graph instance to the SchemaResolver - that feels like a cleaner approach
https://github.com/datahub-project/datahub/pull/9838/commits/076ce37f5d8e64c653eb3db09a8b76760e132ddc
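For reference, a hedged sketch of what that looks like (module paths and exact signatures have moved between DataHub versions, so treat this as an approximation; the platform and table names are made up):

```python
# In recent DataHub versions these live under datahub.sql_parsing;
# older versions had them in datahub.utilities.sqlglot_lineage.
from datahub.sql_parsing.schema_resolver import SchemaResolver
from datahub.sql_parsing.sqlglot_lineage import sqlglot_lineage

# No `graph` argument -> no DataHub lookups of previously ingested
# schemas, i.e. parsing is effectively schema-unaware.
schema_resolver = SchemaResolver(platform="mssql", env="PROD", graph=None)

result = sqlglot_lineage(
    "SELECT o.id, o.total FROM dbo.orders AS o",
    schema_resolver=schema_resolver,
    default_db="sales_db",  # made-up database name
)
print(result.in_tables, result.out_tables)
```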
@hsheth2 I'm looking into the failed tests.
@hsheth2
Judging by the latest test results:
- https://github.com/datahub-project/datahub/actions/runs/7938517997
- https://github.com/datahub-project/datahub/actions/runs/7938517995
These tests failed:
- test_simple_lineage_multiple_paths - seems unrelated to my changes; after updating from upstream/master, it looks like someone has already commented it out (disabled it)
- test_create_list_get_ingestion_execution_request - seems unrelated to my changes (let me know if I'm wrong)
- test_tableau_cll_ingest - I updated the golden files, and now it passes successfully: https://github.com/datahub-project/datahub/pull/9838/commits/f1a05d9d9230dacea3436fefe62b7db0dc239557
Also, I updated my branch from the latest upstream/master. Could you please kick off the tests again?
@hsheth2 resolved conflicts
@hsheth2 I will fix everything tomorrow morning.
tbh the fact that we have `ingest_tables_external` at all feels pretty strange to me - I guess it makes sense for CSVs and things, but for snowflake/other warehouse tables, we really should just rely on those other sources. But I guess it's alright for now, especially as it's disabled by default
It's a very helpful feature for those who can only ingest Tableau and don't have access to other metadata sources. Could we keep it?
> especially as it's disabled by default
yep
@hsheth2 finished; could you start the tests again, please?
@hsheth2
Seems like this test failed:
`test_create_list_get_ingestion_execution_request`

```
E - urn:li:dataHubExecutionRequest:219530d5-775d-4a38-9c93-d393cc8239a4
E + urn:li:dataHubExecutionRequest:af738e1d-4870-4519-af1a-fc88c1d57d6b
```
I checked; it cannot be due to my changes.
@hsheth2 I pulled in the latest changes from master; could you please start the tests again?
@hsheth2 I checked the failed tests, and all of them appear to have failed NOT due to changes introduced by this PR:
```
FAILED tests/integration/s3/test_s3.py::test_data_lake_s3_ingest[multiple_files.json] - Failed: Metadata files differ (use `pytest --update-golden-files` to update):
Urn changed, urn:li:dataset:(urn:li:dataPlatform:s3,test-platform-instance.my-test-bucket/folder_a/folder_aa/folder_aaa/NPS.7.1.package_data_NPS.6.1_ARCN_Lakes_ChemistryData_v1_csv.csv,DEV):
<schemaMetadata> changed:
Item aspect['fields'][20]['type']['type']['com.linkedin.schema.DateType'] added to dictionary.
Item aspect['fields'][20]['type']['type']['com.linkedin.schema.StringType'] removed from dictionary.
Value of aspect['fields'][20]['nativeDataType'] changed from "string" to "date".
FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[multiple_files.json] - Failed: Metadata files differ (use `pytest --update-golden-files` to update):
Urn changed, urn:li:dataset:(urn:li:dataPlatform:file,test-platform-instance.tests/integration/s3/test_data/local_system/folder_a/folder_aa/folder_aaa/NPS.7.1.package_data_NPS.6.1_ARCN_Lakes_ChemistryData_v1_csv.csv,DEV):
<schemaMetadata> changed:
Item aspect['fields'][20]['type']['type']['com.linkedin.schema.DateType'] added to dictionary.
Item aspect['fields'][20]['type']['type']['com.linkedin.schema.StringType'] removed from dictionary.
Value of aspect['fields'][20]['nativeDataType'] changed from "string" to "date".
```
I will fix it if you say so. :)
> @hsheth2 I checked the failed tests, and all of them appear to have failed NOT due to changes introduced by this PR
> ........
> I will fix it if you say so. :)
+1 They fail in my PR #9874 as well.
@alexs-101 @egemenberk I agree those failures look unrelated, and I believe they have been fixed on the master branch. I just synced this branch with master, and hopefully that unblocks CI.
Found a bug: https://github.com/datahub-project/datahub/issues/10357 - the force_extraction_of_lineage_from_custom_sql_queries property is not leveraged and an error is thrown.