sqllineage
sqllineage copied to clipboard
column lineage parse result error
Describe the bug For example:
v_sql="""
INSERT OVERWRITE TABLE target
SELECT NVL(PROV_CODE,'999') aaa,COUNT(DISTINCT MSISDN) bbb FROM (
SELECT NVL(PROV_CODE,'000') PROV_CODE,A.MSISDN FROM (
SELECT MSISDN FROM (
SELECT MSISDN, BUSI_ID
FROM source1 A
UNION ALL
SELECT concat(A.MSISDN,'20230826') MSISDN,A.MUSIC_BUSI_CODE
FROM source2 A
union all
SELECT concat(A.MSISDN,'20230826') MSISDN, BUSI_code
FROM source4 A
) C GROUP BY MSISDN
) A
LEFT JOIN source3 B
ON SUBSTRING(A.MSISDN,1,7)=B.MSISDN_NBR_PAR
) T GROUP BY PROV_CODE GROUPING SETS ((),PROV_CODE);
"""
from sqllineage.runner import LineageRunner
parse = LineageRunner(sql=v_sql,dialect='sparksql')
parse.print_column_lineage()
<default>.target.aaa <- t.prov_code <- prov_code
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.a.msisdn
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.source1.msisdn
but if sql is lower():
from sqllineage.runner import LineageRunner
parse = LineageRunner(sql=v_sql.lower(),dialect='sparksql')
parse.print_column_lineage()
<default>.target.aaa <- t.prov_code <- prov_code
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.source1.msisdn
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.source2.msisdn
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.source4.msisdn
Expected behavior
- when sql is upper(), column lineage
bbb
is error - column lineage
aaa
is error. should be is source3
Python version (available via python --version
)
- 3.10.13
SQLLineage version (available via sqllineage --version
):
- 1.4.9
There're two issues we need to solve.
- when sql is upper(), column lineage bbb is error
Looks like we have some issue with upper case alias used together with UNION. A minimal example with same issue:
INSERT OVERWRITE TABLE TARGET
SELECT MSISDN, BUSI_ID
FROM SOURCE1
UNION ALL
SELECT CONCAT(A.MSISDN,'20230826') MSISDN, A.MUSIC_BUSI_CODE
FROM SOURCE2 A
Change alias A
to lower case a
generate correct output.
- column lineage aaa is error. should be is source3
Right we we say we don't know if aaa is from subquery a or table source3. But actually we can be smarter, because a is subquery contains only one column named msisdn, which makes table source3 the only possibility. But this "smart logic" is not in our code yet.