sqllineage icon indicating copy to clipboard operation
sqllineage copied to clipboard

column lineage parse result error

Open delphisharp opened this issue 1 year ago • 1 comments

Describe the bug For example:

v_sql="""
INSERT OVERWRITE TABLE target 
SELECT NVL(PROV_CODE,'999') aaa,COUNT(DISTINCT MSISDN) bbb FROM (
SELECT NVL(PROV_CODE,'000') PROV_CODE,A.MSISDN FROM (
SELECT MSISDN FROM (
	SELECT MSISDN, BUSI_ID
	FROM source1 A
	UNION ALL
	SELECT  concat(A.MSISDN,'20230826') MSISDN,A.MUSIC_BUSI_CODE
	 FROM source2 A
	 union all
	 SELECT concat(A.MSISDN,'20230826') MSISDN, BUSI_code
	FROM source4 A
) C GROUP BY MSISDN
) A
LEFT JOIN source3 B
ON SUBSTRING(A.MSISDN,1,7)=B.MSISDN_NBR_PAR
) T GROUP BY PROV_CODE GROUPING SETS ((),PROV_CODE);
"""
from sqllineage.runner import LineageRunner
parse = LineageRunner(sql=v_sql,dialect='sparksql')
parse.print_column_lineage()
<default>.target.aaa <- t.prov_code <- prov_code
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.a.msisdn
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.source1.msisdn

but if sql is lower():

from sqllineage.runner import LineageRunner
parse = LineageRunner(sql=v_sql.lower(),dialect='sparksql')
parse.print_column_lineage()
<default>.target.aaa <- t.prov_code <- prov_code
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.source1.msisdn
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.source2.msisdn
<default>.target.bbb <- t.msisdn <- a.msisdn <- c.msisdn <- <default>.source4.msisdn

Expected behavior

  1. when sql is upper(), column lineage bbb is error
  2. column lineage aaa is error. should be is source3

Python version (available via python --version)

  • 3.10.13

SQLLineage version (available via sqllineage --version):

  • 1.4.9

delphisharp avatar Jan 04 '24 10:01 delphisharp

There're two issues we need to solve.

  1. when sql is upper(), column lineage bbb is error

Looks like we have some issue with upper case alias used together with UNION. A minimal example with same issue:

INSERT OVERWRITE TABLE TARGET
SELECT MSISDN, BUSI_ID
FROM SOURCE1
UNION ALL
SELECT  CONCAT(A.MSISDN,'20230826') MSISDN, A.MUSIC_BUSI_CODE
FROM SOURCE2 A

Change alias A to lower case a generate correct output.

  1. column lineage aaa is error. should be is source3

Right we we say we don't know if aaa is from subquery a or table source3. But actually we can be smarter, because a is subquery contains only one column named msisdn, which makes table source3 the only possibility. But this "smart logic" is not in our code yet.

reata avatar Jan 06 '24 14:01 reata