sqllineage icon indicating copy to clipboard operation
sqllineage copied to clipboard

Correlated scalar subqueries Column Resolution

Open maoxingda opened this issue 1 year ago • 3 comments

SQL

create table src_tbl1 ( c2 int, c3 int );
create table src_tbl2 ( c1 int, c2 int, c4 int );

insert into
    tgt_tbl1
select 
    c1, 
    (select c2 || c3 || c4 from src_tbl1) as c2
from 
    src_tbl2
;

To Reproduce

Note here we refer to SQL provided in prior step as stored in a file named test.sql

from sqllineage.runner import LineageRunner

with open("test.sql") as f:
    sql = f.read()

schema = {
    "src_tbl1": ["c2", "c3"],
    "src_tbl2": ["c1", "c2", "c4"],
}
lr = LineageRunner(sql, dialect="ansi")
# lr = LineageRunner(sql, dialect='ansi', metadata_provider=DummyMetaDataProvider(schema))

lr.print_column_lineage()

Actual behavior

<default>.tgt_tbl1.c1 <- <default>.src_tbl2.c1
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c2
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c3
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c4

Expected behavior without metadata

<default>.tgt_tbl1.c1 <- <default>.src_tbl2.c1
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c2
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c3
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c4
<default>.tgt_tbl1.c2 <- <default>.src_tbl2.c2
<default>.tgt_tbl1.c2 <- <default>.src_tbl2.c3
<default>.tgt_tbl1.c2 <- <default>.src_tbl2.c4

Expected behavior with metadata

<default>.tgt_tbl1.c1 <- <default>.src_tbl2.c1
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c2
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c3
<default>.tgt_tbl1.c2 <- <default>.src_tbl2.c4

Python version (available via python --version)

  • 3.10

SQLLineage version (available via sqllineage --version):

  • 1.5.0

Additional context

maoxingda avatar Jan 27 '24 12:01 maoxingda

This is yet another moment when I feel amazed that it's allowed to write SQL like this.

To summarize, c2, c3 and c4 can come from src_tbl1 and src_tbl2

select
    c1,
    (select c2 || c3 || c4 from src_tbl1) as c2
------------^^^
from
    src_tbl2
;

Again, tried a few open source SQL database:

dialect version src_tbl1 returns multiple rows src_tbl1 returns 1 row
mysql 8.2.0 ERROR 1242 (21000): Subquery returns more than 1 row yes
postgres 16.1 ERROR: more than one row returned by a subquery used as an expression yes
hive 3.1.3 CalciteSubquerySemanticException [Error 10249]: Unsupported SubQuery Expression Currently SubQuery expressions are only allowed as Where and Having Clause predicates (state=42000,code=10249) same error
sparksql 3.5.0 [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY] Unsupported subquery expression: Correlated scalar subqueries must be aggregated to return at most one row. same error
trino 435 Given correlated subquery is not supported same error

When src_tbl1 returns multiple rows, this SQL won't be executable. Hive/Spark/Trino disallow this syntax at parsing phase. MySQL/Postgres postpone throwing exception until execution and would allow if the subquery returns 1 row at execution time.

Generally speaking, I think it's a very narrow use case. And it requires a lot of change to sqllineage (we might need to introduce the concept of lexical scope, like Python doing variable lookup from local -> enclosed -> global -> built-in). That's error-prune and not adding too much value.

reata avatar Jan 30 '24 13:01 reata

Our SQL does not have this kind of syntax. I just found this syntax when looking at the underlying sqlfluff, so I brought it up. Can be processed or not.

maoxingda avatar Jan 30 '24 13:01 maoxingda

OK. We'll keep it open and take it as low priority then.

reata avatar Jan 30 '24 14:01 reata