sqllineage
sqllineage copied to clipboard
Correlated scalar subqueries Column Resolution
SQL
create table src_tbl1 ( c2 int, c3 int );
create table src_tbl2 ( c1 int, c2 int, c4 int );
insert into
tgt_tbl1
select
c1,
(select c2 || c3 || c4 from src_tbl1) as c2
from
src_tbl2
;
To Reproduce
Note here we refer to SQL provided in prior step as stored in a file named test.sql
from sqllineage.runner import LineageRunner
with open("test.sql") as f:
sql = f.read()
schema = {
"src_tbl1": ["c2", "c3"],
"src_tbl2": ["c1", "c2", "c4"],
}
lr = LineageRunner(sql, dialect="ansi")
# lr = LineageRunner(sql, dialect='ansi', metadata_provider=DummyMetaDataProvider(schema))
lr.print_column_lineage()
Actual behavior
<default>.tgt_tbl1.c1 <- <default>.src_tbl2.c1
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c2
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c3
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c4
Expected behavior without metadata
<default>.tgt_tbl1.c1 <- <default>.src_tbl2.c1
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c2
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c3
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c4
<default>.tgt_tbl1.c2 <- <default>.src_tbl2.c2
<default>.tgt_tbl1.c2 <- <default>.src_tbl2.c3
<default>.tgt_tbl1.c2 <- <default>.src_tbl2.c4
Expected behavior with metadata
<default>.tgt_tbl1.c1 <- <default>.src_tbl2.c1
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c2
<default>.tgt_tbl1.c2 <- <default>.src_tbl1.c3
<default>.tgt_tbl1.c2 <- <default>.src_tbl2.c4
Python version (available via python --version
)
- 3.10
SQLLineage version (available via sqllineage --version
):
- 1.5.0
Additional context
This is yet another moment when I feel amazed that it's allowed to write SQL like this.
To summarize, c2, c3 and c4 can come from src_tbl1 and src_tbl2
select
c1,
(select c2 || c3 || c4 from src_tbl1) as c2
------------^^^
from
src_tbl2
;
Again, tried a few open source SQL database:
dialect | version | src_tbl1 returns multiple rows | src_tbl1 returns 1 row |
---|---|---|---|
mysql | 8.2.0 | ERROR 1242 (21000): Subquery returns more than 1 row | yes |
postgres | 16.1 | ERROR: more than one row returned by a subquery used as an expression | yes |
hive | 3.1.3 | CalciteSubquerySemanticException [Error 10249]: Unsupported SubQuery Expression Currently SubQuery expressions are only allowed as Where and Having Clause predicates (state=42000,code=10249) | same error |
sparksql | 3.5.0 | [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY] Unsupported subquery expression: Correlated scalar subqueries must be aggregated to return at most one row. | same error |
trino | 435 | Given correlated subquery is not supported | same error |
When src_tbl1 returns multiple rows, this SQL won't be executable. Hive/Spark/Trino disallow this syntax at parsing phase. MySQL/Postgres postpone throwing exception until execution and would allow if the subquery returns 1 row at execution time.
Generally speaking, I think it's a very narrow use case. And it requires a lot of change to sqllineage (we might need to introduce the concept of lexical scope, like Python doing variable lookup from local -> enclosed -> global -> built-in). That's error-prune and not adding too much value.
Our SQL does not have this kind of syntax. I just found this syntax when looking at the underlying sqlfluff, so I brought it up. Can be processed or not.
OK. We'll keep it open and take it as low priority then.