datachain icon indicating copy to clipboard operation
datachain copied to clipboard

SQL compilation generates too many SELECTs

Open rlamy opened this issue 1 year ago • 0 comments

The SQL queries we generate are often unnecessarily complicated, particularly because they use way too many nested SELECTs. For instance, this:

from datachain import DataChain, C

chain = DataChain.from_values(a=range(10)).mutate(b=C("a")).order_by("b").select("b")
print(chain._query.apply_steps().select())

returns

SELECT anon_1.sys__id, anon_1.sys__rand, anon_1.b 
FROM (SELECT anon_2.sys__id AS sys__id, anon_2.sys__rand AS sys__rand, anon_2.a AS a, anon_2.b AS b 
FROM (SELECT anon_3.sys__id AS sys__id, anon_3.sys__rand AS sys__rand, anon_3.a AS a, a AS b 
FROM (SELECT anon_4.sys__id AS sys__id, anon_4.sys__rand AS sys__rand, anon_4.a AS a 
FROM (SELECT "udf_B4XCPB".sys__id AS sys__id, "udf_B4XCPB".sys__rand AS sys__rand, "udf_B4XCPB".a AS a 
FROM "udf_B4XCPB") AS anon_4) AS anon_3) AS anon_2 ORDER BY anon_2.b) AS anon_1

with 5 SELECTs even though only one is needed.

That may not be much of a problem in practice since reasonable DB engines should optimise that query so that it's equivalent to to the single-select version. However:

  • We don't actually know that all DBs will properly optimise the query.
  • That could cause even moderately complex chains to hit statement size limits.
  • That makes it hard to debug SQL generation.

I expect that QueryGenerator is partly to blame for this (because it forces us to call select() after every query step), so #456 should mitigate this issue.

rlamy avatar Sep 24 '24 15:09 rlamy