ibis
ibis copied to clipboard
feat: Reduce duplicated function calls even after SQL generation
Is your feature request related to a problem?
There is already https://github.com/ibis-project/ibis/issues/8619 and https://github.com/ibis-project/ibis/issues/8484. Both of these, however, are more to do with how we extract/de-duplicate expressions on the ibis side. However, there is another aspect of this that is also important to performance, and that is how we currently generate duplicate code during the actual SQL generation:
import ibis
a = ibis.array([1, 2, 3])
# pretend this is actually expensive
b = a.map(lambda x: x + 1)
c = b.unique()
ibis.to_sql(c)
SELECT
CASE
WHEN LIST_APPLY(
[CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
__ibis_param_x__ -> __ibis_param_x__ + CAST(1 AS TINYINT)
) IS NULL
THEN NULL
ELSE LIST_DISTINCT(
LIST_APPLY(
[CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
__ibis_param_x__ -> __ibis_param_x__ + CAST(1 AS TINYINT)
)
) + CASE
WHEN LIST_COUNT(
LIST_APPLY(
[CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
__ibis_param_x__ -> __ibis_param_x__ + CAST(1 AS TINYINT)
)
) < LENGTH(
LIST_APPLY(
[CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
__ibis_param_x__ -> __ibis_param_x__ + CAST(1 AS TINYINT)
)
)
THEN [NULL]
ELSE []
END
END AS "ArrayDistinct(ArrayMap(Array(), Add(x, 1)))"
If we look into the duckdb compiler we find
def visit_ArrayDistinct(self, op, *, arg):
return self.if_(
arg.is_(NULL),
NULL,
self.f.list_distinct(arg)
+ self.if_(
self.f.list_count(arg) < self.f.len(arg),
self.f.array(NULL),
self.f.array(),
),
)
I think this makes this use case very hard to optimize: Other backends might not be doing the duplication. This means that we can't do some sort of rewrite of arg on the ibis side, since it already is as deduplicated as possible when it is represented inside the ArrayDistinct Operation. The duplication only appears when we actually generate the SQL.
Describe the solution you'd like
Currently I don't think we even consider this sort of optimization. we only do rewrites on the ibis side, once we hand the expression to the various compilers it is up to them. The tricky thing is if we wanted to do the optimization, then I think this info would need to get propagated upwards back into the ibis layer, I think because we would need to implement this as a top-level WITH t999 AS... CTE.
IDK how to approach this, if we want to at all, but I just wanted to throw it out there.
One possible solution would be if we exposed an API for forcing CTEing??
b = a.map(lambda x: x + 1)
b2 = b.as_cte()
c = b2.unique()
I don't like that very much though, it is a level of implementation detail that we don't usually expose.
What version of ibis are you running?
main
What backend(s) are you using, if any?
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
I'm not sure we should do this until we know it's actually an issue after we fix the excessive inlining.
@NickCrews If you have some concrete numbers that show a reliable performance improvement by altering the SQL here in some way, please post them here.
Closing this out. I think we're in an okay place for now!