ibis feat: Reduce duplicated function calls even after SQL generation

Is your feature request related to a problem?

There is already https://github.com/ibis-project/ibis/issues/8619 and https://github.com/ibis-project/ibis/issues/8484. Both of these, however, are more to do with how we extract/de-duplicate expressions on the ibis side. However, there is another aspect of this that is also important to performance, and that is how we currently generate duplicate code during the actual SQL generation:

import ibis

a = ibis.array([1, 2, 3])
# pretend this is actually expensive
b = a.map(lambda x: x + 1)
c = b.unique()
ibis.to_sql(c)

SELECT
  CASE
    WHEN LIST_APPLY(
      [CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
      __ibis_param_x__ -> __ibis_param_x__ + CAST(1 AS TINYINT)
    ) IS NULL
    THEN NULL
    ELSE LIST_DISTINCT(
      LIST_APPLY(
        [CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
        __ibis_param_x__ -> __ibis_param_x__ + CAST(1 AS TINYINT)
      )
    ) + CASE
      WHEN LIST_COUNT(
        LIST_APPLY(
          [CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
          __ibis_param_x__ -> __ibis_param_x__ + CAST(1 AS TINYINT)
        )
      ) < LENGTH(
        LIST_APPLY(
          [CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
          __ibis_param_x__ -> __ibis_param_x__ + CAST(1 AS TINYINT)
        )
      )
      THEN [NULL]
      ELSE []
    END
  END AS "ArrayDistinct(ArrayMap(Array(), Add(x, 1)))"

If we look into the duckdb compiler we find

    def visit_ArrayDistinct(self, op, *, arg):
        return self.if_(
            arg.is_(NULL),
            NULL,
            self.f.list_distinct(arg)
            + self.if_(
                self.f.list_count(arg) < self.f.len(arg),
                self.f.array(NULL),
                self.f.array(),
            ),
        )

I think this makes this use case very hard to optimize: Other backends might not be doing the duplication. This means that we can't do some sort of rewrite of arg on the ibis side, since it already is as deduplicated as possible when it is represented inside the ArrayDistinct Operation. The duplication only appears when we actually generate the SQL.

Describe the solution you'd like

Currently I don't think we even consider this sort of optimization. we only do rewrites on the ibis side, once we hand the expression to the various compilers it is up to them. The tricky thing is if we wanted to do the optimization, then I think this info would need to get propagated upwards back into the ibis layer, I think because we would need to implement this as a top-level WITH t999 AS... CTE.

IDK how to approach this, if we want to at all, but I just wanted to throw it out there.

One possible solution would be if we exposed an API for forcing CTEing??

b = a.map(lambda x: x + 1)
b2 = b.as_cte()
c = b2.unique()

I don't like that very much though, it is a level of implementation detail that we don't usually expose.

What version of ibis are you running?

main

What backend(s) are you using, if any?

No response

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Mar 25 '24 21:03 NickCrews

I'm not sure we should do this until we know it's actually an issue after we fix the excessive inlining.

Apr 02 '24 12:04 cpcloud

@NickCrews If you have some concrete numbers that show a reliable performance improvement by altering the SQL here in some way, please post them here.

Apr 02 '24 12:04 cpcloud

Closing this out. I think we're in an okay place for now!

Jun 25 '24 13:06 cpcloud

ibis ibis copied to clipboard

feat: Reduce duplicated function calls even after SQL generation

Is your feature request related to a problem?

Describe the solution you'd like

What version of ibis are you running?

What backend(s) are you using, if any?

Code of Conduct

ibis
ibis copied to clipboard