ibis icon indicating copy to clipboard operation
ibis copied to clipboard

perf: .mutate() on wide tables is slow

Open jc-5s opened this issue 8 months ago • 4 comments

I'm trying to trace through the cause for an expression (used for an ETL process) that takes around 5 seconds to construct.

The expression involves building about 100 subtables from a main table (which has ~100 columns), each subtable contains 4-10 columns. The 100 subtables are then concatenated via a union operation

Constructing the 100 subtables takes almost all the expression building time. Each subtable involves a mutate on the main table, and then a 2nd mutate, so there are ~200 calls to Table.mutate made in building the 100 subtables. Key pain points within this process are all within ibis expression building code, in particular DerefMap.from_targets and DerefMap.dereference which are used internally by Expr.bind and Table.select, in turn used by Table.mutate

Are there any known places within these functions where optimisations might be possible (memoisation / different data structure for faster lookup/search etc)?

I will try to come up with a reproducible example and follow up separately

Results from line profiling the ibis internal functions when building the expression are provided below. Note that using lprof introduces overhead so what took 5s without it takes 15s with it. Times below are all in seconds

I'm using latest release of ibis 10.2.0

Thanks!

Table.mutate

Total time: 12.6961 s
File: [python3.12/site-packages/ibis/expr/types/relations.py]
Function: mutate at line 1866

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1866                                               def mutate(self, *exprs: Sequence[ir.Expr] | None, **mutations: ir.Value) -> Table:
...
  1954       212          0.0      0.0      0.0          node = self.op()
  1955       212          3.9      0.0     30.6          values = self.bind(*exprs, **mutations)
  1956       212          0.0      0.0      0.0          values = unwrap_aliases(values)
  1957                                                   # allow overriding of fields, hence the mutation behavior
  1958       212          0.9      0.0      7.1          values = {**node.fields, **values}
  1959       212          7.9      0.0     62.3          return self.select(**values)

Expr.bind

Total time: 11.9379 s
File: [python3.12/site-packages/ibis/expr/types/relations.py]
Function: bind at line 270

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   270                                               def bind(self, *args: Any, **kwargs: Any) -> tuple[Value, ...]:
...
   289       708          0.5      0.0      3.9          values = self._fast_bind(*args, **kwargs)
   290                                                   # dereference the values to `self`
   291       708          9.1      0.0     75.9          dm = DerefMap.from_targets(self.op())
   292       708          0.0      0.0      0.0          result = []
   293     28366          0.0      0.0      0.1          for original in values:
   294     27658          2.4      0.0     19.7              value = dm.dereference(original.op()).to_expr()
   295     27658          0.0      0.0      0.4              value = value.name(original.get_name())
   296     27658          0.0      0.0      0.1              result.append(value)
   297       708          0.0      0.0      0.0          return tuple(result)

Table.select

Total time: 9.77739 s
File: [python3.12/site-packages/ibis/expr/types/relations.py]
Function: select at line 1961

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1961                                               def select(
  1962                                                   self,
  1963                                                   *exprs: ir.Value | str | Iterable[ir.Value | str],
  1964                                                   **named_exprs: ir.Value | str,
  1965                                               ) -> Table:
...
  2138       305          0.0      0.0      0.0          from ibis.expr.rewrites import rewrite_project_input
  2139                                           
  2140       305          6.8      0.0     69.7          values = self.bind(*exprs, **named_exprs)
  2141       305          0.0      0.0      0.4          values = unwrap_aliases(values)
  2142       305          0.0      0.0      0.0          if not values:
  2143                                                       raise com.IbisTypeError(
  2144                                                           "You must select at least one column for a valid projection"
  2145                                                       )
  2146                                           
  2147                                                   # we need to detect reductions which are either turned into window functions
  2148                                                   # or scalar subqueries depending on whether they are originating from self
  2149       610          0.0      0.0      0.0          values = {
  2150     26987          2.4      0.0     25.0              k: rewrite_project_input(v, relation=self.op()) for k, v in values.items()
  2151                                                   }
  2152       305          0.5      0.0      4.8          return ops.Project(self, values).to_expr()

DerefMap.from_targets

Total time: 8.47946 s
File: [python3.12/site-packages/ibis/expr/rewrites.py]
Function: from_targets at line 63

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    63                                               @classmethod
    64                                               def from_targets(cls, rels, extra=None):
...
    81       708          0.0      0.0      0.0          rels = promote_list(rels)
    82       708          0.0      0.0      0.0          mapping = defaultdict(dict)
    83      1416          0.0      0.0      0.0          for rel in rels:
    84     81185          3.1      0.0     36.3              for field in rel.fields.values():
    85    377543          0.5      0.0      5.6                  for value, distance in cls.backtrack(field):
    86    297066          0.7      0.0      7.8                      mapping[value][field] = distance
    87                                           
    88       708          0.0      0.0      0.0          subs, ambigs = {}, {}
    89    296471          0.1      0.0      0.8          for from_, to in mapping.items():
    90    295763          0.1      0.0      1.6              mindist = min(to.values())
    91    592829          0.3      0.0      3.0              minkeys = [k for k, v in to.items() if v == mindist]
    92                                                       # if all the closest fields are from the same relation, then we
    93                                                       # can safely substitute them and we pick the first one arbitrarily
    94    295763          0.4      0.0      4.6              if all(minkeys[0].relations == k.relations for k in minkeys):
    95    295763          0.2      0.0      2.0                  subs[from_] = minkeys[0]
    96                                                       else:
    97                                                           ambigs[from_] = minkeys
    98                                           
    99       708          0.0      0.0      0.0          if extra is not None:
   100                                                       subs.update(extra)
   101                                           
   102       708          3.2      0.0     38.2          return cls(rels, subs, ambigs)

DerefMap.dereference

Total time: 2.16665 s
File: [python3.12/site-packages/ibis/expr/rewrites.py]
Function: dereference at line 135

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   135                                               def dereference(self, value):
...
   151     27658          0.7      0.0     30.5          ambigs = value.find(lambda x: x in self.ambigs, filter=ops.Value)
   152     27658          0.0      0.0      0.3          if ambigs:
   153                                                       raise IbisInputError(
   154                                                           f"Ambiguous field reference {ambigs!r} in expression {value!r}"
   155                                                       )
   156     27658          1.5      0.0     69.2          return value.replace(self.subs, filter=ops.Value)

jc-5s avatar Mar 07 '25 17:03 jc-5s