ibis
perf: .mutate() on wide tables is slow
I'm trying to track down why an expression (used in an ETL process) takes around 5 seconds to construct.
The expression builds about 100 subtables from a main table (which has ~100 columns). Each subtable contains 4-10 columns, and the 100 subtables are then concatenated via a union operation.
Constructing the 100 subtables takes almost all of the expression-building time. Each subtable involves a mutate on the main table followed by a second mutate, so building them makes ~200 calls to Table.mutate. The hot spots are all inside the ibis expression-building code, in particular DerefMap.from_targets and DerefMap.dereference. Both are called from Expr.bind, which Table.mutate invokes directly and again indirectly through Table.select.
Are there any known places within these functions where optimisations might be possible (memoisation, a different data structure for faster lookup/search, etc.)?
I will try to come up with a reproducible example and follow up separately.
Results from line profiling the ibis internal functions while building the expression are provided below. Note that lprof introduces overhead, so what takes 5s without it takes 15s with it. All times below are in seconds.
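Since line profiling roughly triples the runtime here, a function-level profile from the standard library's cProfile may give a less distorted view of the relative costs. A minimal sketch, where `build_expression` is a hypothetical placeholder for the real expression-building code:

```python
import cProfile
import io
import pstats


def build_expression():
    """Hypothetical placeholder for the real expression-building code."""
    return sum(i * i for i in range(10_000))


profiler = cProfile.Profile()
profiler.enable()
result = build_expression()
profiler.disable()

# Render the ten most expensive functions by cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

This gives per-function rather than per-line timings, so it complements the line-level tables below and does not replace them.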
I'm using the latest release of ibis, 10.2.0.
Thanks!
Table.mutate
Total time: 12.6961 s
File: [python3.12/site-packages/ibis/expr/types/relations.py]
Function: mutate at line 1866
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1866 def mutate(self, *exprs: Sequence[ir.Expr] | None, **mutations: ir.Value) -> Table:
...
1954 212 0.0 0.0 0.0 node = self.op()
1955 212 3.9 0.0 30.6 values = self.bind(*exprs, **mutations)
1956 212 0.0 0.0 0.0 values = unwrap_aliases(values)
1957 # allow overriding of fields, hence the mutation behavior
1958 212 0.9 0.0 7.1 values = {**node.fields, **values}
1959 212 7.9 0.0 62.3 return self.select(**values)
Expr.bind
Total time: 11.9379 s
File: [python3.12/site-packages/ibis/expr/types/relations.py]
Function: bind at line 270
Line # Hits Time Per Hit % Time Line Contents
==============================================================
270 def bind(self, *args: Any, **kwargs: Any) -> tuple[Value, ...]:
...
289 708 0.5 0.0 3.9 values = self._fast_bind(*args, **kwargs)
290 # dereference the values to `self`
291 708 9.1 0.0 75.9 dm = DerefMap.from_targets(self.op())
292 708 0.0 0.0 0.0 result = []
293 28366 0.0 0.0 0.1 for original in values:
294 27658 2.4 0.0 19.7 value = dm.dereference(original.op()).to_expr()
295 27658 0.0 0.0 0.4 value = value.name(original.get_name())
296 27658 0.0 0.0 0.1 result.append(value)
297 708 0.0 0.0 0.0 return tuple(result)
Table.select
Total time: 9.77739 s
File: [python3.12/site-packages/ibis/expr/types/relations.py]
Function: select at line 1961
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1961 def select(
1962 self,
1963 *exprs: ir.Value | str | Iterable[ir.Value | str],
1964 **named_exprs: ir.Value | str,
1965 ) -> Table:
...
2138 305 0.0 0.0 0.0 from ibis.expr.rewrites import rewrite_project_input
2139
2140 305 6.8 0.0 69.7 values = self.bind(*exprs, **named_exprs)
2141 305 0.0 0.0 0.4 values = unwrap_aliases(values)
2142 305 0.0 0.0 0.0 if not values:
2143 raise com.IbisTypeError(
2144 "You must select at least one column for a valid projection"
2145 )
2146
2147 # we need to detect reductions which are either turned into window functions
2148 # or scalar subqueries depending on whether they are originating from self
2149 610 0.0 0.0 0.0 values = {
2150 26987 2.4 0.0 25.0 k: rewrite_project_input(v, relation=self.op()) for k, v in values.items()
2151 }
2152 305 0.5 0.0 4.8 return ops.Project(self, values).to_expr()
DerefMap.from_targets
Total time: 8.47946 s
File: [python3.12/site-packages/ibis/expr/rewrites.py]
Function: from_targets at line 63
Line # Hits Time Per Hit % Time Line Contents
==============================================================
63 @classmethod
64 def from_targets(cls, rels, extra=None):
...
81 708 0.0 0.0 0.0 rels = promote_list(rels)
82 708 0.0 0.0 0.0 mapping = defaultdict(dict)
83 1416 0.0 0.0 0.0 for rel in rels:
84 81185 3.1 0.0 36.3 for field in rel.fields.values():
85 377543 0.5 0.0 5.6 for value, distance in cls.backtrack(field):
86 297066 0.7 0.0 7.8 mapping[value][field] = distance
87
88 708 0.0 0.0 0.0 subs, ambigs = {}, {}
89 296471 0.1 0.0 0.8 for from_, to in mapping.items():
90 295763 0.1 0.0 1.6 mindist = min(to.values())
91 592829 0.3 0.0 3.0 minkeys = [k for k, v in to.items() if v == mindist]
92 # if all the closest fields are from the same relation, then we
93 # can safely substitute them and we pick the first one arbitrarily
94 295763 0.4 0.0 4.6 if all(minkeys[0].relations == k.relations for k in minkeys):
95 295763 0.2 0.0 2.0 subs[from_] = minkeys[0]
96 else:
97 ambigs[from_] = minkeys
98
99 708 0.0 0.0 0.0 if extra is not None:
100 subs.update(extra)
101
102 708 3.2 0.0 38.2 return cls(rels, subs, ambigs)
DerefMap.dereference
Total time: 2.16665 s
File: [python3.12/site-packages/ibis/expr/rewrites.py]
Function: dereference at line 135
Line # Hits Time Per Hit % Time Line Contents
==============================================================
135 def dereference(self, value):
...
151 27658 0.7 0.0 30.5 ambigs = value.find(lambda x: x in self.ambigs, filter=ops.Value)
152 27658 0.0 0.0 0.3 if ambigs:
153 raise IbisInputError(
154 f"Ambiguous field reference {ambigs!r} in expression {value!r}"
155 )
156 27658 1.5 0.0 69.2 return value.replace(self.subs, filter=ops.Value)