bug: Context shared across multiple rewrite operations
What happened?
Running the code below results in a error due to a pyspark specific rewrite rule.
import ibis
from pyspark.sql import SparkSession
backend = ibis.pyspark.connect(SparkSession.builder.getOrCreate())
ibis.set_backend(backend)
table = ibis.memtable({"test": [1, 2, 3, 4, 5]})
result = table.mutate(new_col=ibis.ntile(2).over(order_by=ibis.random())).limit(10).to_pandas()
The issue can be trivially solved by adding a **kwargs argument to the offset_to_filter rewrite rule: https://github.com/ibis-project/ibis/blob/5bb4971863b2060f1545de425eacb40168b541e5/ibis/backends/pyspark/compiler.py#L26
Nevertheless, it seems unusual that this rewrite rule should be provided with the parameter y. Doing some investigation in the code it seems like a previous rewrite rule add_order_by_to_empty_ranking_window_functions is responsible for the addition of the parameter: https://github.com/ibis-project/ibis/blob/5bb4971863b2060f1545de425eacb40168b541e5/ibis/backends/sql/rewrites.py#L321
I am not familiar with the details of ibis's implementation I am not entirely sure if this is intentional behavior. I could track the root cause to a shared dictionary in the replacement logic which is created once and then shared among all rewrite operations: https://github.com/ibis-project/ibis/blob/5bb4971863b2060f1545de425eacb40168b541e5/ibis/common/graph.py#L179
In a quick test I fixed the issue by moving the creation of the dictionary into the local function here, which ensures that ctx is always local to the the current pattern being processed and that state does not carry over to other rewrite rules. Interestingly, this does not seem to break any of the tests, so I assume this behaviour might be unintentional. A better fix might be to ensure that the context variable is not changed when calling obj.match(recreated, ctx) but that depends on the design considerations of the project.
Happy to provide a fix PR after some guidance on what the right way to move forward would be 🙂 .
What version of ibis are you using?
9.1.0
What backend(s) are you using, if any?
PySpark, issue seems to be independent of backend though.
Relevant log output
Traceback (most recent call last):
File "/Users/hornm/Software/ibis/investigate_issue.py", line 7, in <module>
result = table.mutate(new_col=ibis.ntile(2).over(order_by=ibis.random())).limit(10).to_pandas()
File "/Users/hornm/Software/ibis/ibis/expr/types/relations.py", line 3545, in to_pandas
return self.execute(**kwargs)
File "/Users/hornm/Software/ibis/ibis/expr/types/core.py", line 393, in execute
return self._find_backend(use_default=True).execute(
File "/Users/hornm/Software/ibis/ibis/backends/pyspark/__init__.py", line 363, in execute
sql = self.compile(table, params=params, limit=limit, **kwargs)
File "/Users/hornm/Software/ibis/ibis/backends/sql/__init__.py", line 177, in compile
query = self._to_sqlglot(expr, limit=limit, params=params, **kwargs)
File "/Users/hornm/Software/ibis/ibis/backends/sql/__init__.py", line 159, in _to_sqlglot
sql = self.compiler.translate(table_expr.op(), params=params)
File "/Users/hornm/Software/ibis/ibis/backends/sql/compiler.py", line 525, in translate
op, ctes = sqlize(
File "/Users/hornm/Software/ibis/ibis/backends/sql/rewrites.py", line 284, in sqlize
node = node.replace(reduce(operator.or_, rewrites))
File "/Users/hornm/Software/ibis/ibis/common/graph.py", line 464, in replace
results = self.map(replacer, filter=filter)
File "/Users/hornm/Software/ibis/ibis/common/graph.py", line 265, in map
results[node] = fn(node, results, **kwargs)
File "/Users/hornm/Software/ibis/ibis/common/graph.py", line 188, in fn
if (result := obj.match(recreated, ctx)) is NoMatch:
File "/Users/hornm/Software/ibis/ibis/common/patterns.py", line 932, in match
result = pattern.match(value, context)
File "/Users/hornm/Software/ibis/ibis/common/patterns.py", line 932, in match
result = pattern.match(value, context)
File "/Users/hornm/Software/ibis/ibis/common/patterns.py", line 932, in match
result = pattern.match(value, context)
[Previous line repeated 1 more time]
File "/Users/hornm/Software/ibis/ibis/common/patterns.py", line 391, in match
return self.replacer.resolve(context)
File "/Users/hornm/Software/ibis/ibis/common/deferred.py", line 326, in resolve
return self.func(**context)
TypeError: offset_to_filter() got an unexpected keyword argument 'y'
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
@ExpectationMax Thanks for the issue! Great job on analyzing the problem. Care to submit a PR and we can continue the discussion there?
Hey @cpcloud, happy to send a PR. Some advice ahead of time would be helpful though. Mainly I am not sure if the sharing of context is somehow and thus if the issue should best be fixed by changing the position at which the dict is defined or simply by adding **kwargs to the rewrite rule.
Generally, it looks like all rewrite rules contain a kwargs parameter which I suspect is the reason why this problem did not pop up earlier. Is there a reason for this parameter to be present and what type of information would typically be passed via kwargs?
Accumulating the context is unintentional as far as I can tell.
PR incoming with the suggested change of localizing the context mapping.