ibis icon indicating copy to clipboard operation
ibis copied to clipboard

bug: Context shared across multiple rewrite operations

Open ExpectationMax opened this issue 1 year ago • 2 comments

What happened?

Running the code below results in a error due to a pyspark specific rewrite rule.

import ibis
from pyspark.sql import SparkSession
backend = ibis.pyspark.connect(SparkSession.builder.getOrCreate())
ibis.set_backend(backend)

table = ibis.memtable({"test": [1, 2, 3, 4, 5]})
result = table.mutate(new_col=ibis.ntile(2).over(order_by=ibis.random())).limit(10).to_pandas()

The issue can be trivially solved by adding a **kwargs argument to the offset_to_filter rewrite rule: https://github.com/ibis-project/ibis/blob/5bb4971863b2060f1545de425eacb40168b541e5/ibis/backends/pyspark/compiler.py#L26

Nevertheless, it seems unusual that this rewrite rule should be provided with the parameter y. Doing some investigation in the code it seems like a previous rewrite rule add_order_by_to_empty_ranking_window_functions is responsible for the addition of the parameter: https://github.com/ibis-project/ibis/blob/5bb4971863b2060f1545de425eacb40168b541e5/ibis/backends/sql/rewrites.py#L321

I am not familiar with the details of ibis's implementation I am not entirely sure if this is intentional behavior. I could track the root cause to a shared dictionary in the replacement logic which is created once and then shared among all rewrite operations: https://github.com/ibis-project/ibis/blob/5bb4971863b2060f1545de425eacb40168b541e5/ibis/common/graph.py#L179

In a quick test I fixed the issue by moving the creation of the dictionary into the local function here, which ensures that ctx is always local to the the current pattern being processed and that state does not carry over to other rewrite rules. Interestingly, this does not seem to break any of the tests, so I assume this behaviour might be unintentional. A better fix might be to ensure that the context variable is not changed when calling obj.match(recreated, ctx) but that depends on the design considerations of the project.

Happy to provide a fix PR after some guidance on what the right way to move forward would be 🙂 .

What version of ibis are you using?

9.1.0

What backend(s) are you using, if any?

PySpark, issue seems to be independent of backend though.

Relevant log output

Traceback (most recent call last):
  File "/Users/hornm/Software/ibis/investigate_issue.py", line 7, in <module>
    result = table.mutate(new_col=ibis.ntile(2).over(order_by=ibis.random())).limit(10).to_pandas()
  File "/Users/hornm/Software/ibis/ibis/expr/types/relations.py", line 3545, in to_pandas
    return self.execute(**kwargs)
  File "/Users/hornm/Software/ibis/ibis/expr/types/core.py", line 393, in execute
    return self._find_backend(use_default=True).execute(
  File "/Users/hornm/Software/ibis/ibis/backends/pyspark/__init__.py", line 363, in execute
    sql = self.compile(table, params=params, limit=limit, **kwargs)
  File "/Users/hornm/Software/ibis/ibis/backends/sql/__init__.py", line 177, in compile
    query = self._to_sqlglot(expr, limit=limit, params=params, **kwargs)
  File "/Users/hornm/Software/ibis/ibis/backends/sql/__init__.py", line 159, in _to_sqlglot
    sql = self.compiler.translate(table_expr.op(), params=params)
  File "/Users/hornm/Software/ibis/ibis/backends/sql/compiler.py", line 525, in translate
    op, ctes = sqlize(
  File "/Users/hornm/Software/ibis/ibis/backends/sql/rewrites.py", line 284, in sqlize
    node = node.replace(reduce(operator.or_, rewrites))
  File "/Users/hornm/Software/ibis/ibis/common/graph.py", line 464, in replace
    results = self.map(replacer, filter=filter)
  File "/Users/hornm/Software/ibis/ibis/common/graph.py", line 265, in map
    results[node] = fn(node, results, **kwargs)
  File "/Users/hornm/Software/ibis/ibis/common/graph.py", line 188, in fn
    if (result := obj.match(recreated, ctx)) is NoMatch:
  File "/Users/hornm/Software/ibis/ibis/common/patterns.py", line 932, in match
    result = pattern.match(value, context)
  File "/Users/hornm/Software/ibis/ibis/common/patterns.py", line 932, in match
    result = pattern.match(value, context)
  File "/Users/hornm/Software/ibis/ibis/common/patterns.py", line 932, in match
    result = pattern.match(value, context)
  [Previous line repeated 1 more time]
  File "/Users/hornm/Software/ibis/ibis/common/patterns.py", line 391, in match
    return self.replacer.resolve(context)
  File "/Users/hornm/Software/ibis/ibis/common/deferred.py", line 326, in resolve
    return self.func(**context)
TypeError: offset_to_filter() got an unexpected keyword argument 'y'

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

ExpectationMax avatar Jun 21 '24 16:06 ExpectationMax

@ExpectationMax Thanks for the issue! Great job on analyzing the problem. Care to submit a PR and we can continue the discussion there?

cpcloud avatar Jun 21 '24 17:06 cpcloud

Hey @cpcloud, happy to send a PR. Some advice ahead of time would be helpful though. Mainly I am not sure if the sharing of context is somehow and thus if the issue should best be fixed by changing the position at which the dict is defined or simply by adding **kwargs to the rewrite rule. Generally, it looks like all rewrite rules contain a kwargs parameter which I suspect is the reason why this problem did not pop up earlier. Is there a reason for this parameter to be present and what type of information would typically be passed via kwargs?

ExpectationMax avatar Jun 22 '24 08:06 ExpectationMax

Accumulating the context is unintentional as far as I can tell.

cpcloud avatar Aug 10 '24 12:08 cpcloud

PR incoming with the suggested change of localizing the context mapping.

cpcloud avatar Aug 10 '24 12:08 cpcloud