ibis
ibis copied to clipboard
feat(sql): make filter with window operation more convenient for SQL-based backends
For example, the following would fail in SQL, mysql, sqlite, postgres, impala, clickhouse, spark:
window = ibis.window(group_by=table.id)
table = table.filter(lambda t: t['id'].mean().over(window) > 3).sort_by(
'id'
)
with the following type of error:
self = <[AttributeError('expr') raised in repr()] WindowOp object at 0x7f309166f210>
expr = ref_0
ClickhouseDatabaseTable[table]
name: ibis_testing.functional_alltypes
schema:
date: date[non-nullable...ble
ref_1
where:
None
mean = TableArrayView[float64*[
table:
Table: ref_2
name:
mean
window = <ibis.expr.window.Window object at 0x7f30916b9bd0>
def __init__(self, expr, window):
from ibis.expr.analysis import is_analytic
from ibis.expr.window import propagate_down_window
if not is_analytic(expr):
raise com.IbisInputError(
'Expression does not contain a valid window operation'
)
E ibis.common.exceptions.IbisInputError: Expression does not contain a valid window operation.
Compare this to the behavior in the Pyspark backend, for example, which would have
mean = Mean[float64]
instead of
mean = TableArrayView[float64*],
so the window op is valid since it contains an analytic expression.
It's totally reasonable to try this initially, because it's pretty natural to use a function that generates a column as a filter!
This might be possible to do automatically, in any event here's a workaround:
window = ibis.window(group_by=table.id)
table = table.mutate(my_bool_col=lambda t: t['id'].mean().over(window) > 3)
table = table.filter(table.my_bool_col).sort_by('id')
Removing the bug label, this is working as expected so it's a feature request to make it more convenient to filter based on the value of a window function.
This can be mapped to SQL's QUALIFY syntax for example, https://duckdb.org/docs/sql/query_syntax/qualify:
SELECT
schema_name,
function_name,
-- In this example the function_rank column in the select clause is for reference
row_number() over (partition by schema_name order by function_name) as function_rank
FROM duckdb_functions()
QUALIFY
row_number() over (partition by schema_name order by function_name) < 3;
This would be nice, PRs welcome!