ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat(sql): make filter with window operation more convenient for SQL-based backends

Open emilyreff7 opened this issue 4 years ago • 3 comments

For example, the following would fail in SQL, mysql, sqlite, postgres, impala, clickhouse, spark:

  window = ibis.window(group_by=table.id)

  table = table.filter(lambda t: t['id'].mean().over(window) > 3).sort_by(
      'id'
  )

with the following type of error:

self = <[AttributeError('expr') raised in repr()] WindowOp object at 0x7f309166f210>
expr = ref_0
ClickhouseDatabaseTable[table]
    name: ibis_testing.functional_alltypes
    schema:
       date: date[non-nullable...ble
           ref_1
         where:
           None

mean = TableArrayView[float64*[
    table:
       Table: ref_2
    name:
       mean
window = <ibis.expr.window.Window object at 0x7f30916b9bd0>

       def __init__(self, expr, window):
           from ibis.expr.analysis import is_analytic
           from ibis.expr.window import propagate_down_window

           if not is_analytic(expr):
               raise com.IbisInputError(
                   'Expression does not contain a valid window operation'
               )
E             ibis.common.exceptions.IbisInputError: Expression does not contain a valid window operation.

Compare this to the behavior in the Pyspark backend, for example, which would have mean = Mean[float64] instead of mean = TableArrayView[float64*], so the window op is valid since it contains an analytic expression.

emilyreff7 avatar Mar 23 '21 15:03 emilyreff7

It's totally reasonable to try this initially, because it's pretty natural to use a function that generates a column as a filter!

This might be possible to do automatically, in any event here's a workaround:

window = ibis.window(group_by=table.id)

table = table.mutate(my_bool_col=lambda t: t['id'].mean().over(window) > 3)
table = table.filter(table.my_bool_col).sort_by('id')

cpcloud avatar Jan 12 '22 21:01 cpcloud

Removing the bug label, this is working as expected so it's a feature request to make it more convenient to filter based on the value of a window function.

cpcloud avatar Apr 19 '22 07:04 cpcloud

This can be mapped to SQL's QUALIFY syntax for example, https://duckdb.org/docs/sql/query_syntax/qualify:

SELECT 
    schema_name, 
    function_name, 
    -- In this example the function_rank column in the select clause is for reference 
    row_number() over (partition by schema_name order by function_name) as function_rank 
FROM duckdb_functions() 
QUALIFY 
    row_number() over (partition by schema_name order by function_name) < 3;

cpcloud avatar Apr 27 '22 20:04 cpcloud

This would be nice, PRs welcome!

cpcloud avatar Apr 22 '23 11:04 cpcloud