feat: add `first_value` and `last_value` metrics with and "ignore nulls" option for use in a window
Is your feature request related to a problem?
I would like to implement an efficient, pandas-compatible bfill and ffill in BigQuery DataFrames without custom operators.
What is the motivation behind your request?
To implement bfill and ffill in BigQuery DataFrames (see: https://github.com/ibis-project/ibis/issues/3652 for an issue requesting bfill and ffill in ibis), we add two new operations to Ibis: FirstNonNullValue and LastNonNullValue.
This is then used like FirstNonNullValue(column).over(window), which when the window is ibis.range_window(preceding=0, following=limit), implements a pandas-like bfill with limit parameter support.
Likewise, use LastNonNullValue(column).over(window) with a window of ibis.range_window(preceding=limit, following=0) to implement a pandas-like ffill with limit parameter support.
Describe the solution you'd like
I propose the following two new "metrics", which require a window to operate:
value.first(ignore_nulls=True).over(window)
value.last(ignore_nulls=True).over(window)
Related: IGNORE NULLS in general is tracked here: https://github.com/ibis-project/ibis/issues/7649
Previously, with string-based compilation, this was accomplished with a custom BigQuery operator like:
https://github.com/googleapis/python-bigquery-dataframes/blob/74170dabd323f1b08ad76241e37ff9f2a5b67ab5/third_party/bigframes_vendored/ibis/backends/bigquery/registry.py#L15-L17
In SQLGlot, it does appear there is a First expression with at least some IGNORE NULLS support. My work is in-progress to see if this operator is compatible with the BigQuery dialect.
What version of ibis are you running?
8.x, but working on an upgrade to 9.x right now.
What backend(s) are you using, if any?
BigQuery
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
In SQLGlot, it does appear there is a First expression with at least some IGNORE NULLS support. My work is in-progress to see if this operator is compatible with the BigQuery dialect.
I see also FirstValue which according to the comment is specifically for windowed contexts. Neither is parametrized, so I'm not sure about IGNORE NULLS support yet.
Filed https://github.com/tobymao/sqlglot/issues/3749 as I am having trouble figuring out how to create the SQLGlot expression. Other examples I found in the compiler use self.agg.avg(...), but I don't think that would account for the IGNORE NULLS part.
Update:
def visit_FirstNonNullValue(self, op, *, arg):
return sge.IgnoreNulls(this=sge.FirstValue(this=arg))
This works for the bfill implementation in BigQuery DataFrames. I imagine it should work in most SQLGlot backends. Probably makes sense to make IgnoreNulls a parameter in ibis.
Most of the team is at SciPy through the end of this week. This makes sense to me, but would love to get thoughts from @cpcloud and the other maintainers.
We should definitely add this functionality, but likely as a flag on the ops that support, which is more than just first/last, so that we don't end up with a set of classes duplicated only to allow them to set ignore_nulls.
@tswast Any chance someone from the BigFrames team wants to take a crack at it?
Note that recently we made first/last/collect uniformly ignore nulls in non-window expressions (previously it was backend-dependent). Another option is to make this also true when used in window expressions, which would both improve uniformity and avoid the need for a new flag.
🤔 Perhaps we only expose ignore_nulls on lag/lead which is what an implementation of bfill/ffill would be based on.
This feature would be useful to Ibis support in Narwhals
Note that for other backends (e.g. duckdb / pyspark) we don't use lag / lead, but last_value / first_value
Here's how we implement ffill using last_value - note that ignore_nulls is required for this to work correctly:
import duckdb
import narwhals as nw
import sqlglot
conn = duckdb.connect()
conn.sql("""CREATE TABLE prices (date DATE, price DOUBLE);""")
df = (
nw.from_native(conn.table("prices"))
.with_columns(price_filled=nw.col('price').fill_null(strategy='forward').over(order_by='date'))
)
print(sqlglot.transpile(df.to_native().sql_query(), pretty=True)[0])
SELECT
date,
price,
LAST_VALUE(price) IGNORE NULLS OVER (ORDER BY date ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS price_filled
FROM main.prices