[Enh]: Allow `LazyFrame.with_row_index(..., order_by=...)`
Description
Proposing to allow the following: `LazyFrame.with_row_index(name, order_by=into_exprs)`
This would be a deviation from polars itself. In a way this is similar to #2174
Suggested implementation
I am missing some detail for sure, but it would be something close to:
self.with_columns(**{name: nw.lit(1).cum_sum().over(order_by=order_by) - 1})
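As a rough sanity check of the idea in plain polars (data and column name are just illustrative):

import polars as pl

lf = pl.LazyFrame({"a": [7, 9, 8]})
# a cumulative sum of ones over the ordering, minus 1, yields a 0-based row index
print(lf.with_columns(index=pl.lit(1).cum_sum().over(pl.lit(1), order_by="a") - 1).collect())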
@FBruzzesi this is news to me, seems like polars added LazyFrame.with_row_index around 0.20.*
- https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.with_row_index.html
I've only found it mentioned in relation to the new streaming engine in issues.
The last time I tried mixing an index with anything LazyFrame, I always found it much more reliable to just do:
LazyFrame.collect().with_row_index().lazy()
However, things may have improved and there might be stronger ordering guarantees
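For reference, a minimal runnable version of that round trip in plain polars (data just illustrative):

import polars as pl

lf = pl.LazyFrame({"a": [7, 9, 8]})
# materialise, attach the index eagerly, then go back to lazy
lf_indexed = lf.collect().with_row_index().lazy()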
out of interest, what's the use-case?
i'd be more inclined to deprecate this for lazyframe, just having it around feels like it's got the potential to be misused
out of interest, what's the use-case?
+1
i'd be more inclined to deprecate this for lazyframe, just having it around feels like it's got the potential to be misused
Huh, I didn't know we supported it yet for nw.LazyFrame 🤔
I agree though on the potential for misuse
out of interest, what's the use-case?
I sporadically check if some code I have for personal or work use can be narwhalified. This is one of the features that we have, but basically it's supported only by polars (or an eager frame cast to lazy)
just having it around feels like it's got the potential to be misused
Not sure why that would be different from methods that require .over(order_by). But yes, I would rather deprecate it than keep it for polars only
Thinking about this further, I think the "subset of Polars" way to do this would be to introduce nw.int_range, and then use that with over
So my suggestion would be:
- deprecate `LazyFrame.with_row_index`
- introduce `nw.int_range`, which is a window function, meaning that for `LazyFrame` it needs to be followed by `over(..., order_by=...)` (rough sketch below)
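For illustration, usage under this proposal might look roughly like the following; note that nw.int_range is hypothetical and does not exist in Narwhals yet:

import narwhals as nw
import polars as pl

lf = nw.from_native(pl.LazyFrame({"a": [7, 9, 8]}))
# hypothetical: int_range as a window function, which a LazyFrame must order explicitly
lf = lf.with_columns(row_index=nw.int_range(nw.len()).over(order_by="a"))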
Does this sound good? If so, @FBruzzesi would it interest you to try this out?
I've been waiting for someone to suggest supporting any of the pl.*_range functions 🎉
@MarcoGorelli could you burst my bubble and let me know that only nw.int_range is a realistic option?
😄 which one(s) did you want to avoid?
😄 which one(s) did you want to avoid?
Avoid? Did you mean add?
I'd be interested in the temporal ones, like pl.date_range - since we'd need to start thinking more about duration string language
Realistically though, those worms may be best left canned 😉
that's a good one
i'd like to avoid the horizontal ones for now (int_ranges, date_ranges, ...)
Does this sound good? If so, @FBruzzesi would it interest you to try this out?
Yes it definitely does! I will try to tackle it - if you don't see anything by end of week it might mean I didn't have much time 😅
Update: while this is straightforward for eager backends, for lazy ones int_range is considerably more complex than a simple row_index. However, I don't really see a point in supporting int_range for eager backends only, since this issue started with the request to support something for lazy ones 🥲
Based on the following statement:
if you don't see anything by end of week it might mean I didn't have much time 😅
you can now assume I didn't finish this in time 😂
https://github.com/narwhals-dev/narwhals/issues/2307#issuecomment-2849307655
@FBruzzesi what were some of the issues you ran into for lazy backends?
My, probably naive, thinking was that row_number() would be what we're mapping to?
Hey @dangotbanned yes sorry for not expanding further.
There are various approaches that could work, but in practice int_range accepts arbitrary start and end values. For lazy backends, the only end value which makes sense is start + nw.len() * step (unless someone really knows for certain how many rows an operation will end up with?).
Also, row_number in a window context only works out of the box for the default case; anything more general needs additional arithmetic ops (something like start + row_number() * step).
So, from what I can see, the only out-of-the-box implementation is start=0, end=nw.len(), step=1, which, yes, can be done via row_number.
However, I would not be very excited to implement int_range if the param values must be fixed - what do you think?
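To make the arithmetic concrete, here is a rough polars sketch of recovering a general start/step from a plain 0-based row number (values chosen arbitrarily):

import polars as pl

start, step = 5, 2
df = pl.DataFrame({"a": [7, 9, 8]})
# start + (0-based counter) * step reproduces int_range(start, start + len * step, step)
df = df.with_row_index("rn").with_columns(idx=start + pl.col("rn").cast(pl.Int64) * step).drop("rn")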
https://github.com/narwhals-dev/narwhals/issues/2307#issuecomment-2853847453
Thanks for explaining @FBruzzesi
This isn't a response, but I just spotted something which might be useful - even if it just gives us an idea of what support might look like in different backends
- https://github.com/ibis-project/ibis/blob/9ef7ca7d3dc91ddffb54514579792c7c9f9ab6b2/ibis/expr/api.py#L2270-L2423
Thinking about this further, isn't .rank('ordinal').over(order_by=...) all we need?
In [26]: df = pl.DataFrame({'a': [7, 9, 8, 8, 6, 9]})
In [27]: df.with_columns(row_index=pl.col('a').rank('ordinal')-1)
Out[27]:
shape: (6, 2)
┌─────┬───────────┐
│ a ┆ row_index │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞═════╪═══════════╡
│ 7 ┆ 1 │
│ 9 ┆ 4 │
│ 8 ┆ 2 │
│ 8 ┆ 3 │
│ 6 ┆ 0 │
│ 9 ┆ 5 │
└─────┴───────────┘
In [28]: df.with_columns(row_index=pl.int_range(pl.len()).over(pl.lit(1), order_by='a'))
Out[28]:
shape: (6, 2)
┌─────┬───────────┐
│ a ┆ row_index │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═══════════╡
│ 7 ┆ 1 │
│ 9 ┆ 4 │
│ 8 ┆ 2 │
│ 8 ┆ 3 │
│ 6 ┆ 0 │
│ 9 ┆ 5 │
└─────┴───────────┘
It's cleaner, and also gives a chance to teach people about rank in the error message, which can only be a good thing IMO (rank seems to be severely underused!)
rank('ordinal') also happens to already be implemented in Narwhals 😎
Correction: rank('ordinal') would result in None for original null elements, so it's not really a full replacement
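A quick polars check of that caveat (toy data):

import polars as pl

df = pl.DataFrame({"a": [7, None, 8]})
# the null row gets a null rank, so rank('ordinal') - 1 is not a valid row index for it
print(df.with_columns(row_index=pl.col("a").rank("ordinal") - 1))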
Should we split the nw.<dtype>_range functions into a separate issue?
rank("ordinal") sounds interesting and seems to be using row_number in SparkLikeExpr
https://github.com/narwhals-dev/narwhals/blob/f4c9a20f666be10d1209660e7d636deff47345a4/narwhals/_spark_like/expr.py#L795-L796
We're missing an impl for DaskExpr though
https://github.com/narwhals-dev/narwhals/blob/f4c9a20f666be10d1209660e7d636deff47345a4/narwhals/_dask/expr.py#L694
The workaround exists, it's enough for row_index, but not for int_range.
The proposal here could be to have a doc page in which we add some functionality which is generally useful but not officially supported - because it's not in the API (and not necessarily tested). I think I mentioned something similar in #267
In this case:
import narwhals as nw
from narwhals.typing import FrameT

def with_row_index(frame: FrameT, name: str, order_by: str) -> FrameT:
    return (
        frame
        .with_columns(**{name: nw.lit(1)})
        .with_columns(**{name: nw.col(name).cum_sum().over(order_by=order_by) - 1})
    )
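Rough usage sketch of the helper above with a polars LazyFrame backend (assuming a Narwhals version where over(order_by=...) works on lazy frames):

import polars as pl

lf = nw.from_native(pl.LazyFrame({"a": [7, 9, 8]}))
result = with_row_index(lf, name="index", order_by="a")
print(result.to_native().collect())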
thinking about this further, I think the original proposal of having order_by be an argument of {LazyFrame, DataFrame}.with_row_index is fine. There should probably also be partition_by then
For DataFrame they'd be optional, for LazyFrame order_by would be required
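A hypothetical signature sketch for that; parameter names and defaults here are illustrative only, not a settled API:

from collections.abc import Sequence
from typing import Union

def with_row_index(
    self,
    name: str = "index",
    *,
    order_by: Union[str, Sequence[str], None] = None,      # would be required on LazyFrame
    partition_by: Union[str, Sequence[str], None] = None,   # optional, could come later
):
    ...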
hey @FBruzzesi - do you have a solution for
The workaround exists, it's enough for row_index, but not for int_range.
?
I think if you have something that allows us to do LazyFrame.with_row_index(..., order_by=...), then that's all we need for now. After having gone round in circles, i'm persuaded enough by your original suggestion
@MarcoGorelli
hey @FBruzzesi - do you have a solution for
The workaround exists, it's enough for row_index, but not for int_range. ?
the following for lazy backends (adapted to be a method):
def with_row_index(frame: FrameT, name: str, order_by: str) -> FrameT:
    return (
        frame
        .with_columns(**{name: nw.lit(1)})
        .with_columns(**{name: nw.col(name).cum_sum().over(order_by=order_by) - 1})
    )
Just for clarity:
- Should we allow for a `partition_by` as well? If so, I think we need branching for polars, but that's fine
- Do you have another idea in mind?
thanks!
Should we allow for a partition_by as well? If so, I think we need branching for polars, but that's fine
this wouldn't need to be required, so ok to leave it out for now
Should we allow for a partition_by as well? If so, I think we need branching for polars, but that's fine
i think for many backends it would be more performant to use row_number() over (partition by ... order by ...) directly, rather than doing 2 passes. could we do that, and use the cum_sum trick for backends where that wouldn't be possible?
in pandas this is groupby(...).cumcount()
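For reference, a small pandas sketch of that approach (column names are illustrative):

import pandas as pd

df = pd.DataFrame({"g": ["x", "x", "y"], "a": [9, 7, 8]})
# per-partition row number following the given ordering; assignment aligns back on the index
df["row_index"] = df.sort_values("a").groupby("g").cumcount()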
almost working for pyarrow (requires handling null values, and grouping by more than 1 array):
In [61]: def func(arr):
    ...:     # assumes `import pyarrow as pa` and `import pyarrow.compute as pc`
    ...:     import numpy as np
    ...:     # True wherever the (pre-sorted) key changes value, i.e. a new group starts
    ...:     group_changes = pa.concat_arrays(
    ...:         [
    ...:             pa.array([True]),
    ...:             pc.invert(pc.equal(arr[:-1], arr[1:])).combine_chunks(),
    ...:         ]
    ...:     ).cast(pa.int64())
    ...:     # the cumulative sum of the change markers gives each row its group id
    ...:     group_ids = np.cumsum(group_changes)
    ...:     # subtract each group's starting position to get a within-group counter
    ...:     result = np.arange(len(arr)) - np.where(group_changes)[0][group_ids - 1]
    ...:     return pa.array(result)
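For instance, calling it on an already-sorted key column (with pyarrow imported as pa and pyarrow.compute as pc, as assumed above) should give a per-group counter:

keys = pa.chunked_array([["x", "x", "y", "y", "y"]])
print(func(keys))  # expected: [0, 1, 0, 1, 2]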
@MarcoGorelli are you working on this? I also started 🙈
Also:
in pandas this is groupby(...).cumcount()
wouldn't this be useful for partition_by? I understood that we'd first implement order_by only, and maybe later allow for partition_by
From:
this wouldn't need to be required, so ok to leave it out for now
i haven't started on this, no, i was just curious whether pandas had a native way to do it
yup that's right, not necessary to have partition_by / use groupby(...).cumcount() yet, sorry for misleading!