narwhals icon indicating copy to clipboard operation
narwhals copied to clipboard

[Enh]: Allow `LazyFrame.with_row_index(..., order_by=...)`

Open FBruzzesi opened this issue 9 months ago • 19 comments

Description

Proposing to allow the following: LazyFrame.with_row_index(name, order_by=into_exprs)

This would be a deviation from polars itself. In a way this is similar to #2174

Suggested implementation

I am missing some detail for sure, but it would be something close to:

self.with_column(**{name: nw.lit(1).cum_sum().over(order_by=order_by) - 1})

FBruzzesi avatar Mar 28 '25 13:03 FBruzzesi

@FBruzzesi this is news to me, seems like polars added LazyFrame.with_row_index around 0.20.*

  • https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.with_row_index.html

I've only found it mentioned in relation to the new streaming engine in issues.

The last time I tried mixing an index with anything LazyFrame, I always found it much more reliable to just do:

LazyFrame.collect().with_row_index().lazy()

However things may have improved and it there might be stronger ordering guarantees

dangotbanned avatar Apr 04 '25 12:04 dangotbanned

out of interest, what's the use-case?

i'd be more inclined to deprecate this for lazyframe, just having it around feels like it's got the potential to be misued

MarcoGorelli avatar Apr 04 '25 13:04 MarcoGorelli

out of interest, what's the use-case?

+1

i'd be more inclined to deprecate this for lazyframe, just having it around feels like it's got the potential to be misued

Huh, I didn't know we supported it yet for nw.LazyFrame 🤔 I agree though on the potential for misuse

dangotbanned avatar Apr 04 '25 13:04 dangotbanned

out of interest, what's the use-case?

I sporadically check if some code I have for personal or work use can be narwhalified. This is one of the feature that we have but basically it's supported only by polars (or eager frame casted to lazy)

just having it around feels like it's got the potential to be misued

Not sure why that would be different than methods that require .over(order_by). But yes I would rather deprecate it than keeping it for polars only

FBruzzesi avatar Apr 04 '25 13:04 FBruzzesi

Thinking about this further, I think the "subset of Polars" way to do this would be to introduce nw.int_range, and then use that with over

So my suggestion would be:

  • deprecate LazyFrame.with_row_index
  • introduce nw.int_range, which is a window function, meaning that for LazyFrame it needs to be followed by over(..., order_by=...)

Does this sound good? If so, @FBruzzesi did it interest you to try this out?

MarcoGorelli avatar Apr 26 '25 16:04 MarcoGorelli

I've been waiting for someone to suggest supporting any of the pl.*_range functions 🎉

@MarcoGorelli could you burst my bubble and let me know that only nw.int_range is a realistic option?

dangotbanned avatar Apr 26 '25 16:04 dangotbanned

😄 which one(s) did you want to avoid?

MarcoGorelli avatar Apr 26 '25 16:04 MarcoGorelli

😄 which one(s) did you want to avoid?

Avoid? Did you mean add?

I'd be interested in the temporal ones, like pl.date_range - since we'd need to start thinking more about duration string language
Realistically though, those worms may be best left canned 😉

dangotbanned avatar Apr 26 '25 17:04 dangotbanned

that's a good one

i'd like to avoid the horizontal ones for now (int_ranges, date_ranges, ...)

MarcoGorelli avatar Apr 26 '25 18:04 MarcoGorelli

Does this sound good? If so, @FBruzzesi did it interest you to try this out?

Yes it definitly does! I will try to tackle it - if you don't see anything by end of week it might mean I had no much time 😅

FBruzzesi avatar Apr 28 '25 07:04 FBruzzesi

Update: while this is straightforward for eager backends, for lazy ones int_range is fairly more complex than a simple row_index. However I don't really see a point in supporting int_range for eager backends only since this issue started with the request of supporting something for lazy ones 🥲

Based on the following statement:

if you don't see anything by end of week it might mean I had no much time 😅

you can now assume I didn't finish this in time 😂

FBruzzesi avatar May 04 '25 16:05 FBruzzesi

https://github.com/narwhals-dev/narwhals/issues/2307#issuecomment-2849307655

@FBruzzesi what were some of the issues you ran into for lazy backends?

My, probably naive, thinking was that row_number() would be what we're mapping to?

dangotbanned avatar May 05 '25 20:05 dangotbanned

Hey @dangotbanned yes sorry for not expanding further. There are various approaches that could work, but in practice int_range can allow both start and end values. For lazy backends, the only end value which makes sense is start + nw.len() * step (unless someone is really certain to know how many rows an operation will end up with?).

Also, row_number in window context will easily work without additional arithmetic ops (such as start + row_number() * step).

So, from what I can see, the only out of the box implementation is start=0, end=nw.len(), step=1, which then yes, it can be done via row_number.

However, I would not be very excited to implement int_range if the param values must be fixed - what do you think?

FBruzzesi avatar May 06 '25 09:05 FBruzzesi

https://github.com/narwhals-dev/narwhals/issues/2307#issuecomment-2853847453

Thanks for explaining @FBruzzesi

This isn't a response, but just spotted something which might be useful? Even if it just gives us an idea of what support might look like in different backends

  • https://github.com/ibis-project/ibis/blob/9ef7ca7d3dc91ddffb54514579792c7c9f9ab6b2/ibis/expr/api.py#L2270-L2423

dangotbanned avatar May 06 '25 16:05 dangotbanned

Thinking about this further, isn't .rank('ordinal').over(order_by=...) all we need?

In [26]: df = pl.DataFrame({'a': [7, 9, 8, 8, 6, 9]})

In [27]: df.with_columns(row_index=pl.col('a').rank('ordinal')-1)
Out[27]:
shape: (6, 2)
┌─────┬───────────┐
│ a   ┆ row_index │
│ --- ┆ ---       │
│ i64 ┆ u32       │
╞═════╪═══════════╡
│ 7   ┆ 1         │
│ 9   ┆ 4         │
│ 8   ┆ 2         │
│ 8   ┆ 3         │
│ 6   ┆ 0         │
│ 9   ┆ 5         │
└─────┴───────────┘

In [28]: df.with_columns(row_index=pl.int_range(pl.len()).over(pl.lit(1), order_by='a'))
Out[28]:
shape: (6, 2)
┌─────┬───────────┐
│ a   ┆ row_index │
│ --- ┆ ---       │
│ i64 ┆ i64       │
╞═════╪═══════════╡
│ 7   ┆ 1         │
│ 9   ┆ 4         │
│ 8   ┆ 2         │
│ 8   ┆ 3         │
│ 6   ┆ 0         │
│ 9   ┆ 5         │
└─────┴───────────┘

It's cleaner, and also gives a chance to teach people about rank in the error message, which can only be a good thing IMO (rank seems to be severly underused!)


rank('ordinal') also happens to already be implemented in Narwhals 😎

MarcoGorelli avatar May 07 '25 09:05 MarcoGorelli

Correction: rank('ordinal') would result in None for original null elements, so it's not really a full replacement

MarcoGorelli avatar May 07 '25 10:05 MarcoGorelli

Should we split the nw.<dtype>_range functions into a separate issue?

rank("ordinal") sounds interesting and seems to be using row_number in SparkLikeExpr

https://github.com/narwhals-dev/narwhals/blob/f4c9a20f666be10d1209660e7d636deff47345a4/narwhals/_spark_like/expr.py#L795-L796

We're missing an impl for DaskExpr though

https://github.com/narwhals-dev/narwhals/blob/f4c9a20f666be10d1209660e7d636deff47345a4/narwhals/_dask/expr.py#L694

dangotbanned avatar May 07 '25 10:05 dangotbanned

The workaround exists, it's enough for row_index, but not for int_range.

The proposal here could be to have a doc page in which we add some functionality which are generally useful, but not officially supported - because not in the API (and not necessarily tested). I think I mentioned something similar in #267

In this case:

def with_row_index(frame: FrameT, name: str, order_by: str) -> FrameT:
    return (
        frame
        .with_columns(**{name: nw.lit(1)})
        .with_columns(**{name: nw.col(name).cum_sum().over(order_by=order_by) - 1})
    )    

FBruzzesi avatar May 07 '25 21:05 FBruzzesi

thinking about this further, I think the original proposal of having order_by be an argument of {LazyFrame, DataFrame}.with_row_index is fine. There should probably also be partition_by then

For DataFrame they'd be optional, for LazyFrame order_by would be required

MarcoGorelli avatar May 11 '25 09:05 MarcoGorelli

hey @FBruzzesi - do you have a solution for

The workaround exists, it's enough for row_index, but not for int_range.

?

I think if you have something that allows us to do LazyFrame.with_row_index(..., order_by=...), then that's all we need for now. After having gone round in circle, i'm persuaded enough by your original suggestion

MarcoGorelli avatar Jun 13 '25 15:06 MarcoGorelli

@MarcoGorelli

hey @FBruzzesi - do you have a solution for

The workaround exists, it's enough for row_index, but not for int_range. ?

the following for lazy backends (adapted to be a method):

def with_row_index(frame: FrameT, name: str, order_by: str) -> FrameT:
    return (
        frame
        .with_columns(**{name: nw.lit(1)})
        .with_columns(**{name: nw.col(name).cum_sum().over(order_by=order_by) - 1})
    )

Just for clarity:

  • Should we allow for a partition_by as well? If so, I think we need branching for polars, but that's fine
  • Do you have another idea in mind?

FBruzzesi avatar Jun 14 '25 07:06 FBruzzesi

thanks!

Should we allow for a partition_by as well? If so, I think we need branching for polars, but that's fine

this wouldn't need to be required, so ok to leave it out for now

Should we allow for a partition_by as well? If so, I think we need branching for polars, but that's fine

i think for many backends it would be more performant to use row_number() over (partition by ... order by ...) directly, rather than doing 2 passes. could we do that, and use the cum_sum trick for backends where that wouldn't be possible?

MarcoGorelli avatar Jun 14 '25 07:06 MarcoGorelli

in pandas this is groupby(...).cumcount()


almost working for pyarrow (requires handling null values, and grouping by more than 1 array):

In [61]: def func(arr):
    ...:     import numpy as np
    ...:     group_changes = pa.concat_arrays(
    ...:         [
    ...:              pa.array([True]),
    ...:              pc.invert(pc.equal(arr[:-1], arr[1:])).combine_chunks()
    ...:         ]
    ...:     ).cast(pa.int64())
    ...:     group_ids = np.cumsum(group_changes)
    ...:     result = np.arange(len(arr)) - np.where(group_changes)[0][group_ids - 1]
    ...:
    ...:     return pa.array(result)

MarcoGorelli avatar Jun 14 '25 18:06 MarcoGorelli

@MarcoGorelli are you working on this? I also started 🙈

Also:

in pandas this is groupby(...).cumcount()

wouldn't this be useful for partition_by? I understood to first implement order_by only, and maybe later to allow for partition_by

From:

this wouldn't need to be required, so ok to leave it out for now

FBruzzesi avatar Jun 15 '25 13:06 FBruzzesi

i haven't started on this, no, i was just curious about if pandas had a native way to do it

yup that's right, not necessary to have partition_by / use groupby(...).cumcount() yet, sorry for misleading!

MarcoGorelli avatar Jun 15 '25 13:06 MarcoGorelli