[Enh]: Add `Expr.any_value`?

Open danielgafni opened this issue 1 month ago • 4 comments

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

https://github.com/anam-org/metaxy

Please describe the purpose of the new feature or describe the problem to solve.

I would like to perform a group by on a DataFrame and take a random (any) row from each group.

I could have used `df.group_by().agg(nw.all().first())`, since the rows within each group are not sorted anyway, but this currently fails with:

E           narwhals.exceptions.InvalidOperationError: Order-dependent expressions are not supported for use in LazyFrame.
E
E           Hint: To make the expression valid, use `.over` with `order_by` specified.
E
E           For example, if you wrote `nw.col('price').cum_sum()` and you have a column
E           `'date'` which orders your data, then replace:
E
E              nw.col('price').cum_sum()
E
E            with:
E
E              nw.col('price').cum_sum().over(order_by='date')
E                                       ^^^^^^^^^^^^^^^^^^^^^^
E
E           See https://narwhals-dev.github.io/narwhals/concepts/order_dependence/.

In my case performing a sort would introduce a performance penalty, so I would like to avoid it.

It seems the best I can do for now is to use `.is_first_distinct()` over a dummy column:

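from collections.abc import Sequence

import narwhals as nw
from narwhals.typing import FrameT


def group_and_take_any(df: FrameT, cols: Sequence[str]) -> FrameT:
    # Every row shares the same constant "_dummy" value, so `order_by="_dummy"`
    # satisfies the order_by requirement without sorting on a real column.
    return (
        df.with_columns(nw.lit(True).alias("_dummy"))
        .filter(
            nw.col("_dummy")
            .is_first_distinct()
            .over(*cols, order_by="_dummy")
        )
        .drop("_dummy")
    )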
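
To make the requested semantics concrete, here is a minimal pure-Python sketch (not Narwhals code) of what an "any value" aggregation would be expected to return: one arbitrary row per group, found in a single pass with no sort. The name `take_any_per_group` and the list-of-dicts row representation are assumptions made purely for illustration.

```python
from collections.abc import Hashable, Sequence
from typing import Any

Row = dict[str, Any]


def take_any_per_group(rows: Sequence[Row], cols: Sequence[str]) -> list[Row]:
    """Return one arbitrary row for each distinct combination of `cols`."""
    chosen: dict[tuple[Hashable, ...], Row] = {}
    for row in rows:
        key = tuple(row[c] for c in cols)
        # Keep whichever row happens to be seen first for this key.
        # Callers should not rely on which row that is.
        chosen.setdefault(key, row)
    return list(chosen.values())


rows = [
    {"a": 1, "b": 4},
    {"a": 1, "b": 5},
    {"a": 2, "b": 6},
]
result = take_any_per_group(rows, ["a"])
```

The point of the sketch is that no ordering is needed: the result has exactly one row per group, and any row from the group is an acceptable answer.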

Suggest a solution if possible.

No response

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

No response

danielgafni avatar Nov 10 '25 12:11 danielgafni

By the way, it looks like `having` within a GroupBy will be available in Polars in the future:

https://github.com/pola-rs/polars/pull/23550

danielgafni avatar Nov 10 '25 12:11 danielgafni

thanks for the request!

I like the idea of having any_value, even though it's not available in Polars

MarcoGorelli avatar Nov 10 '25 13:11 MarcoGorelli

Wouldn't `Expr.sample(n=1)` be what we are after, without having to diverge from the Polars API?

FBruzzesi avatar Nov 15 '25 12:11 FBruzzesi

that doesn't really behave like an aggregation though

In [9]: df = pl.DataFrame({'a': [1,1,2], 'b': [4,5,6]})

In [10]: df.group_by('a').agg(pl.col('b').sample(n=1))
Out[10]:
shape: (2, 2)
┌─────┬───────────┐
│ a   ┆ b         │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 1   ┆ [5]       │
│ 2   ┆ [6]       │
└─────┴───────────┘

MarcoGorelli avatar Nov 15 '25 12:11 MarcoGorelli
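
The difference shown in the output above (a `list[i64]` column rather than a scalar per group) can be mimicked in plain Python. This is only a sketch of the shape of the results; it does not call Polars or Narwhals, and the variable names are hypothetical.

```python
import random
from collections import defaultdict

data = {"a": [1, 1, 2], "b": [4, 5, 6]}

# Collect the values of "b" for each distinct value of "a".
groups: dict[int, list[int]] = defaultdict(list)
for key, value in zip(data["a"], data["b"]):
    groups[key].append(value)

rng = random.Random(0)

# Sample-like behaviour: each group yields a list, here of length one.
sampled = {key: rng.sample(values, 1) for key, values in groups.items()}

# Aggregation-like behaviour: each group yields a single scalar.
any_value = {key: values[0] for key, values in groups.items()}
```

Here `sampled` maps each key to a one-element list while `any_value` maps each key to a bare integer, which is the distinction between a per-group sample and a true aggregation.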