fix: pyarrow `unique` in `group_by` context
What type of PR is this? (check all applicable)
- [ ] 💾 Refactor
- [x] ✨ Feature
- [x] 🐛 Bug Fix
- [ ] 🔧 Optimization
- [ ] 📝 Documentation
- [ ] ✅ Test
- [ ] 🐳 Other
Related issues
May come in handy for plotly
Checklist
- [x] Code follows style guide (ruff)
- [ ] Tests added
- [ ] Documented the changes
If you have comments or can explain your changes, please do so below.
I was not able to add tests: I tried nesting a bunch of checks, but the order inside the list type is not guaranteed either. Any idea?
thanks @FBruzzesi !
I think plotly would only need to get some value from the aggregation, rather than a list dtype?
perhaps we could allow
```python
df.group_by('a').agg(nw.unique_value('b'))  # raises if there's more than 1 unique value per group
df.group_by('a').agg(nw.unique_value('b', fallback_value='(?)'))
```
Something like this could help address the `mode` issue you'd spotted in skrub, iirc they just wanted to get a single value out of the mode, right?
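For concreteness, the proposed per-group semantics could be sketched in plain Python. Note that `unique_value` and `fallback_value` are hypothetical names from the suggestion above, not an existing narwhals API:

```python
def unique_value(values, fallback_value=None):
    """Return the single unique value in `values`.

    Raises if there is more than one unique value and no fallback is given,
    mirroring the proposed nw.unique_value behaviour.
    """
    uniques = set(values)
    if len(uniques) == 1:
        return next(iter(uniques))
    if fallback_value is not None:
        return fallback_value
    raise ValueError(f"expected exactly one unique value, got {len(uniques)}")

# per-group behaviour, mimicking df.group_by('a').agg(...) on toy data
groups = {1: ["x", "x"], 2: ["y", "z"]}
agg = {key: unique_value(vals, fallback_value="(?)") for key, vals in groups.items()}
# group 1 has a single unique value; group 2 hits the fallback
```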
Not sure if this is the right place for this discussion, but here we go.
> I think plotly would only need to get some value from the aggregation, rather than a list dtype?
Yes correct!
> perhaps we could allow
>
> ```python
> df.group_by('a').agg(nw.unique_value('b'))  # raises if there's more than 1 unique value per group
> df.group_by('a').agg(nw.unique_value('b', fallback_value='(?)'))
> ```
Not the biggest fan of this if we are going to support the `nw.List` type - a list is an aggregated value, and surprisingly pandas seems to behave quite well, at least for `.unique`.
> Something like this could help address the `mode` issue you'd spotted in skrub, iirc they just wanted to get a single value out of the mode, right?
Correct again.
> surprisingly pandas seems to behave quite well
yeah but it returns object dtype and I fear that'd create more issues for us down the line
> yeah but it returns object dtype and I fear that'd create more issues for us down the line
Yes, that's not ideal, and yesterday I had issues converting to list type (e.g. `.astype('pyarrow[list]')` is not enough).
Maybe let's sleep on this, but I would imagine that someone using narwhals should just be a bit more pedantic and do:
```python
(
    df
    .group_by("a")
    .agg(nw.col("b").unique())
    .with_columns(nw.col("b").cast(nw.List(...)))  # force it to be list type
    # ... now can access the .list namespace
)
```
I'm not sure that people would think to do that explicit cast, and implementing the `.list` namespace would be quite difficult for pandas.
We may be able to take inspiration from DuckDB here, which has `any_value` as an aggregate function: https://duckdb.org/docs/sql/functions/aggregates.html#any_valuearg
```
>>> rel = duckdb.read_parquet('../scratch/assets.parquet')
>>> duckdb.sql('select symbol, any_value(date) from rel group by symbol')
┌─────────┬─────────────────┐
│ symbol  │ any_value(date) │
│ varchar │      date       │
├─────────┼─────────────────┤
│ EWJ     │ 2022-01-31      │
│ OGN     │ 2022-01-31      │
│ PRU     │ 2022-01-31      │
│ AEP     │ 2022-01-31      │
│ ALLE    │ 2022-01-31      │
│ IEFM.L  │ 2022-01-31      │
│ EWG     │ 2022-01-31      │
│ SEGA.L  │ 2022-01-31      │
│ IAU     │ 2022-01-31      │
│ XLV     │ 2022-01-31      │
│    ·    │        ·        │
│    ·    │        ·        │
│    ·    │        ·        │
│ CNC     │ 2022-01-31      │
│ CTAS    │ 2022-01-31      │
│ DG      │ 2022-01-31      │
│ IEF     │ 2022-05-31      │
│ IEMG    │ 2022-01-31      │
│ JPEA.L  │ 2022-01-31      │
│ META    │ 2022-01-31      │
│ HIGH.L  │ 2022-03-17      │
│ HST     │ 2022-01-31      │
│ VXX     │ 2022-01-31      │
├─────────┴─────────────────┤
│ 100 rows (20 shown)       │
└───────────────────────────┘
```
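DuckDB's `any_value` returns the first non-NULL value it encounters in each group. A minimal stdlib sketch of that contract, with illustrative toy data (the function name mirrors DuckDB's, everything else is made up):

```python
def any_value(values):
    # DuckDB semantics: first non-NULL value in the group, NULL if there is none
    return next((v for v in values if v is not None), None)

# toy per-group data, loosely shaped like the DuckDB output above
groups = {
    "EWJ": [None, "2022-01-31", "2022-02-28"],
    "OGN": [None, None],
}
picked = {symbol: any_value(dates) for symbol, dates in groups.items()}
# "EWJ" -> first non-null date; "OGN" -> None, since every value is null
```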
So, my thinking was that `unique_value` would be kind of like `any_value`, but it only works if there's a single unique value per group.
If it's a top-level function (`nw.unique_value`) then I think it'd be ok to depart from Polars a bit there; we have other non-Polars functions in the top-level narwhals namespace.
Just for clarity, when you say:
> Something like this could help address the `mode` issue you'd spotted in skrub, iirc they just wanted to get a single value out of the mode, right?
does it mean that `nw.unique_value('b')` can receive an expression (e.g., in the skrub case, `nw.unique_value(nw.col('b').mode())`)?
I haven't tried implementing it yet, but yes, I think so
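If `unique_value` did accept an expression, the skrub/`mode` case would reduce to "compute the mode, then assert there is exactly one". A self-contained stdlib sketch of that composition (all names hypothetical; `statistics.multimode` stands in for `nw.col('b').mode()`):

```python
from statistics import multimode

def unique_value(values, fallback_value=None):
    # hypothetical helper mirroring the proposed nw.unique_value semantics
    uniques = set(values)
    if len(uniques) == 1:
        return next(iter(uniques))
    if fallback_value is not None:
        return fallback_value
    raise ValueError("more than one unique value")

# nw.unique_value(nw.col('b').mode()) would roughly correspond to:
group = ["x", "x", "y"]
result = unique_value(multimode(group), fallback_value="(?)")
# multimode(["x", "x", "y"]) == ["x"], so result is the single modal value "x"
```

A bimodal group, e.g. `["x", "y"]`, would instead hit the fallback (or raise without one), which is exactly the skrub-style "give me one value out of the mode" behaviour.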
alternatively, we could add:
- `nw.unique_value`
- `nw.unique_mode`
Alternatively, we could have our own Agg class and do something like
```python
df.group_by('a').agg(nw.Agg.unique_mode('b'))
```
I am going to close this for now; we can always come back to it.