fix: pyarrow `unique` in `group_by` context
What type of PR is this? (check all applicable)
- [ ] 💾 Refactor
- [x] ✨ Feature
- [x] 🐛 Bug Fix
- [ ] 🔧 Optimization
- [ ] 📝 Documentation
- [ ] ✅ Test
- [ ] 🐳 Other
Related issues
May come in handy for plotly
Checklist
- [x] Code follows style guide (ruff)
- [ ] Tests added
- [ ] Documented the changes
If you have comments or can explain your changes, please do so below.
I was not able to add tests: I tried nesting a bunch of checks, but the order inside the list type is not guaranteed either. Any idea?
thanks @FBruzzesi !
I think plotly would only need to get some value from the aggregation, rather than a list dtype?
perhaps we could allow
```python
df.group_by('a').agg(nw.unique_value('b'))  # raises if there's more than 1 unique value per group
df.group_by('a').agg(nw.unique_value('b', fallback_value='(?)'))
```
Something like this could help address the `mode` issue you'd spotted in skrub, iirc they just wanted to get a single value out of the mode, right?
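For concreteness, the proposed per-group semantics could be sketched in plain Python. Note that `unique_value` and `fallback_value` are hypothetical names from the suggestion above, not an existing narwhals API:

```python
def unique_value(values, fallback_value=None):
    """Return the single unique value in `values`.

    Raises if there is more than one unique value and no fallback is given,
    mirroring the proposed nw.unique_value behaviour.
    """
    uniques = set(values)
    if len(uniques) == 1:
        return next(iter(uniques))
    if fallback_value is not None:
        return fallback_value
    raise ValueError(f"expected exactly one unique value, got {len(uniques)}")

# per-group behaviour, mimicking df.group_by('a').agg(...) on toy data
groups = {1: ["x", "x"], 2: ["y", "z"]}
agg = {key: unique_value(vals, fallback_value="(?)") for key, vals in groups.items()}
# group 1 has a single unique value; group 2 hits the fallback
```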
Not sure if this is the right place for this discussion, but here we go.
> I think plotly would only need to get some value from the aggregation, rather than a list dtype?
Yes correct!
> perhaps we could allow
>
> ```python
> df.group_by('a').agg(nw.unique_value('b'))  # raises if there's more than 1 unique value per group
> df.group_by('a').agg(nw.unique_value('b', fallback_value='(?)'))
> ```
Not the biggest fan of this if we are going to support the `nw.List` type - a list is an aggregated value, and surprisingly pandas seems to behave quite well, at least for `.unique`.
> Something like this could help address the `mode` issue you'd spotted in skrub, iirc they just wanted to get a single value out of the mode, right?
Correct again.
> surprisingly pandas seems to behave quite well
yeah but it returns object dtype and I fear that'd create more issues for us down the line
> yeah but it returns object dtype and I fear that'd create more issues for us down the line
Yes, that's not ideal, and yesterday I had issues converting to list type (e.g. `.astype('pyarrow[list]')` is not enough).
Maybe let's sleep on this, but I would imagine that someone using narwhals should just be a bit more pedantic and do:
```python
(
    df
    .group_by("a")
    .agg(nw.col("b").unique())
    .with_columns(nw.col("b").cast(nw.List(...)))  # force it to be list type
    # ... now can access the .list namespace
)
```
I'm not sure that people would think to do that explicit cast, and implementing the `.list` namespace would be quite difficult for pandas.
We may be able to take inspiration from DuckDB here, which has `any_value` as an aggregate function: https://duckdb.org/docs/sql/functions/aggregates.html#any_valuearg
```
>>> rel = duckdb.read_parquet('../scratch/assets.parquet')
>>> duckdb.sql('select symbol, any_value(date) from rel group by symbol')
┌─────────┬─────────────────┐
│ symbol  │ any_value(date) │
│ varchar │      date       │
├─────────┼─────────────────┤
│ EWJ     │ 2022-01-31      │
│ OGN     │ 2022-01-31      │
│ PRU     │ 2022-01-31      │
│ AEP     │ 2022-01-31      │
│ ALLE    │ 2022-01-31      │
│ IEFM.L  │ 2022-01-31      │
│ EWG     │ 2022-01-31      │
│ SEGA.L  │ 2022-01-31      │
│ IAU     │ 2022-01-31      │
│ XLV     │ 2022-01-31      │
│    ·    │        ·        │
│    ·    │        ·        │
│    ·    │        ·        │
│ CNC     │ 2022-01-31      │
│ CTAS    │ 2022-01-31      │
│ DG      │ 2022-01-31      │
│ IEF     │ 2022-05-31      │
│ IEMG    │ 2022-01-31      │
│ JPEA.L  │ 2022-01-31      │
│ META    │ 2022-01-31      │
│ HIGH.L  │ 2022-03-17      │
│ HST     │ 2022-01-31      │
│ VXX     │ 2022-01-31      │
├─────────┴─────────────────┤
│ 100 rows (20 shown)       │
└───────────────────────────┘
```
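DuckDB's `any_value` returns the first non-NULL value it encounters in each group. A minimal stdlib sketch of that contract, with illustrative toy data (the function name mirrors DuckDB's, everything else is made up):

```python
def any_value(values):
    # DuckDB semantics: first non-NULL value in the group, NULL if there is none
    return next((v for v in values if v is not None), None)

# toy per-group data, loosely shaped like the DuckDB output above
groups = {
    "EWJ": [None, "2022-01-31", "2022-02-28"],
    "OGN": [None, None],
}
picked = {symbol: any_value(dates) for symbol, dates in groups.items()}
# "EWJ" -> first non-null date; "OGN" -> None, since every value is null
```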
So, my thinking was that `unique_value` would be kind of like `any_value`, but it only works if there's a single unique value per group.
If it's a top-level function (`nw.unique_value`) then I think it'd be ok to depart from Polars a bit there; we have other non-Polars functions in the top-level narwhals namespace.
Just for clarity, when you say:
> Something like this could help address the `mode` issue you'd spotted in skrub, iirc they just wanted to get a single value out of the mode, right?
does it mean that `nw.unique_value('b')` can receive an expression (e.g., in the skrub case, `nw.unique_value(nw.col('b').mode())`)?
I haven't tried implementing it yet, but yes, I think so
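If `unique_value` did accept an expression, the skrub/`mode` case would reduce to "compute the mode, then assert there is exactly one". A self-contained stdlib sketch of that composition (all names hypothetical; `statistics.multimode` stands in for `nw.col('b').mode()`):

```python
from statistics import multimode

def unique_value(values, fallback_value=None):
    # hypothetical helper mirroring the proposed nw.unique_value semantics
    uniques = set(values)
    if len(uniques) == 1:
        return next(iter(uniques))
    if fallback_value is not None:
        return fallback_value
    raise ValueError("more than one unique value")

# nw.unique_value(nw.col('b').mode()) would roughly correspond to:
group = ["x", "x", "y"]
result = unique_value(multimode(group), fallback_value="(?)")
# multimode(["x", "x", "y"]) == ["x"], so result is the single modal value "x"
```

A bimodal group, e.g. `["x", "y"]`, would instead hit the fallback (or raise without one), which is exactly the skrub-style "give me one value out of the mode" behaviour.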
alternatively, we could add:
- `nw.unique_value`
- `nw.unique_mode`
Alternatively, we could have our own Agg class and do something like
```python
df.group_by('a').agg(nw.Agg.unique_mode('b'))
```
I am going to close this for now; we can always come back to it.