# Unique / Duplicate Enhancements
## Problem description
This is an extension to #5590, with some examples and explanations.
## Problems / Issues
- `unique`: There is no way to remove ALL duplicates like in pandas (currently it is only possible to keep the `first`/`last` of duplicate rows)
- `unique`/`is_unique`: There is no way to get the same result using `unique` and filtering with `is_unique` if duplicates are present, which is really awkward
- `duplicate`: There is no concise/semantic way to get the duplicate rows. `duplicate` should be added for completeness (`is_unique` + `unique` are available, but only `is_duplicated`)
- `is_duplicated`: There is no `keep` argument like in pandas `duplicated` to specify which duplicates to mark
- There is no fast/elegant way to get a boolean mask of a column containing lists with duplicates. `has_duplicates` would be nice on `Expr.arr` (currently using a workaround comparing the `unique` length with the original length)
### 1. `unique`: add another `keep` option
- polars `unique` cannot remove ALL duplicates from the original data; it only has the ability to keep the first or last duplicate row
```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 1]})
df.unique(keep="first")
# > [1, 2, 3]
df.unique(keep="last")
# > [2, 3, 1]
```
- I would like to be able to remove all duplicates
- pandas has `drop_duplicates` with `keep=False`
- something like the following would be nice:
```python
df.unique(keep="none")  # doesn't exist
# > [2, 3]
```
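For reference, the `keep="none"` behavior can already be emulated with existing expressions. A minimal sketch (the window-count spelling `pl.count().over(...)` is an assumption based on polars versions from around this issue; newer versions spell it `pl.len()`):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 1]})

# keep only rows whose value occurs exactly once
df.filter(pl.col("a").is_unique())
# > [2, 3]

# the same idea as a window count, which extends naturally
# to several columns the way a subset= argument would
df.filter(pl.count().over("a") == 1)
# > [2, 3]
```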
### 2. `unique` and `is_unique`: inconsistent results
- polars `unique` and `is_unique` NEVER return the same result if duplicates are present, which feels awkward:
```python
df = pl.DataFrame({"a": [1, 2, 3, 1]})
df.unique(keep="first")
# > [1, 2, 3]
df.unique(keep="last")
# > [2, 3, 1]
df.filter(pl.col("a").is_unique())
# > [2, 3]
```
- `is_unique` should also have the `keep` argument (`first`/`last`/`none`)
- `keep="none"` would be the current behavior of `is_unique`
- `keep="first"`/`keep="last"` is like reading a book and marking the first/last time you see a word
```python
df.is_unique(keep="none")  # default / current behavior
# > [False, True, True, False]
df.is_unique(keep="first")
# > [True, True, True, False]
df.is_unique(keep="last")
# > [False, True, True, True]
```
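Worth noting: the proposed `keep="none"` and `keep="first"` masks line up with expressions polars already has. A sketch using `is_first` (which comes up later in this thread); `keep="last"` would be the symmetric case:

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 1]})

df.select(pl.col("a").is_unique())  # proposed keep="none":  [False, True, True, False]
df.select(pl.col("a").is_first())   # proposed keep="first": [True, True, True, False]
```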
### 3. `duplicate` method
- with `is_unique` + `unique` + `is_duplicated` available, it feels like `duplicate` is missing
- of course this could be achieved with the available methods, but having a clean, concise and consistent API is very important imo
- `duplicate` should have the same `subset` and `keep` arguments as `unique`
```python
df = pl.DataFrame({"a": [1, 2, 3, 2, 1]})
df.duplicate(keep="all")  # proposed; doesn't exist yet
# > [1, 2, 2, 1]
df.duplicate(keep="first")
# > [1, 2]
df.duplicate(keep="last")
# > [2, 1]
```
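For comparison, the `keep="all"` case is already expressible today by filtering on `is_duplicated`; a sketch:

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 2, 1]})

# keep every row whose value occurs more than once
df.filter(pl.col("a").is_duplicated())
# > [1, 2, 2, 1]
```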
### 4. `is_duplicated`: add `keep` argument
- polars `is_duplicated` is missing the `keep` argument
- current behavior:
```python
df = pl.DataFrame({"a": [1, 2, 3, 2, 1]})
df.is_duplicated()
# > [True, True, False, True, True]
```
- adding the `keep` argument:
```python
df.is_duplicated(keep="all")  # default
# > [True, True, False, True, True]
df.is_duplicated(keep="first")
# > [True, True, False, False, False]
df.is_duplicated(keep="last")
# > [False, False, False, True, True]
```
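The proposed `keep="first"` mask could likewise be assembled today by combining `is_duplicated` with `is_first` (a sketch; `is_first` is the first-occurrence expression mentioned further down in this thread):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 2, 1]})

# mark only the first occurrence of each value that is duplicated anywhere
df.select(pl.col("a").is_duplicated() & pl.col("a").is_first())
# > [True, True, False, False, False]
```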
### 5. `has_duplicates` on `Expr.arr`
- currently filtering columns of lists for duplicates is a bit awkward (maybe there is a better way?)
```python
df_arr = pl.DataFrame({"a": [[1, 2, 1], [3, 4, 5], [6, 6]]})
df_arr.filter(
    pl.col("a").arr.lengths()
    != pl.col("a").arr.unique().arr.lengths()
)
# > [[1, 2, 1], [6, 6]]
```
- `has_duplicates` would be nice to have on `Expr.arr`
- this could also use short-circuiting and be much faster
```python
df_arr.filter(
    pl.col("a").arr.has_duplicates()  # proposed; doesn't exist yet
)
```
## Summary: Feature Requests
- `unique`: add `keep="none"` (or `None`/`False`) to remove all duplicates (available in pandas `drop_duplicates`)
- `is_unique`: add a `keep` argument to match `unique` and get consistent results
- `duplicate`: add this method to complement `is_unique`, `unique` and `is_duplicated`
- `is_duplicated`: add a `keep` argument to match `unique` (available in pandas `duplicated`)
- `Expr.arr.has_duplicates`: add this to get a fast/efficient boolean mask if a list contains duplicates
We can filter by columns that are unique with `is_unique`:
```python
df = pl.DataFrame({"a": [1, 2, 3, 1]})
df.filter(pl.col("a").is_unique())
```
We changed the name from `distinct` to `unique`. This discussion has been had before; maybe `distinct` is a better name. I don't think `is_unique` and `distinct` should have the same results. Maybe we should name the method `distinct` to prevent this semantic confusion.
`is_unique` should only have one answer: whether a column value is unique, meaning there is only one in that set. If you want the first, you combine `is_unique` and `is_first`; that's the whole idea of expressions: reduce API surface so that you can combine and cherry-pick the logic you need.

The same logic applies to `is_duplicated`; it should only have a yes or no answer.
> - `Expr.arr.has_duplicates`: add this to get a fast/efficient boolean mask if a list contains duplicates

Could you make a separate issue for this one? I think we should add this.

> - `unique`: add `keep="none"` (or `None`/`False`) to remove all duplicates (available in pandas `drop_duplicates`)

I think we can add this option :+1:
> `is_unique` should only have one answer, if a column value is unique
How can I replicate the `subset=` argument of `unique()` using `is_first`, i.e. evaluate uniqueness across multiple columns? As of now, when I pass multiple columns into `is_first`, it evaluates them independently from each other and returns multiple columns. I could shoehorn something using `concat_str`, but that is obviously not as efficient as `.unique()`.
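For illustration, the `concat_str` shoehorn mentioned above might look like the following (a sketch; the separator keyword has been spelled both `sep` and `separator` across polars versions):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 11, 1], "b": ["1x", "x", "1x"]})

# fuse the subset columns into one string key, then mark first occurrences;
# the separator avoids collisions such as ("1", "1x") vs ("11", "x")
df.filter(pl.concat_str([pl.col("a"), pl.col("b")], sep="|").is_first())
```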
I've been scratching my head about this one.
So have I, to be honest. It's been on my to-do list to figure this out. Maybe @ritchie46 can give some clarity on that specific comment?
We should add `is_first` support to `struct` dtypes. Then you can simply wrap the columns in a `struct` and call `is_first`.

This is also consistent with what we tell everybody to do: if you want to evaluate logic on multiple columns -> wrap it in a struct.
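If that lands, the multi-column question above would presumably reduce to something like this (a sketch of the proposed behavior, not something that works at the time of writing):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# evaluate first-occurrence across both columns by wrapping them in a struct
df.filter(pl.struct(["a", "b"]).is_first())
```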
I can pick this up.
In my head, it would make sense to add `pl.is_first()`, analogous to `pl.sum()`.
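Purely as a shape-of-the-API sketch (hypothetical; no such top-level helper exists):

```python
# hypothetical, mirroring how pl.sum("a") works as a top-level function
df.filter(pl.is_first(["a", "b"]))  # does not exist; sketch only
```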
Implementation of `is_first` for the `pl.list` dtype could also give us a potential solution.