
Unique / Duplicate Enhancements

Open Julian-J-S opened this issue 2 years ago • 5 comments

Problem description

This is an extension to #5590, with some examples and explanations.

Problems / Issues

  1. unique: There is no way to remove ALL duplicates, as in pandas (currently it is only possible to keep the first/last of each set of duplicate rows)
  2. unique / is_unique: There is no way to get the same result from unique as from filtering with is_unique when duplicates are present, which is really awkward
  3. duplicate: There is no concise/semantic way to get the duplicate rows. duplicate should be added for completeness (is_unique + unique are available, but only is_duplicated)
  4. is_duplicated: There is no keep argument like in pandas duplicated to specify which duplicates to mark
  5. There is no fast/elegant way to get a boolean mask for a column of lists indicating which lists contain duplicates. has_duplicates would be nice on Expr.arr (currently using a workaround comparing the unique length and the original length)

1. unique: add another keep option

  • polars unique cannot remove ALL duplicates from the original data
  • it only has the ability to keep the first or last duplicate row
df = pl.DataFrame({"a": [1, 2, 3, 1]})
df.unique(keep="first")
> [1, 2, 3]
df.unique(keep="last")
> [2, 3, 1]
  • I would like to be able to remove all duplicates
  • pandas has drop_duplicates with keep=False
  • something like the following would be nice:
df.unique(keep="none") # doesn't exist
> [2, 3]

2. unique and is_unique: inconsistent results

  • polars unique and is_unique NEVER return the same result if duplicates are present, which feels awkward:
df = pl.DataFrame({"a": [1, 2, 3, 1]})
df.unique(keep="first") 
> [1, 2, 3]
df.unique(keep="last")
> [2, 3, 1]
df.filter(pl.col("a").is_unique())
> [2, 3]
  • is_unique should also have the keep argument (first/last/none)
  • keep="none" would be the current behavior of is_unique
  • keep="first/last" is like reading a book and marking the first/last time you see a word
df.is_unique(keep="none")  # default / current behavior
> [False, True, True, False]
df.is_unique(keep="first")
> [True, True, True, False]
df.is_unique(keep="last")
> [False, True, True, True]

3. duplicate method

  • with is_unique + unique + is_duplicated available, it feels like duplicate is missing
  • of course this could be achieved with the available methods, but having a clean, concise and consistent API is very important in my opinion
  • duplicate should have the same subset and keep arguments as unique
df = pl.DataFrame({"a": [1, 2, 3, 2, 1]})
df.duplicate(keep="all")
> [1, 2, 2, 1]
df.duplicate(keep="first")
> [1, 2]
df.duplicate(keep="last")
> [2, 1]

4. is_duplicated: add keep argument

  • polars is_duplicated is missing the keep argument
  • current behavior:
df = pl.DataFrame({"a": [1, 2, 3, 2, 1]})
df.is_duplicated()
> [True, True, False, True, True]
  • adding keep argument:
df.is_duplicated(keep="all")  # default
> [True, True, False, True, True]
df.is_duplicated(keep="first")
> [True, True, False, False, False]
df.is_duplicated(keep="last")
> [False, False, False, True, True]

5. has_duplicates on Expr.arr

  • currently filtering columns of lists for duplicates is a bit awkward (maybe there is a better way?)
df = pl.DataFrame({'a': [[1, 2, 1], [3, 4, 5], [6, 6]]})
df.filter(
    pl.col('a').arr.lengths()
    != pl.col('a').arr.unique().arr.lengths()
)
> [[1, 2, 1], [6, 6]]
  • has_duplicates would be nice to have on Expr.arr
  • this could also use short-circuiting and be much faster
df.filter(
    pl.col('a').arr.has_duplicates()
)

Summary: Feature Requests

  1. unique: add keep="none" (or None/False) to remove all duplicates (available in pandas drop_duplicates)
  2. is_unique: add keep argument to match unique and get consistent results
  3. duplicate: add this method to complement is_unique, unique and is_duplicated
  4. is_duplicated: add keep argument to match unique (available in pandas duplicated)
  5. Expr.arr.has_duplicates: add this to get fast/efficient boolean mask if list contains duplicates

Julian-J-S avatar Jan 08 '23 20:01 Julian-J-S

We can filter by columns that are unique: is_unique.

df = pl.DataFrame({"a": [1, 2, 3, 1]})
df.filter(pl.col("a").is_unique())

We changed the name from distinct to unique. This discussion has been had before; maybe distinct is a better name. I don't think is_unique and distinct should have the same results. Maybe we should name the method distinct to prevent this semantic confusion.

is_unique should only have one answer: whether a column value is unique, meaning there is only one in that set. If you want the first, you combine is_unique and is_first; that's the whole idea of expressions: reduce the API surface so that you can combine and cherry-pick the logic you need.

The same logic applies to is_duplicated, it should only have a yes or no answer.

  • Expr.arr.has_duplicates: add this to get fast/efficient boolean mask if list contains duplicates

Could you make a separate issue for this one? I think we should add this.

  • unique: add keep="none" (or None/False) to remove all duplicates (available in pandas drop_duplicates)

I think we can add this option :+1:

ritchie46 avatar Jan 09 '23 07:01 ritchie46

is_unique should only have one answer, if a column value is unique

How can I replicate the subset= argument of unique() using is_first, i.e. evaluate uniqueness across multiple columns? As of now, when I pass multiple columns into is_first, it evaluates them independently from each other and returns multiple columns. I could shoehorn something using concat_str, but that is obviously not as efficient as .unique().

I've been scratching my head about this one.

DrMaphuse avatar Jan 30 '23 10:01 DrMaphuse

So have I, to be honest. It's been on my to-do to figure this out. Maybe @ritchie46 can give some clarity about that specific comment?

stinodego avatar Jan 30 '23 10:01 stinodego

We should add is_first support to struct dtypes. Then you can simply wrap the struct and call is_first.

This also is consistent across what we tell everybody to do. If you want to evaluate logic on multiple columns -> wrap it in a struct.

I can pick this up.

ritchie46 avatar Jan 30 '23 10:01 ritchie46

In my head, it would make sense to add pl.is_first(), analogous to pl.sum().

Implementation of is_first for pl.list dtype could also give us a potential solution.

DrMaphuse avatar Jan 30 '23 10:01 DrMaphuse