
Add duplicate detection to columns with lists (`Expr.arr.has_duplicates`)


Problem description

@ritchie46 (as discussed, I created a separate issue)

There is no fast / elegant way to get a boolean mask for a column of lists that contain duplicates. A `has_duplicates` method on `Expr.arr` would be nice. It could also short-circuit on the first duplicate found and therefore be much faster.

Currently I am using this workaround:

```python
df = pl.DataFrame({"a": [[1, 2, 1], [3, 4, 5], [6, 6]]})
df.filter(
    # a list has duplicates iff its length differs from the number of unique elements
    pl.col("a").arr.lengths()
    != pl.col("a").arr.unique().arr.lengths()
)
# > [[1, 2, 1], [6, 6]]
```

This would be nicer:

```python
df.filter(
    pl.col('a').arr.has_duplicates()  # does not exist
)
```
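To illustrate the short-circuiting behaviour I have in mind, here is a minimal Python-level sketch. The local `list_has_duplicates` helper is just for illustration (not the proposed expression), and `Expr.apply` round-trips through Python, so this only shows the intended semantics, not the speed of a native implementation:

```python
import polars as pl

def list_has_duplicates(xs) -> bool:
    # Stop at the first repeated element instead of materializing
    # the full set of unique values, which is what a native
    # implementation could also do.
    seen = set()
    for x in xs:
        if x in seen:
            return True
        seen.add(x)
    return False

df = pl.DataFrame({"a": [[1, 2, 1], [3, 4, 5], [6, 6]]})
df.filter(
    pl.col("a").apply(list_has_duplicates, return_dtype=pl.Boolean)
)
```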

I am not sure whether this only makes sense in the `arr` namespace or also more generally on `Expr`, which would enable the following (it might be a little confusing because it generates a boolean mask in the `arr` namespace but reduces to a single boolean otherwise :/ ):

df = pl.DataFrame({"a": [1, 2, 3, 1]})
df.with_column(
    pl.when(
        # ~pl.col("a").is_unique().all()     # option 1
        # pl.col("a").is_duplicated().any()  # option 2
        pl.col("a").has_duplicates()         # not implemented, more intuitive & faster
    )
    .then("col a has duplicates")
    .otherwise("col a has no duplicates")
)
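For comparison, both existing options can already be written today; the proposed `has_duplicates` would mainly be more readable and could short-circuit. A small sketch:

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 1]})

# option 1: "not all values are unique"
df.select(~pl.col("a").is_unique().all())

# option 2: "any value is duplicated"
df.select(pl.col("a").is_duplicated().any())
```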

Julian-J-S (Jan 09 '23, 10:01)