Add duplicate detection to columns with lists (`Expr.arr.has_duplicates`)
Problem description
@ritchie46 (as discussed I created a separate issue)
There is no fast / elegant way to get a boolean mask of a column containing lists with duplicates. A `has_duplicates` expression on `Expr.arr` would be nice. It could also short-circuit on the first duplicate and be much faster than the workaround below.

Currently I am using a workaround:
```python
import polars as pl

df = pl.DataFrame({'a': [[1, 2, 1], [3, 4, 5], [6, 6]]})
df.filter(
    pl.col('a').arr.lengths()
    != pl.col('a').arr.unique().arr.lengths()
)
```

> [[1, 2, 1], [6, 6]]
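For reference, the short-circuiting mentioned above could behave roughly like this pure-Python sketch (illustrative semantics only, not an actual implementation; `has_duplicates` here is just a hypothetical helper name):

```python
def has_duplicates(values) -> bool:
    # Stop at the first repeated value instead of materializing the full
    # unique set and comparing lengths, as the workaround above does.
    seen = set()
    for v in values:
        if v in seen:
            return True
        seen.add(v)
    return False

assert has_duplicates([1, 2, 1]) and not has_duplicates([3, 4, 5])
```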
This would be nicer:
```python
df.filter(
    pl.col('a').arr.has_duplicates()  # does not exist
)
```
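Until something like this exists, another possible workaround (a sketch, assuming a polars version where `Expr.arr.eval`, `pl.element`, and `Expr.arr.first` are available) would be:

```python
df.filter(
    pl.col('a')
    .arr.eval(pl.element().is_duplicated().any())  # one-element list per row
    .arr.first()                                   # unpack to a boolean mask
)
```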
I am not sure whether this makes sense only in the `arr` namespace or also more generally on `Expr`, to enable the following (it might be a little confusing, because it generates a boolean mask in the `arr` namespace but is reduced to a single boolean otherwise :/ ):
```python
df = pl.DataFrame({"a": [1, 2, 3, 1]})
df.with_column(
    pl.when(
        # ~pl.col("a").is_unique().all()     # option 1
        # pl.col("a").is_duplicated().any()  # option 2
        pl.col("a").has_duplicates()  # not implemented, more intuitive & faster
    )
    .then(pl.lit("col a has duplicates"))
    .otherwise(pl.lit("col a has no duplicates"))
)
```
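To make the namespace question concrete, here is a small sketch (using only existing expressions as stand-ins) of the two result shapes: per-row booleans in the `arr` namespace versus a single reduced boolean on `Expr`:

```python
import polars as pl

# arr namespace semantics: one boolean per row (a mask over the lists).
lists = pl.DataFrame({"a": [[1, 2, 1], [3, 4, 5], [6, 6]]})
lists.select(
    (pl.col("a").arr.lengths() != pl.col("a").arr.unique().arr.lengths())
    .alias("has_duplicates")
)
# -> [true, false, true]

# Expr semantics: the whole column reduced to a single boolean.
flat = pl.DataFrame({"a": [1, 2, 3, 1]})
flat.select(
    pl.col("a").is_duplicated().any().alias("has_duplicates")
)
# -> [true]
```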