
Support drop_duplicates(subset=...)

Open mrocklin opened this issue 2 years ago • 1 comment

(This is a little ramble-y. Sorry.)

Probably we could make drop_duplicates more scalable in the case where folks subset on columns. We would still probably have to fit information about every row for the relevant columns in memory, but this is quite different from reducing the entire dataset (all columns) to a single partition.

My thinking is that for each value of the subsetted columns we want to know that only a single partition will hold onto that value. One way to do this would be to build up a mapping of value to partition number like the following:

df.drop_duplicates(subset=["id"])
{ 
    (100,): 3,
    (101,): 0,
    (102,): 5,
    ...
}

Here the keys are the subset-column values and the values are the index of the partition in which that value is allowed to remain. Building this mapping requires a full reduction of this column to a single dataframe (or we could split things up if we wanted, similar to split_out). That could probably be a pandas DataFrame with a MultiIndex?
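A minimal sketch of that reduction, using plain pandas DataFrames as stand-ins for Dask partitions (the names `partitions`, `pieces`, and `mapping` are illustrative, not Dask internals):

```python
import pandas as pd

# Two stand-in partitions; note that id 101 appears in both.
partitions = [
    pd.DataFrame({"id": [100, 101], "x": [1, 2]}),
    pd.DataFrame({"id": [101, 102], "x": [3, 4]}),
]
subset = ["id"]

# Each partition contributes its unique subset values, tagged with its index.
pieces = [
    part[subset].drop_duplicates().assign(partition=i)
    for i, part in enumerate(partitions)
]

# Reduce to one frame; the first partition seen "wins" each value.
mapping = (
    pd.concat(pieces)
    .drop_duplicates(subset=subset, keep="first")
    .set_index(subset)["partition"]
)
print(mapping.to_dict())  # {100: 0, 101: 0, 102: 1}
```

Only the subset columns plus one integer per distinct value ever need to be held in one place, which is the memory win over reducing all columns to a single partition.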

Then we would need to broadcast that mapping back out to all of the partitions, so each one can keep only the rows assigned to it.
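Continuing the sketch, once each partition has the mapping it can drop rows owned by another partition and then deduplicate locally. Again this is a hedged illustration with made-up names (`filter_partition`), not proposed Dask code:

```python
import pandas as pd

partitions = [
    pd.DataFrame({"id": [100, 101], "x": [1, 2]}),
    pd.DataFrame({"id": [101, 102], "x": [3, 4]}),
]
mapping = {100: 0, 101: 0, 102: 1}  # value -> owning partition index

def filter_partition(part, i, mapping, subset):
    # Keep only rows whose subset value is assigned to this partition,
    # then drop duplicates locally (single-column subset for simplicity).
    owner = part[subset[0]].map(mapping)
    return part[owner == i].drop_duplicates(subset=subset)

result = pd.concat(
    [filter_partition(p, i, mapping, ["id"]) for i, p in enumerate(partitions)]
)
print(result)  # one row each for ids 100, 101, 102
```

Each partition's filter is embarrassingly parallel once the (small) mapping has been broadcast.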

mrocklin avatar Jun 26 '23 19:06 mrocklin

The proposed algorithm very likely would also allow us to implement duplicated, see https://github.com/dask/dask/issues/1854

If #1854 is solved, the drop would merely be an additional filter step.
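The pandas API already shows this relationship, which is presumably what the partition-level step would reduce to: given a per-row `duplicated` mask, `drop_duplicates` is just a boolean filter.

```python
import pandas as pd

df = pd.DataFrame({"id": [100, 101, 101, 102], "x": [1, 2, 3, 4]})

dup = df.duplicated(subset=["id"], keep="first")  # boolean mask per row
dropped = df[~dup]  # the "additional filter step"

# Equivalent to calling drop_duplicates directly.
assert dropped.equals(df.drop_duplicates(subset=["id"]))
```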

fjetter avatar Jul 03 '23 08:07 fjetter