Drop Duplicates (Simple POC)

Open Alon-Alexander opened this issue 2 years ago • 10 comments

This is a simple implementation of drop-duplicates functionality using existing sources. It only supports keeping the first occurrence of each duplicated element.

In the future (or now), we may want to add more aggregators (like last) in order to support a more diverse drop-duplicates functionality.

Alon-Alexander avatar Oct 08 '21 14:10 Alon-Alexander

What is the expected memory usage? Let me test it again this evening with the big dataset and will report back.

JovanVeljanoski avatar Nov 03 '21 08:11 JovanVeljanoski

It's a groupby on all columns, so if every row is unique, we make at least a whole copy of the dataframe (ignoring the temporary cost of building all the hashmaps).
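
To make that estimate concrete, here is a minimal pure-Python sketch of what grouping on every column implies. It is not how vaex implements groupby; `group_on_all_columns` is a hypothetical helper and the rows are made-up data. The point is only that the result holds one entry per distinct row, so a fully unique input produces an output as large as the input.

```python
from collections import Counter


def group_on_all_columns(rows):
    """Hypothetical stand-in for a groupby over all columns.

    Each row (a tuple of column values) is used as the group key, so the
    result has one entry per distinct row. That result is what must fit
    in memory.
    """
    return Counter(rows)


# Made-up data: 1000 rows that are all different ...
all_unique = [(i, i * 2) for i in range(1000)]
# ... and 1000 rows that contain only 10 distinct values.
mostly_dupes = [(i % 10, 0) for i in range(1000)]

unique_groups = group_on_all_columns(all_unique)
dupe_groups = group_on_all_columns(mostly_dupes)

print(len(unique_groups))  # one group per input row
print(len(dupe_groups))    # only the distinct rows remain
```

With few distinct keys the result stays small, which matches the observation later in this thread that the approach works well for one or two key columns.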

maartenbreddels avatar Nov 03 '21 09:11 maartenbreddels

@JovanVeljanoski what about putting a warning in the docstring?

maartenbreddels avatar Dec 07 '21 07:12 maartenbreddels

This is great and a much-needed feature. However, since it uses groupby, it is not truly "out-of-core", because the result (the output of groupby) is in memory. It works very nicely if you only have 1 or 2 keys (columns) on which you want to drop duplicates, but if you want to do it per row for big datasets (with many columns), that is like doing a groupby on all columns, and memory explodes.

For these reasons I prefer this not to be part of vaex-core, where I think everything should work regardless of the data size. Groupby itself is a bit of a special case: it assumes aggregations, but it does return a fully new dataframe. For drop-duplicates, as with dropna for example, the naive expectation is that you get a shallow copy of the dataframe and that the work is done lazily.

In any case, since this is an important feature, I would be happy for this to go into the vaex-contrib package, with a user warning stating how this actually works and that the final output will be in memory.

So @Alon-Alexander, if you can please move this to vaex-contrib, I think we can merge it quickly.

Thanks!

JovanVeljanoski avatar Dec 16 '21 10:12 JovanVeljanoski

So how do I use vaex-contrib if I wanted to test this out?

godsmustbcrazy avatar May 06 '22 21:05 godsmustbcrazy

I don't think this was ever added to vaex-contrib.

So if you want to try it out, you can either check out this branch (probably quite out of date by now), or look at the implementation, which is quite simple, and try to do it yourself... with all the warnings discussed above.

JovanVeljanoski avatar May 07 '22 01:05 JovanVeljanoski

Thanks, I will check it out. I am bypassing this for now by switching back and forth between pandas and vaex dataframes. It works pretty well for my use case, as I have quite a few duplicates across different column combinations that I need to eliminate.

godsmustbcrazy avatar May 07 '22 16:05 godsmustbcrazy

Hi, sorry to be waking this up again, but I recently started using vaex and found it to be an incredibly powerful tool, with only a few things missing to make it perfect for regular use in my work. One of the things I currently use pandas for is dropping duplicates, and even though I followed this thread to implement the drop-duplicates workaround, it still does not do what I want in vaex.

Namely, I would like to drop duplicates based on a couple of selected columns, BUT retain all existing columns in the dataframe. Right now if I use

import vaex

def drop_duplicates(self, columns=None):
    """Return a :class:`DataFrame` object with no duplicates in the given columns.
    .. warning:: The resulting dataframe will be in memory, use with caution.
    :param columns: Column or list of columns to remove duplicates by, defaults to all columns.
    :return: :class:`DataFrame` object with duplicates filtered away.
    """
    if columns is None:
        columns = self.get_column_names()
    # Accept a single column name as well as a list of names.
    if isinstance(columns, str):
        columns = [columns]

    # One output row per distinct key; the helper count column is dropped again.
    return self.groupby(columns, agg={'__hidden_count': vaex.agg.count()}).drop('__hidden_count')

with let's say test = vaexdf.drop_duplicates(columns=["col1", "col2"]), what I get back is a dataframe with only col1 and col2, i.e.:

>>> test.get_column_names()
['col1', 'col2']

Is there a way (even if it's hacky) to get out the full dataframe?

Thank you for developing this amazing tool!

henrikas-svidras avatar May 30 '22 12:05 henrikas-svidras

This implementation does not preserve the other columns the way pandas does. Here is my workaround that tries to mimic pandas, but it's not "lazy" though:

import numpy as np
import vaex

def vaex_drop_duplicates(df, subset, keep="first"):
    # Work on a shallow copy so the caller's dataframe does not gain an "index" column.
    df = df.copy()
    df["index"] = vaex.vrange(0, len(df), dtype=np.int64)
    if keep == "first":
        idxToKeep = df.groupby(subset, agg={"keep": vaex.agg.min("index")})["keep"].to_numpy()
    elif keep == "last":
        idxToKeep = df.groupby(subset, agg={"keep": vaex.agg.max("index")})["keep"].to_numpy()
    elif keep is False:
        # Keep only the rows whose key occurs exactly once.
        idxToKeep = df.groupby(subset, agg={"keep": vaex.agg.min("index"), "count": vaex.agg.count("index")})
        idxToKeep = idxToKeep[idxToKeep["count"] == 1]["keep"].to_numpy()
    else:
        raise ValueError("keep must be 'first', 'last' or False")
    return df[df["index"].isin(idxToKeep)]
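
The selection logic of the workaround above can be sketched in plain Python, without vaex, to show what each `keep` option selects. This is an illustration only: `rows_to_keep` is a hypothetical helper and the keys are made-up data. As in the workaround, the list of surviving row positions is what gets materialized in memory, so it grows with the number of distinct keys.

```python
def rows_to_keep(keys, keep="first"):
    """Return the sorted row positions that survive de-duplication.

    `keys` holds one hashable key per row (for several columns, a tuple of
    their values). Hypothetical helper, for illustration only.
    """
    first, last, count = {}, {}, {}
    for pos, key in enumerate(keys):
        first.setdefault(key, pos)          # smallest position per key
        last[key] = pos                     # largest position per key
        count[key] = count.get(key, 0) + 1  # occurrences per key
    if keep == "first":
        kept = first.values()
    elif keep == "last":
        kept = last.values()
    elif keep is False:
        # Drop every key that occurs more than once.
        kept = (pos for key, pos in first.items() if count[key] == 1)
    else:
        raise ValueError("keep must be 'first', 'last' or False")
    return sorted(kept)


# Made-up keys for five rows, e.g. the values of (col1, col2).
keys = [("a", 1), ("b", 2), ("a", 1), ("c", 3), ("b", 2)]

print(rows_to_keep(keys, "first"))  # [0, 1, 3]
print(rows_to_keep(keys, "last"))   # [2, 3, 4]
print(rows_to_keep(keys, False))    # [3]
```

Because the result is a set of row positions rather than a grouped table, filtering the original dataframe on those positions keeps all of its columns.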

cgjosephlee avatar Oct 20 '22 10:10 cgjosephlee

Hi, I tried this implementation and got a MemoryError. Is there any plan, or a suggestion for how to overcome this?

Thanks, Jonathan

jhexner avatar May 22 '23 09:05 jhexner