
[OP] Add dropDuplicates() operator

Open benfred opened this issue 5 years ago • 2 comments

Issue by rnyak, Tuesday May 26, 2020 at 18:00 GMT. Originally opened as https://github.com/rapidsai/recsys/issues/172


Is your operator request related to a problem? Please describe.

The dropDuplicates() method is used in the Outbrain W&D model, and it is a commonly used data preprocessing step.

Describe the solution you'd like

A clear and concise description of the operation you'd like to perform on the column. Please include:

  • Type (Feature Engineering or Preprocessing): Preprocessing
  • input column type(s): Continuous and categorical
  • output column type(s): Continuous and categorical
  • Expected transformation of the data after application: Return DataFrame with duplicate rows removed.

Additional context

cuDF provides the equivalent drop_duplicates() method, which is applied as below:

cdf.drop_duplicates(keep='first', inplace=True)
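To make the expected transformation concrete, here is a minimal sketch on a toy frame. It uses pandas as a CPU stand-in for cuDF, on the assumption that the `keep` argument behaves the same way in both; the column names and values are made up for illustration and are not taken from the actual dataset. It also shows one way to count duplicate rows before dropping them.

```python
import pandas as pd

# pandas stands in for cuDF here (assumption: same `keep` semantics).
# Column names and values are illustrative only.
df = pd.DataFrame(
    {
        "display_id": [1, 1, 2, 3, 3],
        "ad_id": [10, 10, 20, 30, 31],
    }
)

# Count rows that repeat an earlier row across all columns.
n_duplicates = int(df.duplicated(keep="first").sum())

# Keep the first occurrence of each distinct row and renumber the index.
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates)
print(deduped)
```

Only the second row repeats an earlier row in full, so it is the only one removed; rows that share `display_id` but differ in `ad_id` are kept.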

benfred • Jun 04 '20 23:06

Can we get the number of duplicates in the Outbrain dataset?

benfred • Jul 13 '20 17:07

We can implement drop_duplicates with groupby, but that requires groupby support for multiple columns.
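The idea of expressing drop_duplicates as a groupby over several columns can be sketched in plain Python as follows. This is an illustration of the technique only, not NVTabular's or cuDF's implementation; the helper name `drop_duplicates_by_group` and the sample rows are hypothetical.

```python
def drop_duplicates_by_group(rows, key_columns):
    """Group rows on several columns and keep the first row of each group.

    Hypothetical helper for illustration; rows are plain dicts.
    """
    first_in_group = {}
    for row in rows:
        # The group key is the tuple of values in all key columns.
        key = tuple(row[col] for col in key_columns)
        # setdefault keeps the row already stored for this key, if any.
        first_in_group.setdefault(key, row)
    # Dicts preserve insertion order, so first occurrences stay in order.
    return list(first_in_group.values())


rows = [
    {"display_id": 1, "ad_id": 10},
    {"display_id": 1, "ad_id": 10},
    {"display_id": 2, "ad_id": 20},
    {"display_id": 1, "ad_id": 11},
]
deduped = drop_duplicates_by_group(rows, ["display_id", "ad_id"])
print(deduped)
```

Grouping on every column reproduces full-row deduplication; grouping on a subset would drop rows that match only on those columns, which is why multi-column keys are needed.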

benfred • Aug 03 '20 17:08