[OP] Add dropDuplicates() operator
Issue by rnyak
Tuesday May 26, 2020 at 18:00 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/172
Is your operator request related to a problem? Please describe.
The dropDuplicates() method is used in the Outbrain W&D model, and it is one of the most commonly used methods in data preprocessing.
Describe the solution you'd like
A clear and concise description of the operation you'd like to perform on the column. Please include:
- Type (Feature Engineering or Preprocessing): Preprocessing
- input column type(s): Continuous and categorical
- output column type(s): Continuous and categorical
- Expected transformation of the data after application: Returns a DataFrame with duplicate rows removed.
Additional context
cuDF has a drop_duplicates() method (the equivalent of Spark's dropDuplicates()), which is applied as below:
cdf.drop_duplicates(keep='first', inplace=True)
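
As a rough illustration of the expected transformation, here is a minimal sketch that uses pandas as a stand-in, on the assumption that cuDF's drop_duplicates() mirrors the pandas call. The toy data and column names (user_id, ad_id, clicked) are hypothetical and are not taken from the Outbrain dataset.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "ad_id": [10, 10, 20, 21, 30],
    "clicked": [0, 0, 1, 1, 0],
})

# keep="first" retains the first occurrence of each fully duplicated row.
deduped = df.drop_duplicates(keep="first")

# subset= restricts the duplicate check to the listed columns.
deduped_by_user = df.drop_duplicates(subset=["user_id"], keep="first")

print(deduped)
print(deduped_by_user)
```

Returning a new DataFrame rather than using inplace=True keeps the original data available for comparison, which may be a better fit for an operator that is chained with others.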
Can we get the number of duplicates in the Outbrain dataset?
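
One possible way to count duplicates is sketched below with pandas as a stand-in; whether duplicated() is available in a given cuDF version is an assumption to verify. The helper name count_duplicate_rows and the toy data are hypothetical.

```python
import pandas as pd

def count_duplicate_rows(df, subset=None):
    """Count rows that repeat an earlier row (optionally on a column subset)."""
    # duplicated(keep="first") marks every occurrence after the first as True.
    return int(df.duplicated(subset=subset, keep="first").sum())

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "ad_id": [10, 10, 20, 21, 30],
    "clicked": [0, 0, 1, 1, 0],
})

n_full_duplicates = count_duplicate_rows(df)
n_user_duplicates = count_duplicate_rows(df, subset=["user_id"])
print(n_full_duplicates, n_user_duplicates)
```

An equivalent check is len(df) - len(df.drop_duplicates()), which only needs drop_duplicates() itself.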
We can do drop_duplicates with groupby, but we need groupby with multiple columns.
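
The groupby idea could look roughly like the sketch below, again using pandas as a stand-in and assuming multi-column groupby is available. Grouping on every key column and taking the group sizes yields one row per distinct combination, plus a count that shows how many duplicates each had. The data and column names are hypothetical. Note that groupby drops rows with null keys by default, so this is not an exact replacement for drop_duplicates() when nulls are present.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "ad_id": [10, 10, 20, 21, 30],
    "clicked": [0, 0, 1, 1, 0],
})

keys = ["user_id", "ad_id", "clicked"]

# One row per distinct key combination, with the number of occurrences.
counts = df.groupby(keys, sort=False).size().reset_index(name="count")

# Dropping the count column leaves the de-duplicated rows.
unique_rows = counts.drop(columns="count")

# Every occurrence beyond the first in a group is a duplicate.
n_duplicates = int((counts["count"] - 1).sum())

print(unique_rows)
print(n_duplicates)
```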