New find_duplicates component

Open AdRiley opened this issue 1 year ago • 1 comments

Use case:

I have a dataset that I expect to be distinct on a number of columns, but it is not. I want to look at all of the non-distinct (duplicate) records to find out why they are not distinct.

Proposed new component find_duplicates or find_non_distinct.

Takes a table and a vector of columns to find the duplicates on. Returns all records that have duplicates across the selected columns.

    Table.find_duplicates self group_by =
        any3 = self.aggregate group_by [(Aggregate_Column.Count 'InternalCount')]
        any4 = any3.filter 'InternalCount' (Filter_Condition.Not_Equal 1)
        table4 = self.join any4 Join_Kind.Inner group_by
        any6 = table4.remove_columns ['InternalCount']
        any6
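
For readers less familiar with Enso, the same three steps (count rows per key, keep keys whose count is not 1, join back to the original rows) can be sketched in plain Python. This is only an illustration of the intended semantics, not Enso code: the function `find_duplicates`, the list-of-dicts table representation and the sample rows are all assumptions made for the example.

```python
from collections import Counter


def find_duplicates(rows, key_columns):
    """Return every row whose key (its values in key_columns) occurs more than once.

    Hypothetical helper: `rows` is a list of dicts standing in for a table.
    """
    def key_of(row):
        return tuple(row[column] for column in key_columns)

    # Step 1: count rows per key (the aggregate / Count step).
    counts = Counter(key_of(row) for row in rows)
    # Steps 2 and 3: keep only rows whose key count is not 1, which has the
    # same effect as filtering the counts and inner-joining back to the table.
    return [row for row in rows if counts[key_of(row)] != 1]


# Made-up sample data for the illustration.
rows = [
    {"id": 1, "name": "a", "city": "X"},
    {"id": 2, "name": "a", "city": "X"},
    {"id": 3, "name": "b", "city": "Y"},
    {"id": 4, "name": "a", "city": "Z"},
]

duplicates = find_duplicates(rows, ["name", "city"])
# Rows 1 and 2 share the key ("a", "X"), so both are returned.
```

Note that every member of a duplicated group is returned, including the first occurrence, which is what makes the result useful for investigating why the records are not distinct.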

AdRiley avatar Jun 04 '24 11:06 AdRiley

I think the signature should be a mirror of distinct:

    duplicates : Vector (Integer | Text | Regex) | Text | Integer | Regex -> Case_Sensitivity -> Boolean -> Problem_Behavior -> Table ! No_Output_Columns | Missing_Input_Columns | No_Input_Columns_Selected | Floating_Point_Equality
    duplicates self (columns = self.column_names) case_sensitivity=Case_Sensitivity.Default error_on_missing_columns=True on_problems=Report_Warning =

We should also consider whether Vector.duplicates should be aligned so that it returns all duplicates (I think it currently returns just one entry per duplicated value).
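
To make the two candidate behaviours concrete, here is a small Python sketch. It does not claim to describe what Vector.duplicates does today; both function names are hypothetical and exist only to contrast "one entry per duplicated value" with "every entry that has a duplicate".

```python
from collections import Counter


def duplicated_values(items):
    """One entry per value that occurs more than once, in first-seen order."""
    counts = Counter(items)
    seen = set()
    result = []
    for item in items:
        if counts[item] > 1 and item not in seen:
            seen.add(item)
            result.append(item)
    return result


def all_duplicate_entries(items):
    """Every entry whose value occurs more than once, in original order."""
    counts = Counter(items)
    return [item for item in items if counts[item] > 1]


values = [1, 2, 2, 3, 1, 1]
one_per_value = duplicated_values(values)      # [1, 2]
every_entry = all_duplicate_entries(values)    # [1, 2, 2, 1, 1]
```

The second form mirrors the table-level proposal, where all rows in a duplicated group are returned rather than a single representative.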

jdunkerley avatar Jun 04 '24 11:06 jdunkerley