mathesar icon indicating copy to clipboard operation
mathesar copied to clipboard

Update design for checking for duplicate rows.

Open kgodey opened this issue 2 years ago • 2 comments

Problem

We have implemented a filter to identify duplicate rows. However, after implementing it, we've realized that it doesn't make sense as a filter like our other filters.

Our other filters work the same regardless of the order in which they are applied. This is not the case for "has duplicates". Depending on the other filters applied, we may end up returning different results and confusing the user. We don't face this problem with any other filter because other filters only rely on the data in a specific row. "Has duplicates" relies on all of the rows visible.

As an example, here's a table:

ID Title Release Year Favorite
1 Dante's Peak 1997 FALSE
2 Forrest Gump 1994 TRUE
3 The Karate Kid 1984 TRUE
4 Dante's Peak 1997 TRUE
5 The Karate Kid 1984 TRUE

Order 1

Imagine the user applies filters in this order: "Year" > 1993, "Favorite" is TRUE, "Title" has duplicates

Filter 1: "Year" > 1993

ID Title Release Year Favorite
1 Dante's Peak 1997 FALSE
2 Forrest Gump 1994 TRUE
4 Dante's Peak 1997 TRUE

Filter 2: "Favorite" is TRUE

ID Title Release Year Favorite
2 Forrest Gump 1994 TRUE
4 Dante's Peak 1997 TRUE

Filter 3:"Title" has duplicates

0 results.

Order 2

But instead if the user applies filter in this order "Year" > 1993,"Title" has duplicates and "Favorite" is TRUE

Filter 1: "Year" > 1993

ID Title Release Year Favorite
1 Dante's Peak 1997 FALSE
2 Forrest Gump 1994 TRUE
4 Dante's Peak 1997 TRUE

Filter 2:"Title" has duplicates

ID Title Release Year Favorite
1 Dante's Peak 1997 FALSE
4 Dante's Peak 1997 TRUE

Filter 3: "Favorite" is TRUE

ID Title Release Year Favorite
4 Dante's Peak 1997 TRUE

Proposed solution

A few ideas:

  • If we want "has duplicates" to be a filter, we need to introduce the concept of ordering and reordering filters and explaining to the user how their previous filters affect future filters. This seems like a complicated concept to introduce for the alpha release so I think it would be better to avoid this solution.
  • We could move the functionality to check for duplicates to some other part of the table/view. This could be extended to other functionality in the future – we could support more complicated database queries through this interface.
  • We could move the functionality to the Query Builder (see #1065)
    • Please note that this needs to be ordered with other steps in the query builder to work properly.

Design notes

  • Please note that this also involves updating the user flow when the user tries to set a UNIQUE constraint on a column and they have non-unique values to be consistent with the new design.

Additional context

kgodey avatar Jan 26 '22 23:01 kgodey

I'll reiterate how I see a transformation like duplicates-only working in the distant future.

In general, to see only duplicates, you have to perform these steps:

  • Look at which rows are unique on some column
  • Group them together
  • Introduce a new column containing the group-size of that row
  • Filter based on whether the group-size is more than 1

At the moment, we are encapsulating all of the above steps into a single monolithic transformation and calling it duplicates_only.

What I'd like to see happening in the future is that we give the user the general tools to perform above listed operations and he can compose them into the exact above sequence and get the same result. That does require/imply this pipeline-type interface that we've been talking about lately.

dmos62 avatar Jan 28 '22 13:01 dmos62

This is blocked by #1065 since we'll probably be using the query builder to identify duplicate rows.

kgodey avatar Feb 15 '22 18:02 kgodey

I'm marking this as unblocked now that #1065 is completed.

seancolsen avatar Sep 01 '22 16:09 seancolsen

Closing this, too old, requirements are likely out of date. We can create a new issue if we need this functionality in the future.

kgodey avatar Feb 02 '24 17:02 kgodey