mathesar Update design for checking for duplicate rows.

Update design for checking for duplicate rows.

Open kgodey opened this issue 2 years ago • 2 comments

Problem

We have implemented a filter to identify duplicate rows. However, after implementing it, we've realized that it doesn't make sense as a filter like our other filters.

Our other filters work the same regardless of the order in which they are applied. This is not the case for "has duplicates". Depending on the other filters applied, we may end up returning different results and confusing the user. We don't face this problem with any other filter because other filters only rely on the data in a specific row. "Has duplicates" relies on all of the rows visible.

As an example, here's a table:

ID	Title	Release Year	Favorite
1	Dante's Peak	1997	FALSE
2	Forrest Gump	1994	TRUE
3	The Karate Kid	1984	TRUE
4	Dante's Peak	1997	TRUE
5	The Karate Kid	1984	TRUE

Order 1

Imagine the user applies filters in this order: "Year" > 1993, "Favorite" is TRUE, "Title" has duplicates

Filter 1: `"Year" > 1993`

ID	Title	Release Year	Favorite
1	Dante's Peak	1997	FALSE
2	Forrest Gump	1994	TRUE
4	Dante's Peak	1997	TRUE

Filter 2: `"Favorite" is TRUE`

ID	Title	Release Year	Favorite
2	Forrest Gump	1994	TRUE
4	Dante's Peak	1997	TRUE

Filter 3:`"Title" has duplicates`

0 results.

Order 2

But instead if the user applies filter in this order "Year" > 1993,"Title" has duplicates and "Favorite" is TRUE

Filter 1: `"Year" > 1993`

ID	Title	Release Year	Favorite
1	Dante's Peak	1997	FALSE
2	Forrest Gump	1994	TRUE
4	Dante's Peak	1997	TRUE

Filter 2:`"Title" has duplicates`

ID	Title	Release Year	Favorite
1	Dante's Peak	1997	FALSE
4	Dante's Peak	1997	TRUE

Filter 3: `"Favorite" is TRUE`

ID	Title	Release Year	Favorite
4	Dante's Peak	1997	TRUE

Proposed solution

A few ideas:

If we want "has duplicates" to be a filter, we need to introduce the concept of ordering and reordering filters and explaining to the user how their previous filters affect future filters. This seems like a complicated concept to introduce for the alpha release so I think it would be better to avoid this solution.
We could move the functionality to check for duplicates to some other part of the table/view. This could be extended to other functionality in the future – we could support more complicated database queries through this interface.
We could move the functionality to the Query Builder (see #1065)
- Please note that this needs to be ordered with other steps in the query builder to work properly.

Design notes

Please note that this also involves updating the user flow when the user tries to set a UNIQUE constraint on a column and they have non-unique values to be consistent with the new design.

Additional context

"Filters" Concepts page
Technical discussion:
- #846
- Recent meeting notes
- #853

Jan 26 '22 23:01 kgodey

I'll reiterate how I see a transformation like duplicates-only working in the distant future.

In general, to see only duplicates, you have to perform these steps:

Look at which rows are unique on some column
Group them together
Introduce a new column containing the group-size of that row
Filter based on whether the group-size is more than 1

At the moment, we are encapsulating all of the above steps into a single monolithic transformation and calling it duplicates_only.

What I'd like to see happening in the future is that we give the user the general tools to perform above listed operations and he can compose them into the exact above sequence and get the same result. That does require/imply this pipeline-type interface that we've been talking about lately.

Jan 28 '22 13:01 dmos62

This is blocked by #1065 since we'll probably be using the query builder to identify duplicate rows.

Feb 15 '22 18:02 kgodey

I'm marking this as unblocked now that #1065 is completed.

Sep 01 '22 16:09 seancolsen

Closing this, too old, requirements are likely out of date. We can create a new issue if we need this functionality in the future.

Feb 02 '24 17:02 kgodey

mathesar mathesar copied to clipboard

Update design for checking for duplicate rows.

Problem

Order 1

Filter 1: "Year" > 1993

Filter 2: "Favorite" is TRUE

Filter 3:"Title" has duplicates

Order 2

Filter 1: "Year" > 1993

Filter 2:"Title" has duplicates

Filter 3: "Favorite" is TRUE

Proposed solution

Design notes

Additional context

mathesar
mathesar copied to clipboard

Filter 1: `"Year" > 1993`

Filter 2: `"Favorite" is TRUE`

Filter 3:`"Title" has duplicates`

Filter 1: `"Year" > 1993`

Filter 2:`"Title" has duplicates`

Filter 3: `"Favorite" is TRUE`