visidata icon indicating copy to clipboard operation
visidata copied to clipboard

Merge has surprising result

Open reagle opened this issue 3 years ago • 18 comments

I'm creating toy examples for an Anki deck and find unexpected behavior for merge, which I thought might mean: "keep all rows from first sheet and matching rows from other sheets, updating the first sheet's data with the first non-False value." (The intro says, "Merges differences from other sheets into first sheet"; the quick reference says, "merge mostly keep all rows from first selected sheet, except prioritise cells with non-null/non-error values.")

ID	Name	Age
1	Alice	25
2	Bob	30
3	Carol	
4	David	40
ID	Name	Age
2		35
3		30
5	Emily	28

After setting ID as the key column, when making the first sheet the uppermost (shift+k) and merging (& merge), I'd expect:

ID	Name	Age
1	Alice	25
2	Bob	35
3	Carol	30
4	David	40

But I get:

ID	Name	Age
1	Alice	25
2	Bob	30
3	Carol	30
4	David	40
5	Emily	28

I'm surprised Emily is included and that Bob's age is not 35.

The text files are in this zip. Using v2.11.

test.zip

reagle avatar Apr 07 '23 13:04 reagle

BTW: the fact that Bob and Carol's name are not removed is intuitive: I imagine because empty/null value isNull or Falsey. So that's good/expected.

reagle avatar Apr 07 '23 13:04 reagle

I'm surprised Emily is included

I was looking at the example data set from the original feature request in #405. It seems the implementation matches the example provided the feature request: rows 8, 10, and 11 are not present in the initial table, are present in the new table, and then are present in the merged table.

I'm curious to hear what you think the correct intended behavior should be for new rows. Or maybe the documentation should be updated to clarify this behavior?

... and that Bob's age is not 35

I agree and also expected that Bob's age to be 35 instead of 30 in the merged table. I would be interested in taking a stab at this bugfix.

yphillip avatar Jun 18 '23 01:06 yphillip

@yphillip If you wanted to take a stab at it, that would be much appreciated! I'd be happy to review at any point.

saulpw avatar Jun 18 '23 02:06 saulpw

@yphillip I don't feel strongly what the behavior should be; my concern is that it do what a clear description says it will do. I think it makes sense to follow well-established behaviors/conventions (e.g., from SQL).

reagle avatar Jun 19 '23 12:06 reagle

I’m glad to see this addressed, but I’m not sure what the result is. Does the documentation need to be updated?

reagle avatar Oct 21 '23 00:10 reagle

I'm relying on @yphillip's PR #1923 making a reasonable fix, but I haven't taken a deep look into it. Would you be up for checking out the join-merge test and seeing if it does as you expect now? It looks like the fix was to just reverse the order of columns when calculating the merged values.

saulpw avatar Oct 21 '23 01:10 saulpw

I'm surprised Emily is included and that Bob's age is not 35.

Post-bugfix, the result is that Bob's age is now 35 (as you were expecting) and Emily is included (which is not what you were expecting).

ID	Name	Age
1	Alice	25
2	Bob	35
3	Carol	30
4	David	40
5	Emily	28

Emily is included because it seems like that's the behavior that the original feature request described: https://github.com/saulpw/visidata/issues/405 (the join-merge test case)

Left table does not have rows with colA 8, 10, and 11, Right table does have rows with colA 8, 10, and 11 join-merged table does have rows with colA 8, 10, and 11

To summarize, this is the behavior now:

"keep all rows from first sheet and matching rows from other sheets, updating the first sheet's data with the first non-False value. Also, add new rows from second sheet that were not present in first sheet."

Does the documentation need to be updated?

Probably a good idea so that users know exactly what merge does without having to read these issues and PR conversations :) Will leave it up to you all on how/when to update the documentation.

yphillip avatar Oct 22 '23 02:10 yphillip

@jreagle Would you be willing to update the documentation for this issue? I think you might have some good ideas about how to communicate this, especially since you stumbled on it originally and might remember how you looked for the information.

saulpw avatar Nov 02 '23 01:11 saulpw

@saulpw, I was originally pulling from two pieces of documentation:

  1. I believe @anjakefala created/maintains "quick reference"; I thought the doc was maintained in groff, though I see a a file here with a markdown extension and whose content is mostly HTML here. Perhaps they can speak to how to update that doc?
  2. @jsvine maintains the introduction, which says it takes its direction from the quick reference. Perhaps once the quick reference is addressed, they might be willing to give a bit more of a description using the new understanding?

reagle avatar Nov 02 '23 13:11 reagle

@reagle: Thanks and yes, I'd be happy to tweak the tutorial's description of the merge-join once the quick reference language is finalized.

jsvine avatar Nov 02 '23 16:11 jsvine

@reagle yes, the quick reference is old-school manpage groff format, though it's such a pain to maintain that we should probably migrate to a simpler if less featureful source format. For now though, if we know what we want it to say, @anjakefala should be able to update the actual source file in short order. So it sounds like we basically just need the wording that should go in the manpage/quickref guide, and then that will trickle down into the tutorial.

saulpw avatar Nov 02 '23 19:11 saulpw

@reagle Just to add that the .md you found is actually created from the groff file (https://github.com/saulpw/visidata/blob/develop/dev/mkman.sh). I put that together 6 years ago, it was my first major VisiData contribution. If I built it today, I would not use groff, and would have tried to find a way to avoid checking in generated files. =)

But yes! Let me know the text, and I will add it to the manpage.

anjakefala avatar Nov 02 '23 20:11 anjakefala

@anjakefala, @yphillip wrote "keep all rows from first sheet and matching rows from other sheets, updating the first sheet's data with the first non-False value. Also, add new rows from second sheet that were not present in first sheet." I know that's more than twice as long as the current prose, but I'm hesitant to edit for fear of changing it. If you need it shorter, I'd defer to @yphillip .

reagle avatar Nov 02 '23 20:11 reagle

@reagle @jsvine Updated! https://github.com/saulpw/visidata/commit/1599b9193eac4e3c1f3d67e700235b7ca1551f8d

anjakefala avatar Nov 03 '23 01:11 anjakefala

Thanks! Just to make sure I'm translating correctly for the tutorial: What counts as "False-y"? As I understand it, different languages/contexts define this in different ways, yeah? E.g., is the integer 0 False-y?

jsvine avatar Nov 03 '23 19:11 jsvine

@jsvine It's currently the Python False-y (https://github.com/saulpw/visidata/blob/develop/visidata/features/join.py#L159). We did because we didn't want an empty string to take precedent; this may have been before options.null_value, but that still seems unwieldy for this join. I'm open to discussion, but for now let's document the 0/False/empty string/empty list behavior.

saulpw avatar Nov 08 '23 04:11 saulpw

Thanks, and now updated: https://github.com/jsvine/intro-to-visidata/commit/c32818517c31ade31d05eef116bff4baa96da376

jsvine avatar Nov 17 '23 19:11 jsvine

BTW: I noticed in my Anki tutorial that the description did not match the examples: it showed Bob's age was 35, not 30. I've fixed this, but I think the in-program hint still needs help.

current: "merge - merge differences from other sheets into first sheet (including new rows)"

"difference" is too inclusive; I wrestled with something more accurate, and I think this is, though it is also a little longer:

proposed: "merge - all rows from all sheets; where keys match, update first sheet's Falsey values"

reagle avatar Jul 12 '24 19:07 reagle

This is a tricky one to explain. The merge join uses the most recent truthy value; if muiltiple sheets are joined, the latest truthy value is used, even if earlier sheets had truthy values. So, more than just 'False-y' values are updated.

I propose "'all rows from all sheets; where keys match, use latest truthy value". What do we think about this?

saulpw avatar Sep 05 '24 21:09 saulpw

Very good! BTW: Looking around, it seems that Falsy (without an 'e') is the common spelling.

reagle avatar Sep 06 '24 12:09 reagle

While we're on the topic of joins, there is another surprising behavior that I'd like to see documented: how key columns are matched for inner/outer joins.

When I first used joins, I expected that key column matches would be based on column names. For example, if a sheet has two key columns:

keyYear | keyMonth || TransactionCount

and I am joining it to another, I thought the join would be made by finding key columns called keyYear and keyMonth.

Instead, it seems like visidata just matches by the key columns by lining them up in order, regardless of name: the 1st key column value in sheet A must match the value under 1st key column in sheet B; the value in the 2nd column in sheet A has to match the value in the 2nd column in sheet B, etc.

I think these comparisons should be explained somewhere more prominently.

In particular, there is a join that I naively expected would work: joining two sheets with key columns that have identical names, but in different order in each sheet.

midichef avatar Sep 07 '24 04:09 midichef

I wanted people to be able to join without having to rename columns. We could make it check for same-named columns first, and only fall-back to column order if all names don't match. Does that seem reasonable?

saulpw avatar Sep 07 '24 04:09 saulpw

That surprises me too @midichef, but I’ve only done single column joins so didn’t encounter this.

reagle avatar Sep 07 '24 10:09 reagle

I wanted people to be able to join without having to rename columns. We could make it check for same-named columns first, and only fall-back to column order if all names don't match. Does that seem reasonable?

Yes, that's a good idea, and I think that'll be useful.

midichef avatar Sep 13 '24 19:09 midichef