Improve the table diff sample output in CLI

Open izeigerman opened this issue 1 year ago • 10 comments

Currently the printed output is too wide and unreadable even for tables with a handful of columns, since we include columns from both the source and the target table as part of the same row.

Here are some ideas on how to make the output more digestible:

  • Print individual column pairs. This way the width is bounded by the two columns being compared plus the join keys.
  • Let the user select which columns should be included in the sample.
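
The first idea could be sketched roughly as follows. This is only an illustration, not the sqlmesh implementation: the function name `column_pair_views`, the `s__`/`t__` column prefixes, and the list-of-dicts row format are all assumptions made for the example.

```python
def column_pair_views(rows, keys, columns, source_prefix="s__", target_prefix="t__"):
    """Yield (column, narrow_rows) for every column that has mismatches.

    `rows` is a joined sample: each dict holds the join keys plus one
    source-prefixed and one target-prefixed entry per compared column.
    The prefixes are hypothetical and only used for this sketch.
    """
    for col in columns:
        s_col, t_col = source_prefix + col, target_prefix + col
        # Keep only the join keys and the two columns being compared,
        # and only the rows where the two sides disagree.
        narrow = [
            {**{k: row[k] for k in keys}, s_col: row[s_col], t_col: row[t_col]}
            for row in rows
            if row[s_col] != row[t_col]
        ]
        if narrow:
            yield col, narrow


sample = [
    {"id": 1, "s__name": "a", "t__name": "a", "s__price": 10, "t__price": 12},
    {"id": 2, "s__name": "b", "t__name": "c", "s__price": 5, "t__price": 5},
]
for column, narrow_rows in column_pair_views(sample, ["id"], ["name", "price"]):
    print(column, narrow_rows)
```

Each printed table is at most `len(keys) + 2` columns wide, regardless of how many columns the compared tables have.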

izeigerman avatar May 21 '24 17:05 izeigerman

Yes 100% @izeigerman

Show sample is useless beyond a certain width. Even if you pipe it to a file, the terminal-width-based wrapping will still happen. Either have column selection with a sequence of glob patterns using fnmatch, or alternatively consider a --tall flag for the sample that does option A (individual column pairs).
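
A minimal sketch of glob-based column selection with the standard library's `fnmatch` module. The helper name `select_columns` and the way patterns would reach it (for example from a repeated CLI option) are assumptions for illustration, not an existing sqlmesh API.

```python
from fnmatch import fnmatchcase


def select_columns(columns, patterns):
    """Return the columns matching any glob pattern, in their original order.

    An empty pattern list means "no filter" and returns every column.
    """
    if not patterns:
        return list(columns)
    # fnmatchcase is case-sensitive on every platform, unlike fnmatch.fnmatch.
    return [c for c in columns if any(fnmatchcase(c, p) for p in patterns)]


all_columns = ["id", "price_usd", "price_eur", "name"]
print(select_columns(all_columns, ["price_*", "id"]))
```

Iterating over the columns rather than over the patterns keeps the table's own column order and avoids duplicates when a column matches more than one pattern.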

Also, the current console printer wrapping is annoying in that there is no workaround. Maybe a --plain flag is useful for dumping to a file and you can use print directly to actuate that.

z3z1ma avatar May 21 '24 18:05 z3z1ma

Is there a reason we don't show data diffs side by side like a git diff? We do it in the UI but not for the CLI. I imagine a lot of people want something similar in the CLI.

Example library that does this: https://github.com/paulfitz/daff

sungchun12 avatar May 21 '24 19:05 sungchun12

Oh yeah daff looks sick actually @sungchun12 -- it seems absolutely perfect to be honest 👀 🤔

z3z1ma avatar May 21 '24 20:05 z3z1ma

git diff/patch is a good way to look at the problem

z3z1ma avatar May 21 '24 20:05 z3z1ma

Hello there,

I made a data-diff tool for pyspark, and in the process I also made a generic library to create interactive html reports. They are both open source, and the data-diff-viewer does not need Spark (only duckdb-wasm to embed the diff report inside the html). I also started making a similar data-diff based on ibis instead of PySpark, but it's not ready yet.

I would be happy to discuss this if you want.

FurcyPin avatar May 30 '24 14:05 FurcyPin

@izeigerman you okay with me taking this on?

I have lessons learned fresh on my mind from working on data-diff before it was sunset that I want to use up before those memories fade.

I have a couple of improvements worth considering, such as displaying row counts for the following categories (demo: https://www.loom.com/share/b2a421a011854545aafe9f6186f163fc):

  • unchanged
  • removed
  • different
  • added

on top of the work you did here: https://github.com/TobikoData/sqlmesh/pull/2644
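
For illustration, the four row-count categories could be computed along these lines. This is a sketch under assumptions: rows are plain dicts, a single column named by `key` identifies a row, and the function name `row_status_counts` is made up for the example.

```python
def row_status_counts(source, target, key):
    """Count rows by status, matching source and target rows on `key`."""
    src = {row[key]: row for row in source}
    tgt = {row[key]: row for row in target}
    counts = {"unchanged": 0, "removed": 0, "different": 0, "added": 0}
    for k, row in src.items():
        if k not in tgt:
            counts["removed"] += 1      # only in the source
        elif row == tgt[k]:
            counts["unchanged"] += 1    # same key, same values
        else:
            counts["different"] += 1    # same key, at least one value differs
    # Keys that exist only in the target are additions.
    counts["added"] = sum(1 for k in tgt if k not in src)
    return counts


source_rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
target_rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "x"}, {"id": 4, "v": "d"}]
print(row_status_counts(source_rows, target_rows, "id"))
```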

sungchun12 avatar Jun 04 '24 17:06 sungchun12

Just adding to the conversation, but it'd be awesome if the CLI tool could "incrementally" perform the diff with greater and greater degrees of strictness. It would make this TDD curmudgeon very very happy.

For everything Pandas gets wrong, this is one thing they kinda get right, although their error messages are a bit too uninformative at certain steps.

I think my ideal workflow/priority order would go something like column names -> column types -> row count -> primary ID match -> column value match
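
That ordering could look something like the sketch below. Everything here is hypothetical: the function name `staged_diff`, schemas passed as `{column: type}` dicts, and rows as plain dicts keyed by a single ID column. It reports every stage rather than stopping at the first mismatch; a caller that wants stop-at-first behavior can simply take the first failing stage.

```python
def staged_diff(source_schema, target_schema, source_rows, target_rows, key):
    """Run checks from coarse to strict and return a list of (stage, passed)."""
    stages = []
    stages.append(("column names", set(source_schema) == set(target_schema)))

    # Compare types only for columns that exist on both sides.
    shared = set(source_schema) & set(target_schema)
    stages.append(
        ("column types", all(source_schema[c] == target_schema[c] for c in shared))
    )

    stages.append(("row count", len(source_rows) == len(target_rows)))

    src = {row[key]: row for row in source_rows}
    tgt = {row[key]: row for row in target_rows}
    stages.append(("primary ID match", src.keys() == tgt.keys()))

    # Compare values only for IDs present in both tables.
    common = src.keys() & tgt.keys()
    stages.append(("column value match", all(src[k] == tgt[k] for k in common)))
    return stages


schema = {"id": "int", "v": "text"}
result = staged_diff(
    schema,
    schema,
    [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    [{"id": 1, "v": "a"}, {"id": 2, "v": "z"}],
    "id",
)
first_mismatch = next((stage for stage, passed in result if not passed), None)
print(result, first_mismatch)
```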

schlich avatar Jun 06 '24 01:06 schlich

@schlich When you say "incrementally", do you want data diffing to have more surgical options, such as displaying only column name changes and stopping there if some criterion fails, OR are you suggesting an execution/display order?

I believe you mean execution/display order, but let me know otherwise!

sungchun12 avatar Jun 06 '24 17:06 sungchun12

well, a little bit of both maybe? i'm also referencing pytest's -x flag, which stops at the first failure even if you have many. But it's also kind of just a natural progression of "accuracy" as your transformations develop

schlich avatar Jun 06 '24 23:06 schlich

@schlich I disagree with stopping at "failure" because that's an opinion. There are situations where many diffs or few diffs can be a good thing. We're aligned on natural progression though. I'll have to think through if we vastly change the format for that display order because I've been playing with it more and learned some of my suggestions are already covered but with different UX.

sungchun12 avatar Jun 07 '24 00:06 sungchun12