spark-fast-tests icon indicating copy to clipboard operation
spark-fast-tests copied to clipboard

Make the Dataset equality inequality messages better

Open MrPowers opened this issue 4 years ago • 9 comments

Here's the current content inequality message:

Screen Shot 2020-03-31 at 5 33 18 AM

I think it'd be better to align this output. It'd also be better to put "Actual Content | Expected Content" on a newline.

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
[info] Actual Content      | Expected Content
[info] [frank,44,us]       | [frank,44,us]
[info] [li,30,china]       | [li,30,china]
[info] [bob,1,uk]          | [bob,1,france]
[info] [camila,5,peru]     | [camila,5,peru]
[info] [maria,19,colombia] | [maria,19,colombia]

It'd be really nice to suppress all the info warnings, but not sure if that's possible with Scalatest.

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content      | Expected Content
[frank,44,us]       | [frank,44,us]
[li,30,china]       | [li,30,china]
[bob,1,uk]          | [bob,1,france]
[camila,5,peru]     | [camila,5,peru]
[maria,19,colombia] | [maria,19,colombia]

Should we get rid of the square brackets for each row of data too?

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content    | Expected Content
frank,44,us       | frank,44,us
li,30,china       | li,30,china
bob,1,uk          | bob,1,france
camila,5,peru     | camila,5,peru
maria,19,colombia | maria,19,colombia

MrPowers avatar Mar 31 '20 10:03 MrPowers

@carlsverre @nvander1 @gorros - Can you please take a look and provide thoughts on the best error message we can provide users for DataFrame inequality comparisons? Thanks!

MrPowers avatar Mar 31 '20 11:03 MrPowers

See here for the utest output that doesn't have all the info warnings: https://github.com/MrPowers/spark-fast-tests/issues/64

MrPowers avatar Mar 31 '20 11:03 MrPowers

I agree.

gorros avatar Apr 01 '20 11:04 gorros

I like this but I would add spaces between the values:

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content      | Expected Content
frank, 44, us       | frank, 44, us
li, 30, china       | li, 30, china

Also consider outputting strings wrapped with " characters so it's obvious which parts of the row is inside and outside the string. Consider the case that a string is empty or contains a comma

carlsverre avatar Apr 01 '20 22:04 carlsverre

@gorros @carlsverre - Here's a PR to migrate spark-fast-tests back to Scalatest (it's currently using utest): https://github.com/MrPowers/spark-fast-tests/pull/69

I think it'll be easier to develop the optimal Scalatest output if this repo is actually using Scalatest ;)

Let me know your thoughts!

MrPowers avatar Apr 02 '20 11:04 MrPowers

Here's the current DataFrame comparison message:

Screen Shot 2020-04-02 at 5 03 56 PM

Here's the new message (added in this PR):

Screen Shot 2020-04-02 at 5 05 03 PM

@carlsverre @gorros @snithish - can you please take a look and let me know if this output looks better / you have any suggestions. Some specific points to note:

  • I changed the colors for the matching rows from Blue to DarkGray. Do you think that's better? Here's the list of color options.
  • I needed to prepend "Diffs\n" to get the message to output on a newline in Scalatest. "\n" worked for uTest, but not for Scalatest. I also tried hacking in the null character with "\u0000\n", but Scalatest ignored that too. So looks like we need some sort of real character.

MrPowers avatar Apr 03 '20 09:04 MrPowers

Good catch - ScalaTest is almost certainly running trim on the string before printing it which will remove all leading/trailing whitespace. I guess a null byte is also considered part of that... If you can get blue to work I think that's better - dark grey can be set to be very similar to the shell bg color in some colorschemes. This looks good to me though - love the new format!

carlsverre avatar Apr 09 '20 22:04 carlsverre

@MrPowers : Re: brackets around the values, I would recommend keeping those, as it helps to avoid subtle issues around spaces and tabs and such that can affect inequality but be hidden without such delimiters, e.g.

[foo, bar] vs. [foo, bar ]

I agree the alignment is definitely a plus.

I agree with @carlsverre that blue would be better than dark grey in terms of colors for equality.

khampson avatar Apr 24 '20 02:04 khampson

I would love to see something that shows what column values are different. This is especially important for larger data frames that may have 50 columns.

mikenac avatar Jun 02 '22 13:06 mikenac