spark-fast-tests Make the Dataset equality inequality messages better

Here's the current content inequality message:

I think it'd be better to align this output. It'd also be better to put "Actual Content | Expected Content" on a newline.

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
[info] Actual Content      | Expected Content
[info] [frank,44,us]       | [frank,44,us]
[info] [li,30,china]       | [li,30,china]
[info] [bob,1,uk]          | [bob,1,france]
[info] [camila,5,peru]     | [camila,5,peru]
[info] [maria,19,colombia] | [maria,19,colombia]

It'd be really nice to suppress all the info warnings, but not sure if that's possible with Scalatest.

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content      | Expected Content
[frank,44,us]       | [frank,44,us]
[li,30,china]       | [li,30,china]
[bob,1,uk]          | [bob,1,france]
[camila,5,peru]     | [camila,5,peru]
[maria,19,colombia] | [maria,19,colombia]

Should we get rid of the square brackets for each row of data too?

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content    | Expected Content
frank,44,us       | frank,44,us
li,30,china       | li,30,china
bob,1,uk          | bob,1,france
camila,5,peru     | camila,5,peru
maria,19,colombia | maria,19,colombia

Mar 31 '20 10:03 MrPowers

@carlsverre @nvander1 @gorros - Can you please take a look and provide thoughts on the best error message we can provide users for DataFrame inequality comparisons? Thanks!

Mar 31 '20 11:03 MrPowers

See here for the utest output that doesn't have all the info warnings: https://github.com/MrPowers/spark-fast-tests/issues/64

Mar 31 '20 11:03 MrPowers

I agree.

Apr 01 '20 11:04 gorros

I like this but I would add spaces between the values:

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content      | Expected Content
frank, 44, us       | frank, 44, us
li, 30, china       | li, 30, china

Also consider outputting strings wrapped with " characters so it's obvious which parts of the row is inside and outside the string. Consider the case that a string is empty or contains a comma

Apr 01 '20 22:04 carlsverre

@gorros @carlsverre - Here's a PR to migrate spark-fast-tests back to Scalatest (it's currently using utest): https://github.com/MrPowers/spark-fast-tests/pull/69

I think it'll be easier to develop the optimal Scalatest output if this repo is actually using Scalatest ;)

Let me know your thoughts!

Apr 02 '20 11:04 MrPowers

Here's the current DataFrame comparison message:

Here's the new message (added in this PR):

@carlsverre @gorros @snithish - can you please take a look and let me know if this output looks better / you have any suggestions. Some specific points to note:

I changed the colors for the matching rows from Blue to DarkGray. Do you think that's better? Here's the list of color options.
I needed to prepend "Diffs\n" to get the message to output on a newline in Scalatest. "\n" worked for uTest, but not for Scalatest. I also tried hacking in the null character with "\u0000\n", but Scalatest ignored that too. So looks like we need some sort of real character.

Apr 03 '20 09:04 MrPowers

Good catch - ScalaTest is almost certainly running trim on the string before printing it which will remove all leading/trailing whitespace. I guess a null byte is also considered part of that... If you can get blue to work I think that's better - dark grey can be set to be very similar to the shell bg color in some colorschemes. This looks good to me though - love the new format!

Apr 09 '20 22:04 carlsverre

@MrPowers : Re: brackets around the values, I would recommend keeping those, as it helps to avoid subtle issues around spaces and tabs and such that can affect inequality but be hidden without such delimiters, e.g.

[foo, bar] vs. [foo, bar ]

I agree the alignment is definitely a plus.

I agree with @carlsverre that blue would be better than dark grey in terms of colors for equality.

Apr 24 '20 02:04 khampson

I would love to see something that shows what column values are different. This is especially important for larger data frames that may have 50 columns.

Jun 02 '22 13:06 mikenac

spark-fast-tests spark-fast-tests copied to clipboard

Make the Dataset equality inequality messages better

spark-fast-tests
spark-fast-tests copied to clipboard