spark-fast-tests
spark-fast-tests copied to clipboard
Make the Dataset equality inequality messages better
Here's the current content inequality message:

I think it'd be better to align this output. It'd also be better to put "Actual Content | Expected Content" on a newline.
[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
[info] Actual Content | Expected Content
[info] [frank,44,us] | [frank,44,us]
[info] [li,30,china] | [li,30,china]
[info] [bob,1,uk] | [bob,1,france]
[info] [camila,5,peru] | [camila,5,peru]
[info] [maria,19,colombia] | [maria,19,colombia]
It'd be really nice to suppress all the info warnings, but not sure if that's possible with Scalatest.
[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content | Expected Content
[frank,44,us] | [frank,44,us]
[li,30,china] | [li,30,china]
[bob,1,uk] | [bob,1,france]
[camila,5,peru] | [camila,5,peru]
[maria,19,colombia] | [maria,19,colombia]
Should we get rid of the square brackets for each row of data too?
[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content | Expected Content
frank,44,us | frank,44,us
li,30,china | li,30,china
bob,1,uk | bob,1,france
camila,5,peru | camila,5,peru
maria,19,colombia | maria,19,colombia
@carlsverre @nvander1 @gorros - Can you please take a look and provide thoughts on the best error message we can provide users for DataFrame inequality comparisons? Thanks!
See here for the utest output that doesn't have all the info warnings: https://github.com/MrPowers/spark-fast-tests/issues/64
I agree.
I like this but I would add spaces between the values:
[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content | Expected Content
frank, 44, us | frank, 44, us
li, 30, china | li, 30, china
Also consider outputting strings wrapped with "
characters so it's obvious which parts of the row is inside and outside the string. Consider the case that a string is empty or contains a comma
@gorros @carlsverre - Here's a PR to migrate spark-fast-tests back to Scalatest (it's currently using utest): https://github.com/MrPowers/spark-fast-tests/pull/69
I think it'll be easier to develop the optimal Scalatest output if this repo is actually using Scalatest ;)
Let me know your thoughts!
Here's the current DataFrame comparison message:

Here's the new message (added in this PR):

@carlsverre @gorros @snithish - can you please take a look and let me know if this output looks better / you have any suggestions. Some specific points to note:
- I changed the colors for the matching rows from Blue to DarkGray. Do you think that's better? Here's the list of color options.
- I needed to prepend
"Diffs\n"
to get the message to output on a newline in Scalatest."\n"
worked for uTest, but not for Scalatest. I also tried hacking in the null character with"\u0000\n"
, but Scalatest ignored that too. So looks like we need some sort of real character.
Good catch - ScalaTest is almost certainly running trim on the string before printing it which will remove all leading/trailing whitespace. I guess a null byte is also considered part of that... If you can get blue to work I think that's better - dark grey can be set to be very similar to the shell bg color in some colorschemes. This looks good to me though - love the new format!
@MrPowers : Re: brackets around the values, I would recommend keeping those, as it helps to avoid subtle issues around spaces and tabs and such that can affect inequality but be hidden without such delimiters, e.g.
[foo, bar] vs. [foo, bar ]
I agree the alignment is definitely a plus.
I agree with @carlsverre that blue would be better than dark grey in terms of colors for equality.
I would love to see something that shows what column values are different. This is especially important for larger data frames that may have 50 columns.