spark-fast-tests
Failed assert when using precision and orderedComparison
The issue concerns using precision together with orderedComparison = false. When orderedComparison is false, the library sorts both DataFrames automatically, taking the columns in alphabetical order as sort keys. The first column in that order may be the one compared with "precision" (as in the example below). In that case, approximately equal values can sort differently in the two DataFrames, so the rows are paired incorrectly and the result is not the expected one. Here is an example of what I'm talking about:
it should "test" in {
  import spark.implicits._
  val ds1 = Seq(
    ("1", "10/01/2019", 26.762499999999996),
    ("1", "11/01/2019", 26.762499999999996)
  ).toDF("col_B", "col_C", "col_A")
  val ds2 = Seq(
    ("1", "10/01/2019", 26.762499999999946),
    ("1", "11/01/2019", 26.76249999999991)
  ).toDF("col_B", "col_C", "col_A")
  // precision is a Double tolerance, so seven decimal places is written as 0.0000001
  assertApproximateDataFrameEquality(ds1, ds2, precision = 0.0000001, orderedComparison = false)
}
As you can see, the test should pass: the "col_A" values differ only beyond the seventh decimal place, and the rest of the columns are the same. However, because "col_A" comes first in the sort applied when "orderedComparison = false", the test fails.
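To make the failure mode concrete, here is a minimal sketch in plain Scala (no Spark) that mimics the sort described above. The object name SortKeyDemo and the helper sortAlphabetically are made up for illustration; this is not the library's code, only a model of sorting by col_A, then col_B, then col_C.

```scala
object SortKeyDemo {
  // Each row is (col_B, col_C, col_A), mirroring the example above.
  type Row3 = (String, String, Double)

  val rows1: Seq[Row3] = Seq(
    ("1", "10/01/2019", 26.762499999999996),
    ("1", "11/01/2019", 26.762499999999996)
  )
  val rows2: Seq[Row3] = Seq(
    ("1", "10/01/2019", 26.762499999999946),
    ("1", "11/01/2019", 26.76249999999991)
  )

  // Hypothetical stand-in for the alphabetical sort: col_A first, then col_B, then col_C.
  def sortAlphabetically(rows: Seq[Row3]): Seq[Row3] =
    rows.sortWith { (x, y) =>
      val byA = java.lang.Double.compare(x._3, y._3)
      if (byA != 0) byA < 0
      else if (x._1 != y._1) x._1 < y._1
      else x._2 < y._2
    }

  // Rows are paired by position after sorting, so compare the col_C sequences.
  val datesAligned: Boolean =
    sortAlphabetically(rows1).map(_._2) == sortAlphabetically(rows2).map(_._2)

  def main(args: Array[String]): Unit =
    println(s"dates aligned after sort: $datesAligned") // prints "dates aligned after sort: false"
}
```

In rows1 the two col_A values tie, so col_C decides the order. In rows2 the tiny difference in col_A decides it instead, which swaps the rows and pairs "10/01/2019" with "11/01/2019".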
Some possible fixes could be:
- Apply precision before ordering columns
- Exclude columns that will be affected by precision (double columns) from the sort order
A workaround for this issue is to sort the DataFrames yourself and set "orderedComparison = true".
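The workaround might look roughly like the sketch below. It assumes spark-fast-tests is on the classpath and that a local SparkSession is acceptable; the object name SortedComparisonWorkaround is invented for the example. Both DataFrames are sorted by the exact-match columns only, so "col_A" never influences row order.

```scala
import com.github.mrpowers.spark.fast.tests.DataFrameComparer
import org.apache.spark.sql.SparkSession

object SortedComparisonWorkaround extends DataFrameComparer {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session is fine for this sketch.
    val spark = SparkSession.builder().master("local[1]").appName("workaround").getOrCreate()
    import spark.implicits._

    val ds1 = Seq(
      ("1", "10/01/2019", 26.762499999999996),
      ("1", "11/01/2019", 26.762499999999996)
    ).toDF("col_B", "col_C", "col_A")
    val ds2 = Seq(
      ("1", "10/01/2019", 26.762499999999946),
      ("1", "11/01/2019", 26.76249999999991)
    ).toDF("col_B", "col_C", "col_A")

    // Sort both sides by the exact-match columns, leaving col_A out of the sort key.
    val sorted1 = ds1.orderBy("col_B", "col_C")
    val sorted2 = ds2.orderBy("col_B", "col_C")

    // Rows are now compared in the order we chose ourselves.
    assertApproximateDataFrameEquality(sorted1, sorted2, precision = 0.0000001, orderedComparison = true)

    spark.stop()
  }
}
```

This only helps when the exact-match columns identify each row uniquely; if they tie, the relative order of the tied rows is still not guaranteed.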
@adrixgc - Here's a fix: https://github.com/MrPowers/spark-fast-tests/pull/92
Looks like we can avoid the column ordering. Let me know if this fix looks alright to you!
Thanks for reporting this edge case!
@MrPowers I checked the PR and that solves the current problem. Could you run this test and check if it passes too?
"can work with precision and unordered comparison 2" in {
  import spark.implicits._
  val ds1 = Seq(
    ("1", "10/01/2019", 26.762499999999996, "A"),
    ("1", "10/01/2019", 26.762499999999996, "B")
  ).toDF("col_B", "col_C", "col_A", "col_D")
  val ds2 = Seq(
    ("1", "10/01/2019", 26.762499999999946, "A"),
    ("1", "10/01/2019", 26.76249999999991, "B")
  ).toDF("col_B", "col_C", "col_A", "col_D")
  assertApproximateDataFrameEquality(ds1, ds2, precision = 0.0000001, orderedComparison = false)
}
I think this test will fail, even though it should pass. Let me know!
@adrixgc - you're right, thanks for the additional test case to illustrate the issue.
Pushed up another commit that only sorts by the precise columns (not the float, decimal, or double columns), which should fix the issue.
Can you please take a look and let me know if this fix looks alright to you? Thanks!
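For readers following along, sorting only by the precise columns could be sketched as below. This is not the code from the PR; the object PreciseColumnSort and both helpers are hypothetical names, and the sketch assumes column names without dots or other characters that need escaping.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, DoubleType, FloatType, StructField}

object PreciseColumnSort {
  // Hypothetical helper: true for column types that are compared with a tolerance.
  def isApproximate(field: StructField): Boolean = field.dataType match {
    case FloatType | DoubleType => true
    case _: DecimalType         => true
    case _                      => false
  }

  // Hypothetical helper: sort by every non-approximate column, in alphabetical order.
  def sortByPreciseColumns(df: DataFrame): DataFrame = {
    val keys = df.schema.fields.toSeq
      .filterNot(isApproximate)
      .map(_.name)
      .sorted
      .map(name => col(name))
    // With no precise columns there is nothing safe to sort by, so return the input unchanged.
    if (keys.isEmpty) df else df.sort(keys: _*)
  }
}
```

Applied to the second test case, the sort keys would be col_B, col_C, and col_D, so the "A" and "B" rows pair up correctly regardless of the small differences in col_A.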
Looks good to me! PR approved. Thanks for the quick response!