Spark.TableStatsExample

sumLong bug in ColumnStats.scala and TestTableStatsSinglePathMain.scala

Open BrentDorsey opened this issue 8 years ago • 0 comments

Thanks for sharing, this performs significantly better than what I was using! While validating the getFirstPassStat statistics on our data I discovered a sumLong bug in ColumnStats.scala Part B.1.1.

ColumnStats.scala - Because the sumLong calculation happens after the reduce, it returns the sum of the unique values in the column instead of the sum of all the values in the column. The fix is simply to multiply each unique column value by the number of times that value appears in the partition.

Bug: sumLong += colLongValue
Fix: sumLong += (colLongValue * colCount)
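To make the difference concrete, here is a minimal standalone sketch, not the actual ColumnStats.scala code. It assumes that after the reduce each column value arrives as a (value, count) pair; the object name `SumLongSketch`, the helper names, and the `ageCounts` data are hypothetical, with the counts chosen to mirror the age example from the failing test in this issue.

```scala
object SumLongSketch {
  // Hypothetical stand-in for the reduced data: (distinct value, occurrences).
  // Mirrors the age column in the sample data: 20 appears four times, 10 and 8 once.
  val ageCounts: Seq[(Long, Long)] = Seq((20L, 4L), (10L, 1L), (8L, 1L))

  // Buggy variant: adds each distinct value once and ignores its count.
  def buggySum(pairs: Seq[(Long, Long)]): Long =
    pairs.foldLeft(0L) { case (acc, (value, _)) => acc + value }

  // Fixed variant: weights each distinct value by how often it occurred.
  def fixedSum(pairs: Seq[(Long, Long)]): Long =
    pairs.foldLeft(0L) { case (acc, (value, count)) => acc + value * count }
}
```

With this data, `SumLongSketch.buggySum(SumLongSketch.ageCounts)` yields 38 and `SumLongSketch.fixedSum(SumLongSketch.ageCounts)` yields 98.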

The following else if adds support for Double:

else if (colValue.isInstanceOf[Double]) {
  val colDoubleValue = colValue.asInstanceOf[Double]
  if (maxDouble < colDoubleValue) maxDouble = colDoubleValue
  if (minDouble > colDoubleValue) minDouble = colDoubleValue
  sumDouble += (colDoubleValue * colCount)
}
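The same count-weighted logic can be tried in isolation with the sketch below. It is only an illustration under the assumption that values arrive as (value, count) pairs; `DoubleStatsSketch`, `DoubleStats`, and `accumulate` are hypothetical names and are not part of ColumnStats.scala.

```scala
object DoubleStatsSketch {
  // Hypothetical container for the three Double statistics.
  final case class DoubleStats(min: Double, max: Double, sum: Double)

  // Folds (distinct value, occurrences) pairs into min, max and a weighted sum.
  def accumulate(pairs: Seq[(Double, Long)]): DoubleStats = {
    // Start from infinities so the first value always replaces min and max.
    val empty = DoubleStats(Double.PositiveInfinity, Double.NegativeInfinity, 0.0)
    pairs.foldLeft(empty) { case (stats, (value, count)) =>
      DoubleStats(
        math.min(stats.min, value),
        math.max(stats.max, value),
        stats.sum + value * count // weight by count, as in the sumLong fix
      )
    }
  }
}
```

For example, `DoubleStatsSketch.accumulate(Seq((1.5, 2L), (4.0, 1L)))` gives a minimum of 1.5, a maximum of 4.0, and a sum of 7.0.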

TestTableStatsSinglePathMain.scala - Because all the id values are unique, the existing sumLong assertion doesn't catch the bug. Adding the following sumLong assertion for age:

assertResult(98L)(firstPassStats.columnStatsMap(2).sumLong)

Fails the test returning:

  • run table stats on sample data *** FAILED *** Expected 98, but got 38

98 = 20 + 20 + 20 + 20 + 10 + 8
38 = 20 + 10 + 8

BrentDorsey, Mar 05 '16 05:03