Spark.TableStatsExample
Simple Spark example of generating table stats for use in data quality checks
…ubsequent updates When you insert the maxSize-th value for the first time, update the lowest count and add the element as well. When modifying the TopNList, just perform in-place updates...
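A minimal sketch of the in-place update idea, for illustration only. The class name `TopNSketch`, its fields, and the choice to sum counts for a value that is already tracked are all assumptions and are not taken from the repository's TopNList.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch, not the repository's TopNList.
// Keeps at most maxSize (value, count) pairs with the largest counts.
final class TopNSketch(maxSize: Int) {
  require(maxSize > 0, "maxSize must be positive")

  private val entries = ArrayBuffer.empty[(String, Long)]

  def add(value: String, count: Long): Unit = {
    val existing = entries.indexWhere(_._1 == value)
    if (existing >= 0) {
      // Value already tracked: update its slot in place
      // (assumption: counts for the same value are summed).
      val (v, c) = entries(existing)
      entries(existing) = (v, c + count)
    } else if (entries.size < maxSize) {
      // Still room: append the new pair.
      entries += ((value, count))
    } else {
      // List is full: overwrite the lowest-count slot in place,
      // but only if the new count is larger.
      val minIdx = entries.indices.minBy(i => entries(i)._2)
      if (count > entries(minIdx)._2) entries(minIdx) = (value, count)
    }
  }

  // Snapshot of the tracked pairs, highest count first.
  def result: Seq[(String, Long)] = entries.toList.sortBy(-_._2)
}
```

Because a full list replaces its lowest-count slot instead of rejecting new pairs, later values with larger counts are not lost just because they arrived after the first N. The result is exact only when each value arrives once with its total count; an evicted value loses whatever count it had accumulated.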
The min long calculation originally took the max of the min values instead of the min.
The original would only add the first N key-values encountered.
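The min fix can be illustrated with a small merge function. This is a sketch under assumptions: `LongStats` and its fields are hypothetical stand-ins for the numeric fields of a column-stats record and are not copied from ColumnStats.scala.

```scala
// Hypothetical stand-in for the long-valued fields of a column-stats record.
final case class LongStats(minLong: Long, maxLong: Long) {
  // Combine the stats of two partitions.
  def merge(other: LongStats): LongStats = LongStats(
    // The merged minimum is the smaller of the two minimums;
    // using math.max here would reproduce the bug described above.
    minLong = math.min(minLong, other.minLong),
    maxLong = math.max(maxLong, other.maxLong)
  )
}
```

For example, merging `LongStats(3L, 10L)` with `LongStats(1L, 7L)` should give `LongStats(1L, 10L)`, whereas taking the max of the mins would report a minimum of 3.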
Thanks for sharing, this performs significantly better than what I was using! While validating the getFirstPassStat statistics on our data I discovered a sumLong bug in ColumnStats.scala [Part B.1.1](http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/comment-page-1/#comment-74803). [ColumnStats.scala](https://github.com/tmalaska/Spark.TableStatsExample/blob/master/src/main/scala/com/cloudera/sa/examples/tablestats/model/ColumnStats.scala)...
nothing major here. just some suggestions, you don't need to like them all :). I broke it up into a few commits, might be easier to look at one commit...