Spark.TableStatsExample

A simple Spark example of generating table stats for use in data quality checks

5 Spark.TableStatsExample issues

…ubsequent updates When you insert the maxSize(th) value for the first time, update the lowest count and add the element as well. When modifying the TopNList, just perform in-place updates...
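The update rule described in this issue can be illustrated with a minimal Scala sketch. It is not the repository's `TopNList`: the class name `TopNSketch`, its `add` and `top` methods, and the use of `String` keys are all assumptions made for illustration. For brevity it recomputes the lowest count on demand rather than caching it, so it shows the intended behavior, not the suggested optimization.

```scala
import scala.collection.mutable

// Hypothetical sketch, not the repository's TopNList.
// Keeps at most maxSize (value, count) entries with the highest counts.
final class TopNSketch(maxSize: Int) {
  require(maxSize > 0, "maxSize must be positive")

  // value -> count for the entries currently retained
  private val counts = mutable.Map.empty[String, Long]

  def add(value: String, count: Long): Unit = {
    if (counts.contains(value)) {
      // Already tracked: update the count in place.
      counts(value) += count
    } else if (counts.size < maxSize) {
      // Still room: add the new entry.
      counts(value) = count
    } else {
      // Full: evict the lowest-count entry only if the newcomer beats it.
      val (minKey, minCount) = counts.minBy(_._2)
      if (count > minCount) {
        counts -= minKey
        counts(value) = count
      }
    }
  }

  // Retained entries, highest count first (ties broken by value).
  def top: Seq[(String, Long)] =
    counts.toSeq.sortBy { case (v, c) => (-c, v) }
}
```

With `maxSize = 2`, adding `("a", 5)`, `("b", 1)` and then `("c", 3)` evicts `"b"`, because it holds the lowest count at the moment the list is full.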

The min long calculation originally took the max of the min values instead of the min.
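To make the fix concrete, here is a small sketch of merging partial min/max results, as happens when per-partition statistics are combined. `LongRange`, `observe`, and `merge` are hypothetical names used only for this example and are much simpler than the project's `ColumnStats`.

```scala
// Hypothetical, reduced stand-in for per-column stats: only min and max.
final case class LongRange(min: Long = Long.MaxValue, max: Long = Long.MinValue) {

  // Fold a single observed value into the range.
  def observe(v: Long): LongRange =
    LongRange(math.min(min, v), math.max(max, v))

  // Combine two partial results, e.g. from different partitions.
  // The overall minimum is the min of the two minimums; taking
  // math.max here is the kind of bug the issue describes.
  def merge(other: LongRange): LongRange =
    LongRange(math.min(min, other.min), math.max(max, other.max))
}
```

For example, merging a partition that saw 7, 3, 9 with one that saw 5, 12 should give a minimum of 3; taking the max of the two minimums would report 5 instead.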

The original only added the first N key-value pairs encountered.

Thanks for sharing, this performs significantly better than what I was using! While validating the getFirstPassStat statistics on our data I discovered a sumLong bug in ColumnStats.scala [Part B.1.1](http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/comment-page-1/#comment-74803). [ColumnStats.scala](https://github.com/tmalaska/Spark.TableStatsExample/blob/master/src/main/scala/com/cloudera/sa/examples/tablestats/model/ColumnStats.scala)...

Nothing major here, just some suggestions; you don't need to like them all :). I broke it up into a few commits; it might be easier to look at one commit...