
Feature: Jenks Natural Breaks

Open • gfinch opened this issue 10 years ago • 1 comment

It would be great if GeoTrellis would support Jenks Natural Breaks in addition to Quantile Breaks.

Jenks Natural Breaks "[seeks] to minimize each class’s average deviation from the class mean, while maximizing each class’s deviation from the means of the other groups." (Wikipedia).

You can visualize the difference between Jenks Natural Breaks and Quantile Breaks here: http://bl.ocks.org/tmcw/4969184

Read more about Jenks Natural Breaks on Wikipedia: https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization

Here is a JavaScript implementation: https://github.com/simple-statistics/simple-statistics/blob/v0.9.2/src/simple_statistics.js#L854 (see the jenks, jenksMatrices, and jenksBreaks functions)
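
To make the quoted criterion concrete, here is a minimal Scala sketch. It assumes the common formulation in which "deviation from the class mean" is measured as the sum of squared deviations within each class, and it finds the best split by exhaustive search, which is only practical for tiny inputs. The object and method names (`JenksObjective`, `ssd`, `cost`, `bestCuts`) are hypothetical and are not part of GeoTrellis or simple-statistics.

```scala
object JenksObjective {
  // Sum of squared deviations of one class from its own mean.
  def ssd(xs: Seq[Double]): Double =
    if (xs.isEmpty) 0.0
    else {
      val mean = xs.sum / xs.size
      xs.map(x => (x - mean) * (x - mean)).sum
    }

  // Total within-class SSD when `sorted` is cut at the given indices.
  // Each cut index is the position where a new class starts.
  def cost(sorted: Seq[Double], cuts: Seq[Int]): Double = {
    val bounds = (0 +: cuts) :+ sorted.size
    bounds.sliding(2).map { case Seq(from, end) => ssd(sorted.slice(from, end)) }.sum
  }

  // Try every way of cutting the sorted values into k non-empty classes
  // and keep the cut positions with the lowest total within-class SSD.
  def bestCuts(values: Seq[Double], k: Int): Seq[Int] = {
    val sorted = values.sorted
    (1 until sorted.size).combinations(k - 1).minBy(c => cost(sorted, c)).toSeq
  }
}
```

For the values 1, 2, 3, 10, 11, 12, 50, 51 and k = 3, `bestCuts` returns the cut positions 3 and 6, that is, the classes {1, 2, 3}, {10, 11, 12}, and {50, 51}. A real implementation would replace the exhaustive search with dynamic programming, as the linked JavaScript code does.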

gfinch, Nov 16 '15 17:11

Thanks @gfinch, this is great stuff. And since there's a well-licensed open source example of it in JavaScript that we can work from, I don't think it should be too hard to implement (cue famous last words quote).

This is the algorithm that we use to calculate quantile breaks: https://github.com/geotrellis/geotrellis/blob/master/raster/src/main/scala/geotrellis/raster/histogram/MutableHistogram.scala#L53

It lives in the Histogram object, which you can create from a Tile (which is a single array-backed Raster). For example, to read in a GeoTIFF and get the quantile breaks:

import geotrellis.raster._
import geotrellis.raster.io.geotiff._
import geotrellis.raster.op.stats._

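// Placeholder path; point this at a real single-band GeoTIFF.
val path = "path/to/raster.tif"

// Read the GeoTIFF and take its tile, then build a histogram of the cell values.
val tile = SingleBandGeoTiff(path).tile
val histogram = tile.histogram
val breaks: Array[Int] = histogram.getQuantileBreaks(100) // 100 classes.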

(Warning: I didn't compile this, so it probably has syntax errors.)

Creating Jenks breaks from the histogram, rather than from the raw raster values, would probably be a good idea. The algorithm has a sorting step, which would mean copying and sorting every raster value, and that is expensive in both memory and computation. Using a histogram also lets it work better on a distributed set of tiles: map each tile to a histogram, reduce those histograms into one, and create the breaks from the result.
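
A rough sketch of that map/reduce idea, in plain Scala with no GeoTrellis dependency. It assumes a histogram can be modelled as a `Map[Int, Long]` from cell value to count (a stand-in, not the GeoTrellis Histogram API), that every count is positive, and that breaks are reported as the upper bound of each class (a convention chosen for this sketch). The names `HistogramJenks`, `histogram`, `merge`, and `jenksBreaks` are hypothetical. The breaks come from a dynamic program over the distinct values, weighted by count, so the raw cells never need to be copied or sorted.

```scala
object HistogramJenks {
  // Cell value -> count. A stand-in for a real histogram type.
  type Hist = Map[Int, Long]

  // "Map" step: build a histogram from one tile's cell values.
  def histogram(cells: Seq[Int]): Hist =
    cells.groupBy(identity).map { case (v, vs) => v -> vs.size.toLong }

  // "Reduce" step: merge two histograms by summing the counts per value.
  def merge(a: Hist, b: Hist): Hist =
    b.foldLeft(a) { case (acc, (v, c)) => acc.updated(v, acc.getOrElse(v, 0L) + c) }

  // Upper bounds of the k classes that minimise the count-weighted
  // within-class sum of squared deviations (SSD).
  def jenksBreaks(hist: Hist, k: Int): Array[Int] = {
    val bins = hist.toArray.sortBy(_._1) // sorts distinct values only
    val n = bins.length
    require(k >= 1 && k <= n, "need 1 <= k <= number of distinct values")

    // Prefix sums of count, count * value, and count * value^2.
    val w = new Array[Double](n + 1)
    val wx = new Array[Double](n + 1)
    val wxx = new Array[Double](n + 1)
    for (i <- 0 until n) {
      val (v, c) = bins(i)
      w(i + 1) = w(i) + c
      wx(i + 1) = wx(i) + c * v.toDouble
      wxx(i + 1) = wxx(i) + c * v.toDouble * v
    }

    // Weighted SSD of the bins in positions i until j (j exclusive).
    def ssd(i: Int, j: Int): Double = {
      val sw = w(j) - w(i)
      val sx = wx(j) - wx(i)
      (wxx(j) - wxx(i)) - sx * sx / sw
    }

    // cost(c)(j): lowest total SSD for splitting the first j bins into c classes.
    // back(c)(j): where the last of those c classes starts.
    val cost = Array.fill(k + 1, n + 1)(Double.PositiveInfinity)
    val back = Array.ofDim[Int](k + 1, n + 1)
    cost(0)(0) = 0.0
    for (c <- 1 to k; j <- c to n; i <- (c - 1) until j) {
      val candidate = cost(c - 1)(i) + ssd(i, j)
      if (candidate < cost(c)(j)) { cost(c)(j) = candidate; back(c)(j) = i }
    }

    // Walk the back-pointers from the last class to the first.
    val breaks = new Array[Int](k)
    var j = n
    for (c <- k to 1 by -1) {
      breaks(c - 1) = bins(j - 1)._1
      j = back(c)(j)
    }
    breaks
  }
}
```

With three small "tiles" holding the cells (1, 2, 3, 10), (11, 12, 50, 51), and (2, 11, 50), `tiles.map(HistogramJenks.histogram).reduce(HistogramJenks.merge)` gives one combined histogram, and `jenksBreaks` on it with k = 3 returns 3, 12, 51. The dynamic program is quadratic in the number of distinct values per class, so it scales with the histogram size rather than with the number of cells.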

lossyrob, Nov 16 '15 18:11