Implement a histogram class

Open ukoethe opened this issue 14 years ago • 5 comments

Besides the usual functionality, it should also support overlapping bins (i.e. sampling prefilters that are better than the simple box function of a naive histogram class).

ukoethe, Jul 21 '11 10:07

Oh, yeah, great!

Let's make clearer what one conceptually expects from this:

  • Methods for configuring and querying the mapping from domain values to histogram indices and vice versa (for bin edges and centers), e.g. binWidth(), binCount(), ... Note the past discussions (and behavior changes) in numpy w.r.t. the outermost bins: some people expect out-of-range values to be ignored, while others want them to be collected in these bins.
  • Methods for adding values, potentially weighted (or "unchecked", i.e. w/o bounds checking)
  • Methods for "reading" the histogram, i.e. array-like access
  • Convenience methods for computing histograms, i.e. for collecting values from images, with optional masking (some people seem to want an optional special "background value" for masking - looks esoteric to me though)
  • Normalizing (into distributions, taking bin sizes into account)
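To make the list above concrete, here is a minimal sketch of such an interface. It is not existing VIGRA API: the class name `Histogram1D`, the `OutOfRange` policy enum, and all method names except binWidth() and binCount() are hypothetical, and it assumes equal-width bins over a half-open range [min, max).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical policy for values outside [min, max): drop them, or count
// them in the outermost bins (the two behaviors discussed for numpy).
enum class OutOfRange { Ignore, ClampToOuterBins };

// Hypothetical sketch, not VIGRA API: equal-width bins over [min, max).
class Histogram1D {
public:
    Histogram1D(double min, double max, std::size_t binCount,
                OutOfRange policy = OutOfRange::Ignore)
        : min_(min), max_(max), policy_(policy), bins_(binCount, 0.0)
    {
        assert(max > min && binCount > 0);
    }

    // Mapping between domain values and bin indices.
    std::size_t binCount() const { return bins_.size(); }
    double binWidth() const { return (max_ - min_) / static_cast<double>(bins_.size()); }
    double binLeftEdge(std::size_t i) const { return min_ + static_cast<double>(i) * binWidth(); }
    double binCenter(std::size_t i) const { return min_ + (static_cast<double>(i) + 0.5) * binWidth(); }

    // Checked insertion: NaN is always ignored, out-of-range values follow the policy.
    void add(double value, double weight = 1.0)
    {
        if (std::isnan(value)) return;
        if (value < min_ || value >= max_) {
            if (policy_ == OutOfRange::Ignore) return;
            bins_[value < min_ ? 0 : bins_.size() - 1] += weight;
            return;
        }
        addUnchecked(value, weight);
    }

    // Unchecked insertion: the caller guarantees min <= value < max.
    void addUnchecked(double value, double weight = 1.0)
    {
        std::size_t i = static_cast<std::size_t>((value - min_) / binWidth());
        if (i >= bins_.size()) i = bins_.size() - 1;  // guards against rounding at the upper edge
        bins_[i] += weight;
    }

    // Array-like read/write access to the bin contents.
    double operator[](std::size_t i) const { return bins_[i]; }
    double& operator[](std::size_t i) { return bins_[i]; }

    // Normalization into a density: sum(density[i] * binWidth()) == 1.
    std::vector<double> density() const
    {
        double total = 0.0;
        for (double c : bins_) total += c;
        std::vector<double> d(bins_.size(), 0.0);
        if (total > 0.0)
            for (std::size_t i = 0; i < bins_.size(); ++i)
                d[i] = bins_[i] / (total * binWidth());
        return d;
    }

private:
    double min_, max_;
    OutOfRange policy_;
    std::vector<double> bins_;
};
```

Making the out-of-range behavior an explicit constructor argument is only one option; it at least avoids silently picking one of the two conventions.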

Those are the minimum methods AFAICS. Now it becomes more interesting; maybe we need different (sub-)classes for the following:

  • Joint, multi-dimensional (in particular, 2D) histograms
  • "Overlapping bins" as you called them, i.e. encoding / decoding functions for writing / reading values
  • Non-equal-sized bins (irregular bin edge positions)?
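As an illustration of the "overlapping bins" idea, the following sketch spreads each value over the two nearest bin centers with a triangular (linear interpolation) kernel. The kernel choice, the class name `SoftHistogram`, and the decoding by weighted mean are assumptions made for this example; other kernels and decoders are equally possible.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: bin centers sit at firstCenter + i * spacing.
// Encoding distributes each weight linearly between the two neighboring centers.
class SoftHistogram {
public:
    SoftHistogram(double firstCenter, double spacing, std::size_t binCount)
        : first_(firstCenter), spacing_(spacing), bins_(binCount, 0.0)
    {
        assert(spacing > 0.0 && binCount >= 2);
    }

    // Encoding: values outside [first center, last center] (and NaN) are ignored.
    void add(double value, double weight = 1.0)
    {
        const double last = static_cast<double>(bins_.size() - 1);
        const double pos = (value - first_) / spacing_;  // continuous bin coordinate
        if (!(pos >= 0.0 && pos <= last)) return;
        const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
        if (lo == bins_.size() - 1) { bins_[lo] += weight; return; }
        const double frac = pos - static_cast<double>(lo);
        bins_[lo]     += weight * (1.0 - frac);
        bins_[lo + 1] += weight * frac;
    }

    // Decoding: weighted mean of the bin centers. With the triangular kernel
    // this reproduces the weighted mean of all encoded in-range values.
    double decodeMean() const
    {
        double total = 0.0, moment = 0.0;
        for (std::size_t i = 0; i < bins_.size(); ++i) {
            total  += bins_[i];
            moment += bins_[i] * (first_ + static_cast<double>(i) * spacing_);
        }
        assert(total > 0.0);
        return moment / total;
    }

    double operator[](std::size_t i) const { return bins_[i]; }
    std::size_t binCount() const { return bins_.size(); }

private:
    double first_, spacing_;
    std::vector<double> bins_;
};
```

With centers at 0, 1, 2, 3, 4, a single value of 2.25 puts 0.75 into bin 2 and 0.25 into bin 3, so decoding recovers 2.25, whereas a box histogram would only retain "somewhere in bin 2".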

Other stuff supported by the histogram class in MeVisLab, which is however ill-designed:

  • Histogram analysis
    • Peak detection (single or multi-mode)
    • Quantile / FWHM / mean / stddev estimation
    • Entropy computation
  • Correlation
  • Smoothing using different diffusion models
  • Scaling

IMO, these do not belong in the histogram class itself, but maybe a histogram should allow random r/w access to its array data, which would make it easy to implement them as helper functions.
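A sketch of what such helper functions could look like, assuming only that the histogram offers size() and operator[] like an array (here exercised with std::vector<double>). The function names are hypothetical and the quantile is reported at bin resolution only:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum of all bin contents.
template <class Bins>
double histogramTotal(const Bins& bins)
{
    double total = 0.0;
    for (std::size_t i = 0; i < bins.size(); ++i) total += bins[i];
    return total;
}

// Shannon entropy (in bits) of the normalized bin contents.
template <class Bins>
double histogramEntropy(const Bins& bins)
{
    const double total = histogramTotal(bins);
    assert(total > 0.0);
    double h = 0.0;
    for (std::size_t i = 0; i < bins.size(); ++i) {
        if (bins[i] > 0.0) {
            const double p = bins[i] / total;
            h -= p * std::log2(p);
        }
    }
    return h;
}

// Index of the first bin at which the cumulative count reaches q * total.
template <class Bins>
std::size_t quantileBin(const Bins& bins, double q)
{
    assert(q >= 0.0 && q <= 1.0);
    const double total = histogramTotal(bins);
    assert(total > 0.0);
    double cumulative = 0.0;
    for (std::size_t i = 0; i < bins.size(); ++i) {
        cumulative += bins[i];
        if (cumulative >= q * total) return i;
    }
    return bins.size() - 1;
}

// Mean estimated from bin centers at firstCenter + i * spacing.
template <class Bins>
double histogramMean(const Bins& bins, double firstCenter, double spacing)
{
    const double total = histogramTotal(bins);
    assert(total > 0.0);
    double moment = 0.0;
    for (std::size_t i = 0; i < bins.size(); ++i)
        moment += bins[i] * (firstCenter + static_cast<double>(i) * spacing);
    return moment / total;
}
```

Because the helpers are templates over the container, they would work unchanged on any histogram class that exposes array-like access.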

hmeine, Jul 21 '11 12:07

As I wrote in issue #45, I just discovered that the boost::accumulators library also contains a histogram functor. (boost::accumulators offers nice, well-designed, numerically stable functors for deriving various statistics in a flexible, yet efficient manner.) Anyhow, it is definitely worth a look, even if it does not seem to solve our API problem.

(I wonder if there is another histogram class in boost or similar libraries.)

The histogram feature is hidden behind the term "density": http://www.boost.org/doc/libs/1_47_0/doc/html/boost/accumulators/tag/density.html

There has been some discussion in 2008 about a simpler way to specify min/max/binCount, but I am not sure if that was ever committed (I don't see it in the docs): http://lists.boost.org/Archives/boost/2008/01/132789.php + follow-ups

hmeine, Jul 25 '11 13:07

The boost density class is quite nice. However, there are two drawbacks:

  • In image analysis, the data do not usually arrive in random order (e.g. you may find a lot of "sky" pixels at the beginning of the scan). Estimating min and max from the first n data elements is therefore unlikely to give good range estimates.
  • The functionality of overlapping bins (or, more generally, discretization kernels) is crucial for the subsequent implementation of a channel representation.
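The first drawback can be illustrated with a small synthetic experiment. The sketch below is a simplified stand-in for a cache-based range estimate, not Boost's actual implementation: it takes min and max of the first n samples of a scan-ordered signal and counts how many samples fall outside that range. The "sky then ground" data and all function names are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Range estimate from the first n samples only (simplified stand-in,
// not the Boost.Accumulators algorithm).
std::pair<double, double> rangeFromPrefix(const std::vector<double>& samples, std::size_t n)
{
    assert(n > 0 && n <= samples.size());
    auto mm = std::minmax_element(samples.begin(),
                                  samples.begin() + static_cast<std::ptrdiff_t>(n));
    return std::make_pair(*mm.first, *mm.second);
}

// Number of samples that lie outside the closed range [range.first, range.second].
std::size_t countOutside(const std::vector<double>& samples, std::pair<double, double> range)
{
    return static_cast<std::size_t>(
        std::count_if(samples.begin(), samples.end(), [&](double v) {
            return v < range.first || v > range.second;
        }));
}

// Synthetic scan-ordered "image": bright sky pixels first, dark ground pixels after.
std::vector<double> makeSkyThenGround(std::size_t pixelsPerRegion)
{
    std::vector<double> s;
    s.reserve(2 * pixelsPerRegion);
    for (std::size_t i = 0; i < pixelsPerRegion; ++i)
        s.push_back(200.0 + static_cast<double>(i % 56));   // sky: 200..255
    for (std::size_t i = 0; i < pixelsPerRegion; ++i)
        s.push_back(static_cast<double>(i % 101));          // ground: 0..100
    return s;
}
```

With 1000 pixels per region and n = 100, the estimated range is [200, 255] and every ground pixel falls outside it, which is why a range supplied by the caller (or computed in a separate pass) seems preferable for image data.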

ukoethe, Jul 25 '11 16:07

I just found a new requirement: It makes sense to offer functionality for increasing / decreasing the range of existing histograms (while retaining bin contents). I assume that we only want this without resampling, although I can picture other people even wanting the latter. Actually, if we assume that our histogram allows array-like access not only for reading, resampling could easily be done with external functions. The same goes for smoothing and the like, which is why I think an is-an-array approach would be promising.
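A minimal sketch of range extension without resampling, assuming equal-width bins: the range only ever grows by whole bins, so existing bin contents and edges stay untouched. The struct `RangedBins` and the function `extendRange` are hypothetical names for this example, and shrinking is left out because it would have to discard or merge contents.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical array-like histogram storage: equal-width bins starting at min.
struct RangedBins {
    double min;
    double binWidth;
    std::vector<double> counts;

    double max() const { return min + binWidth * static_cast<double>(counts.size()); }
};

// Grow the histogram so that it spans at least [newMin, newMax] by prepending and
// appending whole empty bins. Bin width and existing contents are left unchanged.
void extendRange(RangedBins& h, double newMin, double newMax)
{
    assert(h.binWidth > 0.0 && newMin <= newMax);
    std::size_t below = 0, above = 0;
    if (newMin < h.min)
        below = static_cast<std::size_t>(std::ceil((h.min - newMin) / h.binWidth));
    if (newMax > h.max())
        above = static_cast<std::size_t>(std::ceil((newMax - h.max()) / h.binWidth));
    h.counts.insert(h.counts.begin(), below, 0.0);
    h.counts.insert(h.counts.end(), above, 0.0);
    h.min -= static_cast<double>(below) * h.binWidth;
}
```

Since the new range is rounded outwards to whole bins, the resulting range may be slightly larger than requested; that is the price for not resampling.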

Update: @ukoethe’s comment on "Estimating min and max from the first n data elements …" refers to boost::accumulators’ implementation of the histogram, density_impl, which is only suitable for i.i.d. samples: "The positions and sizes of the bins are determined using a specifiable number of cached samples…"

hmeine, Jan 20 '12 11:01

Support for a major histogram use case has been implemented in the Feature Accumulator framework (histograms and quantiles over the intensities in labeled regions). Histograms over local windows, as in a channel representation, remain open (cf. issue #39).

ukoethe, Nov 06 '12 09:11