aim Pre-binned distribution/histogram

🚀 Feature

A user should be able to create an aim.Distribution from histogram data that they have already computed, perhaps via a flag that disables the automatic internal numpy.histogram call.

Motivation

Sometimes the exact histogram is already known, so there is no need to sample it from some random source.

Pitch

counts = [4, 2, 1]
bin_edges = [0, 9, 18, 99]  # or bin_midpoints?
aim.Distribution(counts=counts, bin_edges=bin_edges)

Alternatives

Plotting a typical plot.ly figure instead.

Additional context

N/A

Sep 20 '22 08:09 YodaEmbedding

Hey @YodaEmbedding! Thanks for opening the issue; it seems like a logical extension of the Distribution object's functionality. It would be awesome if you contributed the appropriate enhancement to aim 🙌 Would be happy to share some code pointers.

Sep 20 '22 10:09 alberttorosyan

In terms of the API, does this new __init__ interface seem reasonable?

class Distribution(CustomObject):
    """Distribution object used to store distribution objects in Aim repository.

    Args:
        data (:obj:): Optional array-like object of data sampled from a distribution.
        hist (:obj:): Optional array-like object representing bin frequency counts.
            Must be specified alongside `bin_edges`. `data` must not be specified.
        bin_edges (:obj:): Optional array-like object representing bin edges.
            Must be specified alongside `hist`. `data` must not be specified.
        bin_count (:obj:`int`, optional): Optional distribution bin count for
            binning `data`. 64 by default, max 512.
    """

    def __init__(self, data=None, *, hist=None, bin_edges=None, bin_count=64):
        super().__init__()

        if not isinstance(bin_count, int):
            raise TypeError('`bin_count` must be an integer.')
        if not 1 <= bin_count <= 512:
            raise ValueError('Supported range for `bin_count` is [1, 512].')
        self.storage['bin_count'] = bin_count

        np_histogram = self._to_np_histogram(data, hist, bin_edges, bin_count)
        self._from_np_histogram(np_histogram)

    def _to_np_histogram(self, data, hist, bin_edges, bin_count):
        if data is None:
            if hist is None or bin_edges is None:
                raise ValueError('Both `hist` and `bin_edges` must be specified.')
            return np.asanyarray(hist), np.asanyarray(bin_edges)
        if hist is not None or bin_edges is not None:
            raise ValueError(
                '`hist` and `bin_edges` may not be specified if `data` is.'
            )
        # convert to np.histogram
        try:
            return np.histogram(data, bins=bin_count)
        except TypeError:
            raise TypeError(
                f'Cannot convert to aim.Distribution. Unsupported type {type(data)}.'
            )

Usage:

# Compatible with old interface:
aim.Distribution(sampled_data)

# Supports new usage:
hist, bin_edges = np.histogram(sampled_data)
aim.Distribution(hist=hist, bin_edges=bin_edges)
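
The pre-binned call above only makes sense if `hist` and `bin_edges` are consistent with each other. A minimal sketch of the checks a caller might run first, assuming the `np.histogram` convention of one more edge than there are bins; `check_prebinned` is a hypothetical helper and not part of Aim:

```python
import numpy as np


def check_prebinned(hist, bin_edges):
    """Hypothetical helper: validate pre-binned input before building a Distribution."""
    hist = np.asanyarray(hist)
    bin_edges = np.asanyarray(bin_edges)
    if hist.ndim != 1 or bin_edges.ndim != 1:
        raise ValueError('`hist` and `bin_edges` must be one-dimensional.')
    # np.histogram returns one more edge than it has bins.
    if len(bin_edges) != len(hist) + 1:
        raise ValueError('`bin_edges` must have exactly len(hist) + 1 entries.')
    if np.any(np.diff(bin_edges) <= 0):
        raise ValueError('`bin_edges` must be strictly increasing.')
    return hist, bin_edges


# Output of np.histogram passes the checks.
hist, bin_edges = np.histogram([0.1, 0.4, 0.5, 0.9], bins=4)
check_prebinned(hist, bin_edges)

# So does the hand-written example from the pitch: 3 counts, 4 edges.
check_prebinned([4, 2, 1], [0, 9, 18, 99])
```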

Supporting both data and (hist, bin_edges) makes it look a bit more complicated than, e.g. deprecating data and forcing the user to do np.histogram themselves whenever they need it.

Also, is there a reason behind setting bin_count=512 as the max?

Sep 20 '22 19:09 YodaEmbedding

@YodaEmbedding will take a look and get back soon. Regarding the 512-bin limit: this is mainly for performance reasons. The UI component used for showing distribution data has some rendering issues with a high number of bins.
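
For a pre-binned histogram that exceeds this limit, one option is to coarsen it before logging by summing adjacent bins. The following is only a sketch: `merge_bins` and `MAX_BINS` are hypothetical names, not Aim APIs, and when the bin count is not a multiple of the group size the last merged bin ends up narrower than the others.

```python
import numpy as np

MAX_BINS = 512  # the limit discussed in this thread


def merge_bins(hist, bin_edges, max_bins=MAX_BINS):
    """Hypothetical helper: coarsen a histogram by summing adjacent bins."""
    hist = np.asanyarray(hist)
    bin_edges = np.asanyarray(bin_edges)
    n = len(hist)
    if n <= max_bins:
        return hist, bin_edges
    # Smallest group size that brings the bin count within the limit.
    group = -(-n // max_bins)  # ceiling division
    starts = np.arange(0, n, group)
    # Sum the counts of each run of `group` adjacent bins.
    merged = np.add.reduceat(hist, starts)
    # Keep the left edge of each group plus the final right edge.
    merged_edges = np.append(bin_edges[starts], bin_edges[-1])
    return merged, merged_edges


samples = np.random.default_rng(0).normal(size=10_000)
hist, bin_edges = np.histogram(samples, bins=2048)
small_hist, small_edges = merge_bins(hist, bin_edges)
assert len(small_hist) <= MAX_BINS
assert small_hist.sum() == hist.sum()  # total count is preserved
```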

Sep 21 '22 07:09 alberttorosyan

@YodaEmbedding, regarding this:

Supporting both data and (hist, bin_edges) makes it look a bit more complicated than, e.g. deprecating data and forcing the user to do np.histogram themselves whenever they need it.

What about adding named constructor(s) (a classmethod that creates a Distribution object), which would initialize and set the properties of the newly constructed object? This would allow us to "offload" some of the interface from the __init__ method.

Sep 22 '22 10:09 alberttorosyan

Attempt 2 (much cleaner):

class Distribution(CustomObject):
    """Distribution object used to store distribution objects in Aim repository."""

    def __init__(self, hist, bin_edges):
        super().__init__()
        hist = np.asanyarray(hist)
        bin_edges = np.asanyarray(bin_edges)
        self._from_np_histogram(hist, bin_edges)

    @classmethod
    def from_histogram(cls, hist, bin_edges):
        """Create Distribution object from histogram.

        Args:
            hist (:obj:): Array-like object representing bin frequency counts.
            bin_edges (:obj:): Array-like object representing bin edges.
                Must contain one more element than `hist`.
                Max 512 bins allowed.
        """
        return cls(hist, bin_edges)

    @classmethod
    def from_samples(cls, samples, bin_count=64):
        """Create Distribution object from data samples.

        Args:
            samples (:obj:): Array-like object of data sampled from a distribution.
            bin_count (:obj:`int`, optional): Optional distribution bin count for
                binning `samples`. 64 by default, max 512.
        """

        # These checks can perhaps be handled by np.histogram.
        # if not isinstance(bin_count, int):
        #     raise TypeError("`bin_count` must be an integer.")
        # try:
        #     hist, bin_edges = np.histogram(samples, bins=bin_count)
        # except TypeError:
        #     raise TypeError(f"Cannot create histogram from type {type(samples)}.")

        hist, bin_edges = np.histogram(samples, bins=bin_count)
        return cls(hist, bin_edges)

    def _from_np_histogram(self, hist, bin_edges):
        bin_count = len(bin_edges) - 1
        if not 1 <= bin_count <= 512:
            raise ValueError("Supported range for `bin_count` is [1, 512].")

        # Checks unnecessary due to asanyarray.
        # assert isinstance(hist, np.ndarray)
        # assert isinstance(bin_edges, np.ndarray)

        self.storage["data"] = BLOB(data=hist.tobytes())
        self.storage["dtype"] = str(hist.dtype)
        self.storage["bin_count"] = bin_count
        self.storage["range"] = [bin_edges[0].item(), bin_edges[-1].item()]
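
One thing worth checking in this design: `_from_np_histogram` stores only the counts, their dtype, `bin_count`, and the first and last edge. Assuming those fields are the only record of where the bins sit (how Aim's reader and UI rebuild the edges is not shown in this thread), the edges can only be recovered as evenly spaced values. The sketch below uses a hypothetical `edges_from_storage` helper to show that uniform edges from `np.histogram` survive, while the non-uniform edges from the original pitch come back as 0, 33, 66, 99.

```python
import numpy as np


def edges_from_storage(bin_count, value_range):
    """Hypothetical helper: rebuild edges from `bin_count` and `range` alone.

    This can only produce evenly spaced edges.
    """
    low, high = value_range
    return np.linspace(low, high, bin_count + 1)


# Uniform edges produced by np.histogram survive the round trip.
hist, bin_edges = np.histogram([0.0, 1.0, 2.0, 3.0], bins=3)
rebuilt = edges_from_storage(
    len(bin_edges) - 1, [bin_edges[0].item(), bin_edges[-1].item()]
)
assert np.allclose(rebuilt, bin_edges)

# The non-uniform edges from the pitch do not.
pitch_edges = np.asanyarray([0, 9, 18, 99])
rebuilt = edges_from_storage(
    len(pitch_edges) - 1, [pitch_edges[0].item(), pitch_edges[-1].item()]
)
assert not np.allclose(rebuilt, pitch_edges)
```

If non-uniform bins need to be supported, the stored fields would have to include the full `bin_edges` array, or the constructor could reject edges that are not evenly spaced.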

Sep 22 '22 23:09 YodaEmbedding

@YodaEmbedding looks good! I think the next step would be to open a PR and test the changes to make sure everything works fine (including the UI part).

Sep 27 '22 06:09 alberttorosyan