
Investigate refactor of histogram_to_array for better accuracy

Open JGSweets opened this issue 3 years ago • 2 comments

Currently, the _histogram_to_array function does not use the bin midpoints when it recreates the original dataset from the stored histogram. Investigate how accurate the current _histogram_to_array is compared with the following version, which uses the midpoints:

def _histogram_to_array(self):
    # Extend histogram to array format
    bin_counts = self._stored_histogram['histogram']['bin_counts']
    bin_edges = self._stored_histogram['histogram']['bin_edges']
    is_bin_non_zero = bin_counts > 0
    bin_midpoints = (bin_edges[1:][is_bin_non_zero]
                     + bin_edges[:-1][is_bin_non_zero]) / 2
    hist_to_array = [
        [midpoint] * count for midpoint, count
        in zip(bin_midpoints, bin_counts[is_bin_non_zero])
    ]
    array_flatten = np.concatenate(hist_to_array)

    # the min/max must be preserved
    array_flatten[0] = bin_edges[0]
    array_flatten[-1] = bin_edges[-1]

    # If we know they are integers, we can limit the data to be as such
    # during conversion
    if self.__class__.__name__ != 'FloatColumn':
        array_flatten = np.round(array_flatten)

    return array_flatten
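
One possible way to run the comparison is sketched below. It is a standalone experiment, not DataProfiler code: the helper names (midpoint_reconstruction, left_edge_reconstruction, mean_abs_error) are hypothetical, the left-edge baseline is only an assumed stand-in for a reconstruction that does not use midpoints, and the min/max preservation and integer rounding steps from the snippet above are left out to keep it short.

```python
import numpy as np


def midpoint_reconstruction(bin_counts, bin_edges):
    """Repeat each bin's midpoint once per value counted in that bin."""
    bin_counts = np.asarray(bin_counts, dtype=int)
    bin_edges = np.asarray(bin_edges, dtype=float)
    midpoints = (bin_edges[1:] + bin_edges[:-1]) / 2
    # np.repeat emits nothing for empty bins, so no non-zero mask is needed
    return np.repeat(midpoints, bin_counts)


def left_edge_reconstruction(bin_counts, bin_edges):
    """Assumed baseline: repeat each bin's left edge instead of its midpoint."""
    bin_counts = np.asarray(bin_counts, dtype=int)
    bin_edges = np.asarray(bin_edges, dtype=float)
    return np.repeat(bin_edges[:-1], bin_counts)


def mean_abs_error(original, reconstructed):
    """Mean absolute difference between the two arrays after sorting both."""
    return float(np.mean(np.abs(np.sort(original) - np.sort(reconstructed))))


# Synthetic data, used only for illustration
rng = np.random.default_rng(0)
data = rng.normal(size=10_000)
counts, edges = np.histogram(data, bins=20)

midpoint_error = mean_abs_error(data, midpoint_reconstruction(counts, edges))
left_edge_error = mean_abs_error(data, left_edge_reconstruction(counts, edges))
print(f"midpoint: {midpoint_error:.4f}  left edge: {left_edge_error:.4f}")
```

With the midpoint version, every reconstructed value is at most half a bin width away from the value it replaces. How much that helps on real columns, especially skewed or integer ones, would still need to be measured.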

JGSweets avatar Apr 20 '21 15:04 JGSweets

@JGSweets is this still something we should look into?

lettergram avatar Aug 31 '21 20:08 lettergram

It might improve its accuracy, but TBH unsure of the overall results.

JGSweets avatar Sep 01 '21 13:09 JGSweets