DataProfiler
Investigate refactor of histogram_to_array for better accuracy
Currently, the _histogram_to_array function does not use the midpoints of the bins to recreate the original dataset.
Investigate the accuracy of the current _histogram_to_array function in comparison to the following version, which uses the midpoints:
import numpy as np

def _histogram_to_array(self):
    # Extend histogram to array format
    bin_counts = self._stored_histogram['histogram']['bin_counts']
    bin_edges = self._stored_histogram['histogram']['bin_edges']
    # Only non-empty bins contribute values to the reconstructed array
    is_bin_non_zero = bin_counts > 0
    bin_midpoints = (bin_edges[1:][is_bin_non_zero]
                     + bin_edges[:-1][is_bin_non_zero]) / 2
    # Repeat each bin's midpoint once per value counted in that bin
    hist_to_array = [
        [midpoint] * count for midpoint, count
        in zip(bin_midpoints, bin_counts[is_bin_non_zero])
    ]
    array_flatten = np.concatenate(hist_to_array)
    # the min/max must be preserved
    array_flatten[0] = bin_edges[0]
    array_flatten[-1] = bin_edges[-1]
    # If we know they are integers, we can limit the data to be as such
    # during conversion
    if self.__class__.__name__ != 'FloatColumn':
        array_flatten = np.round(array_flatten)
    return array_flatten
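One possible way to run this comparison is sketched below. It is a standalone illustration, not DataProfiler code: the helper names (`reconstruct_left_edge`, `reconstruct_midpoint`, `quantile_mae`) are hypothetical, the left-edge baseline is only an assumed stand-in for a non-midpoint reconstruction (it may not match what the current function does), and the mean absolute quantile error on synthetic normal data is just one of several accuracy measures that could be used.

```python
import numpy as np


def reconstruct_left_edge(bin_counts, bin_edges):
    # Assumed baseline (hypothetical): place every value at its bin's left edge.
    return np.repeat(bin_edges[:-1], bin_counts)


def reconstruct_midpoint(bin_counts, bin_edges):
    # Midpoint variant from the proposal: place every value at its bin's midpoint.
    midpoints = (bin_edges[1:] + bin_edges[:-1]) / 2
    arr = np.repeat(midpoints, bin_counts)
    if arr.size:
        # Preserve the min/max, as in the proposed function.
        arr[0] = bin_edges[0]
        arr[-1] = bin_edges[-1]
    return arr


def quantile_mae(original, reconstructed):
    # Mean absolute difference between the two samples' quantiles (5%..95%).
    qs = np.linspace(0.05, 0.95, 19)
    diff = np.quantile(original, qs) - np.quantile(reconstructed, qs)
    return float(np.mean(np.abs(diff)))


rng = np.random.default_rng(0)
data = rng.normal(size=10_000)
bin_counts, bin_edges = np.histogram(data, bins=50)

left_arr = reconstruct_left_edge(bin_counts, bin_edges)
mid_arr = reconstruct_midpoint(bin_counts, bin_edges)

left_err = quantile_mae(data, left_arr)
mid_err = quantile_mae(data, mid_arr)
print(f"left-edge quantile MAE: {left_err:.4f}")
print(f"midpoint  quantile MAE: {mid_err:.4f}")
```

Because both reconstructions keep the per-bin counts, each reconstructed order statistic stays in the same bin as the original one, so the midpoint error is bounded by half a bin width and the left-edge error by a full bin width. Repeating the run with skewed or integer data would show whether the gain holds for the column types the profiler actually sees.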
@JGSweets is this still something we should look into?
It might improve its accuracy, but TBH unsure of the overall results.