DataProfiler
`_assimilate_histogram` and `_regenerate_histogram` refactor into standalone functions
Is your feature request related to a problem? Please describe.
The `_assimilate_histogram` and `_regenerate_histogram` functions are not using `self` for anything of substance and as a result can be separated into their own standalone functions.
Describe the outcome you'd like: Move these two functions outside of the class to `histogram_utils.py` (a rough sketch is included below).
Additional context:
I presume this would also involve writing unit tests for `_assimilate_histogram` and `_regenerate_histogram` in `test_histogram_utils.py`? Neither has any direct unit tests at the moment, though they might be called indirectly from other tests.
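For illustration, a minimal sketch of what the standalone function and a first direct unit test could look like. The parameter names mirror the existing method; the import path, test class name, and assertion are assumptions, not the final design:

```python
# histogram_utils.py (sketch): the existing method body moves here unchanged;
# it simply stops taking `self`, which it never used for anything of substance.
def _assimilate_histogram(
    from_hist_entity_count_per_bin,
    from_hist_bin_edges,
    dest_hist_entity_count_per_bin,
    dest_hist_bin_edges,
    dest_hist_num_bin,
):
    ...  # existing logic, copied out of the profiler class


# _regenerate_histogram would be moved out the same way.


# test_histogram_utils.py (sketch): a first direct unit test.
import unittest

import numpy as np

from dataprofiler.profilers import histogram_utils


class TestHistogramUtils(unittest.TestCase):
    def test_assimilate_histogram(self):
        result, loss = histogram_utils._assimilate_histogram(
            from_hist_entity_count_per_bin=np.array([50, 0]),
            from_hist_bin_edges=np.array([0, 5, 10]),
            dest_hist_entity_count_per_bin=np.zeros(10),
            dest_hist_bin_edges=np.arange(11),
            dest_hist_num_bin=10,
        )
        # Re-binning should preserve the total entity count.
        self.assertEqual(50, np.sum(result["bin_counts"]))
```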
Yes, we would definitely want to ensure test coverage for these @junholee6a
Working on this. It'll take a bit to understand the functions and write test cases
In `_assimilate_histogram()`, it seems like each bin in the `from_hist` maps to at most two bins in the `dest_bin`. This means that when `from_hist` has much larger bins than `dest_bin`, its values aren't distributed evenly:
Test code:
```python
import numpy as np

# histogram_utils here refers to the module where _assimilate_histogram
# would live after the refactor.
def test_assimilate_histogram(self):
    ret_value = histogram_utils._assimilate_histogram(
        from_hist_entity_count_per_bin=np.array([50, 0]),
        from_hist_bin_edges=np.array([0, 5, 10]),
        dest_hist_entity_count_per_bin=np.zeros(10),
        dest_hist_bin_edges=np.array((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)),
        dest_hist_num_bin=10,
    )
    print('=============== _assimilate_histogram result ===============')
    print(ret_value)
```
Output (reformatted by hand):
```
=============== _assimilate_histogram result ===============
(
    {
        'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
        'bin_counts': array([10., 40., 0., 0., 0., 0., 0., 0., 0., 0.])
    },
    80.0
)
```
Expected output: `bin_counts` should be distributed more evenly?
```
=============== _assimilate_histogram result ===============
(
    {
        'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
        'bin_counts': array([10., 10., 10., 10., 10., 0., 0., 0., 0., 0.])
    },
    some_loss_value
)
```
Is this a bug?
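For reference, the expected output above corresponds to spreading each source bin's count across destination bins in proportion to their overlap, i.e. assuming values are uniform within a source bin. A small standalone sketch of that rule (the `spread_by_overlap` helper is hypothetical, not DataProfiler's implementation):

```python
import numpy as np


def spread_by_overlap(from_counts, from_edges, dest_edges):
    """Distribute each source bin's count over destination bins in
    proportion to overlapping width (uniform-within-bin assumption)."""
    dest_counts = np.zeros(len(dest_edges) - 1)
    for i, count in enumerate(from_counts):
        lo, hi = from_edges[i], from_edges[i + 1]
        width = hi - lo
        if count == 0 or width == 0:
            continue
        for j in range(len(dest_edges) - 1):
            overlap = max(0.0, min(hi, dest_edges[j + 1]) - max(lo, dest_edges[j]))
            dest_counts[j] += count * overlap / width
    return dest_counts


print(spread_by_overlap(np.array([50, 0]), np.array([0, 5, 10]), np.arange(11)))
# -> [10. 10. 10. 10. 10.  0.  0.  0.  0.  0.]
```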
Potentially related issue: https://github.com/capitalone/DataProfiler/issues/838
#838 is related. If we merge the capabilities and then make that merged function a standalone, that would essentially solve both issues, but we separated these out as they accomplish different goals and are hefty tasks on their own for one developer.
Looking into this now
So there are a couple of things with this scenario that we can't necessarily make assumptions about. While I agree that the actual result:
```
{
    'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
    'bin_counts': array([10., 40., 0., 0., 0., 0., 0., 0., 0., 0.])
}
```
is not super intuitive, I also do not like the evenly distributed assumption:
```
{
    'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
    'bin_counts': array([10., 10., 10., 10., 10., 0., 0., 0., 0., 0.])
}
```
because there is a pretty big assumption there as well. In both scenarios we are technically misrepresenting the data, since making the bins smaller forces us to make those assumptions. This is an open question within the team and we are definitely open to suggestions. My personal thought is that whatever we choose, there needs to be a warning about losing accuracy if the bin sizes go from larger to smaller (fewer bins -> more bins).
@taylorfturner @micdavis @JGSweets @junholee6a thoughts?
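One possible shape for such a warning, purely as a sketch (the helper name, the width comparison, and the message text are assumptions rather than an agreed design):

```python
import warnings

import numpy as np


def _warn_if_bins_refined(from_hist_bin_edges, dest_hist_bin_edges):
    """Warn when any source bin is wider than the narrowest destination bin,
    since the re-binning then has to guess how values spread within a bin."""
    if np.max(np.diff(from_hist_bin_edges)) > np.min(np.diff(dest_hist_bin_edges)):
        warnings.warn(
            "Assimilating a histogram into narrower bins; bin counts are "
            "approximated and may lose accuracy.",
            RuntimeWarning,
        )
```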
> because there is a pretty big assumption there as well.
Definitely, this assumes that the values in `from_hist` are evenly distributed within each bin, but I don't think there's a more intuitive alternative without more information about the data represented by the histograms.
> My personal thought is that whatever we choose, there needs to be a warning about losing accuracy if the bin sizes go from larger to smaller (fewer bins -> more bins).
Maybe this is covered by returning a higher `hist_loss` value? Accuracy can be lost when using `_assimilate_histogram` even if the bin sizes don't change, i.e. when the bins are only shifted sideways from `from_hist` to `dest_hist`.
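A tiny worked example of that point, still under the uniform-within-bin assumption (the numbers are illustrative only): the bin width stays the same, yet shifting the edges by half a bin already splits the counts.

```python
import numpy as np

# One source bin [0, 1) holding 10 entities; destination bins of the same
# width but shifted by 0.5.  Exact placement is lost even though no bin
# got narrower.
count, lo, hi = 10, 0.0, 1.0
dest_edges = np.array([-0.5, 0.5, 1.5])
overlaps = np.clip(
    np.minimum(hi, dest_edges[1:]) - np.maximum(lo, dest_edges[:-1]), 0, None
)
print(count * overlaps / (hi - lo))  # -> [5. 5.]
```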
After thinking about it, I may be leaning more towards your suggested expected outcome. I think you're right that there is not a more intuitive approach without some other significant details being shared. If @taylorfturner and @micdavis are in agreement, I can throw a bug report up to implement that fix instead of its current state, which IMO is worse than assuming the even distribution.
Thank you @junholee6a for your input. This call out is VERY much appreciated and I am hopeful this will be looked into as time provides!!!