`_assimilate_histogram` and `_regenerate_histogram` refactor into standalone functions

Open ksneab7 opened this issue 1 year ago • 11 comments

Is your feature request related to a problem? Please describe. The `_assimilate_histogram` and `_regenerate_histogram` functions do not use `self` for anything of substance, and as a result they can be separated out into standalone functions.

Describe the outcome you'd like: Move these two functions out of the class and into `histogram_utils.py`.

Additional context:
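A minimal sketch of the intended change, assuming the call signature shown in the test snippet later in this thread (the real bodies would be moved verbatim out of the class):

# histogram_utils.py -- sketch only; parameter names are taken from the
# test snippet below, and the return shape from its printed output.
import numpy as np


def _assimilate_histogram(
    from_hist_entity_count_per_bin: np.ndarray,
    from_hist_bin_edges: np.ndarray,
    dest_hist_entity_count_per_bin: np.ndarray,
    dest_hist_bin_edges: np.ndarray,
    dest_hist_num_bin: int,
) -> tuple[dict, float]:
    """Merge the `from` histogram's counts into the destination bins,
    returning ({'bin_edges': ..., 'bin_counts': ...}, hist_loss)."""
    ...  # body moved from the class method, with `self.` references removed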

ksneab7 avatar May 15 '23 17:05 ksneab7

I presume this would also involve writing unit tests for `_assimilate_histogram` and `_regenerate_histogram` in `test_histogram_utils.py`? Neither has any direct unit tests at the moment, though they might be called indirectly from other tests.
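A rough sketch of what such a test could look like, reusing the inputs and printed output from the experiment posted later in this thread (the `dataprofiler.profilers` import path is an assumption):

import unittest

import numpy as np

from dataprofiler.profilers import histogram_utils  # assumed import path


class TestHistogramUtils(unittest.TestCase):
    def test_assimilate_histogram_conserves_total_count(self):
        result, hist_loss = histogram_utils._assimilate_histogram(
            from_hist_entity_count_per_bin=np.array([50, 0]),
            from_hist_bin_edges=np.array([0, 5, 10]),
            dest_hist_entity_count_per_bin=np.zeros(10),
            dest_hist_bin_edges=np.array((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)),
            dest_hist_num_bin=10,
        )
        # Total mass should survive the re-binning.
        self.assertAlmostEqual(result["bin_counts"].sum(), 50.0)
        # Destination edges should come back unchanged.
        np.testing.assert_array_equal(result["bin_edges"], np.arange(11))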

junholee6a avatar Aug 04 '23 00:08 junholee6a

Yes, we would definitely want to ensure test coverage here, @junholee6a

taylorfturner avatar Aug 04 '23 01:08 taylorfturner

Working on this. It'll take a bit to understand the functions and write test cases.

junholee6a avatar Aug 05 '23 19:08 junholee6a

In `_assimilate_histogram()`, it seems like each bin in `from_hist` maps to at most two bins in `dest_hist`. This means that when `from_hist` has much larger bins than `dest_hist`, its values aren't distributed evenly:

Test code:

# Assumes `import numpy as np` and that `histogram_utils` is the
# module under test.
def test_assimilate_histogram(self):
    ret_value = histogram_utils._assimilate_histogram(
        from_hist_entity_count_per_bin=np.array([50, 0]),
        from_hist_bin_edges=np.array([0, 5, 10]),
        dest_hist_entity_count_per_bin=np.zeros(10),
        dest_hist_bin_edges=np.array((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)),
        dest_hist_num_bin=10,
    )
    print('=============== _assimilate_histogram result ===============')
    print(ret_value)

Output (reformatted by hand):

=============== _assimilate_histogram result ===============
(
    {
        'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
        'bin_counts': array([10., 40.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
    },
    80.0
)

Expected output: bin_counts should be distributed more evenly?

=============== _assimilate_histogram result ===============
(
    {
        'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
        'bin_counts': array([10., 10.,  10.,  10.,  10.,  0.,  0.,  0.,  0.,  0.])
    },
    some_loss_value
)

Is this a bug?
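For reference, a minimal sketch of the overlap-proportional redistribution that would produce the expected counts above (`redistribute_by_overlap` is a hypothetical helper, not existing DataProfiler code):

import numpy as np


def redistribute_by_overlap(from_counts, from_edges, dest_edges):
    # Spread each source bin's count across destination bins in
    # proportion to how much of the source interval each one covers.
    dest_counts = np.zeros(len(dest_edges) - 1)
    for i, count in enumerate(from_counts):
        lo, hi = from_edges[i], from_edges[i + 1]
        if count == 0 or hi <= lo:
            continue
        for j in range(len(dest_edges) - 1):
            # Length of the intersection of [lo, hi) with dest bin j.
            overlap = min(hi, dest_edges[j + 1]) - max(lo, dest_edges[j])
            if overlap > 0:
                dest_counts[j] += count * overlap / (hi - lo)
    return dest_counts


print(redistribute_by_overlap(
    np.array([50, 0]),
    np.array([0, 5, 10]),
    np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
))  # -> [10. 10. 10. 10. 10.  0.  0.  0.  0.  0.]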

junholee6a avatar Aug 09 '23 03:08 junholee6a

Potentially related issue: https://github.com/capitalone/DataProfiler/issues/838

junholee6a avatar Aug 09 '23 15:08 junholee6a

#838 is related. If we merge the capabilities and then make that merged function a standalone, that would essentially solve both issues, but we separated these out because they accomplish different goals and are hefty tasks on their own for one developer.

ksneab7 avatar Aug 09 '23 15:08 ksneab7

In `_assimilate_histogram()`, it seems like each bin in `from_hist` maps to at most two bins in `dest_hist`. [...] Is this a bug?

Looking into this now

ksneab7 avatar Aug 09 '23 16:08 ksneab7

So there are a couple of things in this scenario that we can't necessarily make assumptions about. While I agree that the actual result:

{
    'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
    'bin_counts': array([10., 40.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
},

is not super intuitive, I also do not like the evenly distributed assumption:

{
    'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
    'bin_counts': array([10., 10., 10., 10., 10.,  0.,  0.,  0.,  0.,  0.])
},

because there is a pretty big assumption there as well. In both scenarios we are technically misrepresenting the data: shrinking the bins forces us to make one of these assumptions. This is an open question within the team and we are definitely open to suggestions. My personal thought is that whatever we choose, there needs to be a warning about losing accuracy when the bin sizes go from larger to smaller (fewer bins -> more bins).
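A minimal sketch of what that warning could look like (`_warn_if_bins_refined` is a hypothetical helper name):

import warnings


def _warn_if_bins_refined(from_hist_bin_edges, dest_hist_bin_edges):
    # Going from fewer, wider bins to more, narrower bins forces an
    # assumption about how entities spread inside each source bin.
    if len(dest_hist_bin_edges) > len(from_hist_bin_edges):
        warnings.warn(
            "Assimilating a histogram into a larger number of bins; "
            "bin counts are approximated and accuracy may be lost.",
            UserWarning,
        )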

@taylorfturner @micdavis @JGSweets @junholee6a thoughts?

ksneab7 avatar Aug 09 '23 16:08 ksneab7

because there is a pretty big assumption there as well.

Definitely, this assumes that the values in from_hist are evenly distributed within each bin, but I don't think there's a more intuitive alternative without more information about the data represented by the histograms.

My personal thought is that whatever we choose, there needs to be a warning about losing accuracy when the bin sizes go from larger to smaller (fewer bins -> more bins).

Maybe this is covered by returning a higher hist_loss value? Accuracy can be lost when using `_assimilate_histogram` even if the bin sizes don't change, e.g. when the bins are only shifted sideways from `from_hist` to `dest_hist`.
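For example, under the even-spread assumption, a pure half-bin shift already smears a fully populated bin (a worked toy case, same bin width on both sides):

import numpy as np

# 10 entities in the single bin [1, 2); destination edges shifted by 0.5.
# [1, 2) overlaps [0.5, 1.5) and [1.5, 2.5) by 0.5 each, so the count
# splits 5 / 5 even though the bin width never changed.
from_lo, from_hi, count = 1.0, 2.0, 10.0
dest_edges = np.array([0.5, 1.5, 2.5])
overlaps = np.minimum(from_hi, dest_edges[1:]) - np.maximum(from_lo, dest_edges[:-1])
print(count * np.clip(overlaps, 0, None) / (from_hi - from_lo))  # [5. 5.]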

junholee6a avatar Aug 09 '23 17:08 junholee6a

After thinking about it, I may be leaning more towards your suggested expected outcome. I think you're right that there is not a more intuitive approach without some other significant details being shared. If @taylorfturner and @micdavis are in agreement, I can throw a bug report up to implement that fix instead of its current state, which IMO is worse than assuming the even distribution.

ksneab7 avatar Aug 09 '23 20:08 ksneab7

Thank you @junholee6a for your input. This callout is VERY much appreciated and I am hopeful this will be looked into as time provides!

ksneab7 avatar Aug 10 '23 15:08 ksneab7