DataProfiler
`_assimilate_histogram` and `_regenerate_histogram` refactor into standalone functions
Is your feature request related to a problem? Please describe.
The `_assimilate_histogram` and `_regenerate_histogram` functions are not using `self` for anything of substance and as a result can be separated into their own standalone functions.
Describe the outcome you'd like: Move these two functions outside of the class to `histogram_utils.py` (a rough sketch is included below).
Additional context:
I presume this would also involve writing unit tests for `_assimilate_histogram` and `_regenerate_histogram` in `test_histogram_utils.py`? Neither has any direct unit tests at the moment, though they might be called indirectly from other tests.
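For illustration, a minimal sketch of what the standalone function and a first direct unit test could look like. The parameter names mirror the existing method; the import path, test class name, and assertion are assumptions, not the final design:

```python
# histogram_utils.py (sketch): the existing method body moves here unchanged;
# it simply stops taking `self`, which it never used for anything of substance.
def _assimilate_histogram(
    from_hist_entity_count_per_bin,
    from_hist_bin_edges,
    dest_hist_entity_count_per_bin,
    dest_hist_bin_edges,
    dest_hist_num_bin,
):
    ...  # existing logic, copied out of the profiler class


# _regenerate_histogram would be moved out the same way.


# test_histogram_utils.py (sketch): a first direct unit test.
import unittest

import numpy as np

from dataprofiler.profilers import histogram_utils


class TestHistogramUtils(unittest.TestCase):
    def test_assimilate_histogram(self):
        result, loss = histogram_utils._assimilate_histogram(
            from_hist_entity_count_per_bin=np.array([50, 0]),
            from_hist_bin_edges=np.array([0, 5, 10]),
            dest_hist_entity_count_per_bin=np.zeros(10),
            dest_hist_bin_edges=np.arange(11),
            dest_hist_num_bin=10,
        )
        # Re-binning should preserve the total entity count.
        self.assertEqual(50, np.sum(result["bin_counts"]))
```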
Yes, we would definitely want to ensure test coverage for these @junholee6a
Working on this. It'll take a bit to understand the functions and write test cases
In `_assimilate_histogram()`, it seems like each bin in the `from_hist` maps to at most two bins in the `dest_bin`. This means that when `from_hist` has much larger bins than `dest_bin`, its values aren't distributed evenly:
Test code:
```python
import numpy as np

# histogram_utils here refers to the module where _assimilate_histogram
# would live after the refactor.
def test_assimilate_histogram(self):
    ret_value = histogram_utils._assimilate_histogram(
        from_hist_entity_count_per_bin=np.array([50, 0]),
        from_hist_bin_edges=np.array([0, 5, 10]),
        dest_hist_entity_count_per_bin=np.zeros(10),
        dest_hist_bin_edges=np.array((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)),
        dest_hist_num_bin=10,
    )
    print('=============== _assimilate_histogram result ===============')
    print(ret_value)
```
Output (reformatted by hand):
```
=============== _assimilate_histogram result ===============
(
    {
        'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
        'bin_counts': array([10., 40., 0., 0., 0., 0., 0., 0., 0., 0.])
    },
    80.0
)
```
Expected output: `bin_counts` should be distributed more evenly?
```
=============== _assimilate_histogram result ===============
(
    {
        'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
        'bin_counts': array([10., 10., 10., 10., 10., 0., 0., 0., 0., 0.])
    },
    some_loss_value
)
```
Is this a bug?
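For reference, the expected output above corresponds to spreading each source bin's count across destination bins in proportion to their overlap, i.e. assuming values are uniform within a source bin. A small standalone sketch of that rule (the `spread_by_overlap` helper is hypothetical, not DataProfiler's implementation):

```python
import numpy as np


def spread_by_overlap(from_counts, from_edges, dest_edges):
    """Distribute each source bin's count over destination bins in
    proportion to overlapping width (uniform-within-bin assumption)."""
    dest_counts = np.zeros(len(dest_edges) - 1)
    for i, count in enumerate(from_counts):
        lo, hi = from_edges[i], from_edges[i + 1]
        width = hi - lo
        if count == 0 or width == 0:
            continue
        for j in range(len(dest_edges) - 1):
            overlap = max(0.0, min(hi, dest_edges[j + 1]) - max(lo, dest_edges[j]))
            dest_counts[j] += count * overlap / width
    return dest_counts


print(spread_by_overlap(np.array([50, 0]), np.array([0, 5, 10]), np.arange(11)))
# -> [10. 10. 10. 10. 10.  0.  0.  0.  0.  0.]
```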
Potentially related issue: https://github.com/capitalone/DataProfiler/issues/838
#838 is related. If we merge the capabilities and then make that merged function a standalone, that would essentially solve both issues, but we separated these out as they accomplish different goals and are hefty tasks on their own for one developer.
Looking into this now
So there are a couple of things with this scenario that we can't necessarily make assumptions about. While I agree that the actual result:
```
{
    'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
    'bin_counts': array([10., 40., 0., 0., 0., 0., 0., 0., 0., 0.])
}
```
is not super intuitive, I also do not like the evenly distributed assumption:
```
{
    'bin_edges': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
    'bin_counts': array([10., 10., 10., 10., 10., 0., 0., 0., 0., 0.])
}
```
because there is a pretty big assumption there as well. In both scenarios we are technically misrepresenting the data, since making the bins smaller forces us to make those assumptions. This is an open question within the team and we are definitely open to suggestions. My personal thought is that whatever we choose, there needs to be a warning about losing accuracy if the bin sizes go from larger to smaller (fewer bins -> more bins).
@taylorfturner @micdavis @JGSweets @junholee6a thoughts?
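One possible shape for such a warning, purely as a sketch (the helper name, the width comparison, and the message text are assumptions rather than an agreed design):

```python
import warnings

import numpy as np


def _warn_if_bins_refined(from_hist_bin_edges, dest_hist_bin_edges):
    """Warn when any source bin is wider than the narrowest destination bin,
    since the re-binning then has to guess how values spread within a bin."""
    if np.max(np.diff(from_hist_bin_edges)) > np.min(np.diff(dest_hist_bin_edges)):
        warnings.warn(
            "Assimilating a histogram into narrower bins; bin counts are "
            "approximated and may lose accuracy.",
            RuntimeWarning,
        )
```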
> because there is a pretty big assumption there as well.
Definitely, this assumes that the values in `from_hist` are evenly distributed within each bin, but I don't think there's a more intuitive alternative without more information about the data represented by the histograms.
> My personal thought is that whatever we choose, there needs to be a warning about losing accuracy if the bin sizes go from larger to smaller (fewer bins -> more bins).
Maybe this is covered by returning a higher `hist_loss` value? Accuracy can be lost when using `_assimilate_histogram` even if the bin sizes don't change, i.e. when the bins are only shifted sideways from `from_hist` to `dest_hist`.
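A tiny worked example of that point, still under the uniform-within-bin assumption (the numbers are illustrative only): the bin width stays the same, yet shifting the edges by half a bin already splits the counts.

```python
import numpy as np

# One source bin [0, 1) holding 10 entities; destination bins of the same
# width but shifted by 0.5.  Exact placement is lost even though no bin
# got narrower.
count, lo, hi = 10, 0.0, 1.0
dest_edges = np.array([-0.5, 0.5, 1.5])
overlaps = np.clip(
    np.minimum(hi, dest_edges[1:]) - np.maximum(lo, dest_edges[:-1]), 0, None
)
print(count * overlaps / (hi - lo))  # -> [5. 5.]
```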
After thinking about it, I may be leaning more towards your suggested expected outcome. I think you're right that there is not a more intuitive approach without some other significant details being shared. If @taylorfturner and @micdavis are in agreement, I can throw a bug report up to implement that fix instead of its current state, which IMO is worse than assuming the even distribution.
Thank you @junholee6a for your input. This call out is VERY much appreciated and I am hopeful this will be looked into as time provides!!!