skpro
[ENH] Improve efficiency of `Histogram Distribution`
Describe the maintenance issue
The `Histogram` distribution, implemented in #382 and #335, takes two parameters, `bins` and `bin_mass`, which are ragged arrays in the 2D case. These ragged arrays are stored in a list because ragged inputs cannot be vectorized easily. One way to vectorize them is to pad them to a common shape: pad `bin_mass` with `0`s, and pad `bins` with `-np.inf` on the left and `np.inf` on the right, so that all rows have equal length.
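To make the padding idea concrete, here is a minimal sketch. The helper name `pad_histogram_inputs` and the exact layout (one `-np.inf` on the left, `np.inf` filling the right, zero mass for every padding bin) are assumptions for illustration, not the layout used in the benchmarked implementation.

```python
import numpy as np


def pad_histogram_inputs(bins, bin_mass):
    """Pad ragged bin edges and masses to rectangular arrays (hypothetical helper)."""
    n_rows = len(bins)
    max_edges = max(len(b) for b in bins)
    # Assumed layout: one -inf on the left, the real edges, then +inf on the right.
    padded_bins = np.full((n_rows, max_edges + 2), np.inf)
    padded_bins[:, 0] = -np.inf
    # One mass per pair of adjacent edges; padding bins carry zero mass.
    padded_mass = np.zeros((n_rows, max_edges + 1))
    for i, (b, m) in enumerate(zip(bins, bin_mass)):
        padded_bins[i, 1 : len(b) + 1] = b
        padded_mass[i, 1 : len(m) + 1] = m
    return padded_bins, padded_mass


# Two histograms with different numbers of bins (made-up example values).
bins = [[0.0, 1.0, 2.0], [0.0, 2.0, 4.0, 6.0]]
bin_mass = [[0.5, 0.5], [0.2, 0.3, 0.5]]
padded_bins, padded_mass = pad_histogram_inputs(bins, bin_mass)
```

The Python-level loop over rows is the step whose cost is at issue here: it runs once per histogram regardless of how fast the later array operations are.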
The problem with this approach is that it is slower overall than the current one. Although the running times of the methods `mean` and `var` themselves improve by a factor of 5, padding the inputs in the way described above takes so long that the overall efficiency is worse than with the current approach.
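The trade-off can be illustrated by comparing a list-based mean with a vectorized mean on pre-padded arrays. This is only a sketch: the function names `mean_ragged` and `mean_padded` are hypothetical, the padded layout (one `-np.inf` on the left, `np.inf` on the right, zero mass in padding bins) is an assumption, and `bin_mass` is assumed to hold probability masses that sum to 1 per row.

```python
import numpy as np

inf = np.inf

# Ragged inputs stored in lists, as in the current approach (made-up values).
bins = [np.array([0.0, 1.0, 2.0]), np.array([0.0, 2.0, 4.0, 6.0])]
bin_mass = [np.array([0.5, 0.5]), np.array([0.2, 0.3, 0.5])]

# The same data in the assumed padded layout.
padded_bins = np.array(
    [[-inf, 0.0, 1.0, 2.0, inf, inf], [-inf, 0.0, 2.0, 4.0, 6.0, inf]]
)
padded_mass = np.array([[0.0, 0.5, 0.5, 0.0, 0.0], [0.0, 0.2, 0.3, 0.5, 0.0]])


def mean_ragged(bins, bin_mass):
    """Mean per histogram via a Python loop: sum of mass times bin midpoint."""
    return np.array(
        [np.sum(m * (b[:-1] + b[1:]) / 2) for b, m in zip(bins, bin_mass)]
    )


def mean_padded(padded_bins, padded_mass):
    """Same mean, computed in one vectorized pass over the padded arrays."""
    left, right = padded_bins[:, :-1], padded_bins[:, 1:]
    # Padding bins touch an infinite edge; zero their midpoints to avoid inf * 0.
    real = np.isfinite(left) & np.isfinite(right)
    mid = (np.where(real, left, 0.0) + np.where(real, right, 0.0)) / 2
    return (mid * padded_mass).sum(axis=1)


means_ragged = mean_ragged(bins, bin_mass)
means_padded = mean_padded(padded_bins, padded_mass)
```

Both functions return the same values; what differs is where the time goes. A fair comparison has to time the padding step together with `mean_padded`, for example with `timeit`, since the padding is not free.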
Refer here to know more about the benchmarking of the Histogram Distribution.
Taking the input from the user directly in this vectorized form, with all inputs already padded with `0`s and `inf`s, does not seem to be a good idea. It would be very inconvenient for the user to pad them manually when the lengths of the inputs vary widely, and it would also rule out `tuple` inputs for cases where the bins are of equal width.
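As an illustration of why a `tuple` shorthand is convenient for equal-width bins, the sketch below expands a tuple into explicit edges. The tuple form `(start, stop, n_bins)` and the helper name `expand_bins` are assumptions made for this example; check the `Histogram` docstring for the format actually accepted.

```python
import numpy as np


def expand_bins(bins):
    """Return explicit bin edges (hypothetical helper).

    Assumes a tuple means (start, stop, n_bins) with equal-width bins;
    anything else is treated as an explicit sequence of edges.
    """
    if isinstance(bins, tuple):
        start, stop, n_bins = bins
        # n_bins equal-width bins need n_bins + 1 edges.
        return np.linspace(start, stop, n_bins + 1)
    return np.asarray(bins, dtype=float)


edges = expand_bins((0.0, 4.0, 4))
```

A pre-padded input format would force the user to write out such edges, plus the padding, by hand for every row.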
The `Histogram` distribution inherits from `_BaseArrayDistribution`, which inherits from `BaseDistribution` and overrides some private functions to accommodate array distributions. Thoughts on ways of merging this with `BaseDistribution`, without having to create a separate base class for arrays, are also appreciated.