
skpro-refactoring (version-2)

Open jesellier opened this issue 4 years ago • 24 comments

See below some comments describing the contents of the upcoming refactoring:

  • Distribution classes refactoring in a more OOD way (see skpro->distribution)
  • Loss functions (see metrics->distribution)
  • Estimators (see skpro->estimators)

Some descriptive notebooks (in docs->notebooks) and a full set of unit tests (in tests) are also available.

jesellier avatar Apr 07 '20 20:04 jesellier

Distribution :

The v2 introduces an OOD implementation of the distribution classes. The 'DistributionBase' class serves as the main abstract base for all concrete distribution classes.

1. Vectorization and parameter access: The vectorization functionality is maintained, i.e. a set of distributions can be vectorized within a single object instantiation. A list of parameters is passed and kept as a member. A subset of the vector can be accessed through a 'getitem' implementation as before. Some new parameter accessors have been added:

-> get_params(index): overrides the scikit-learn 'get_params' implementation: 1. to get an indexed version of it if needed (since the distributions are vectorized here); 2. so that the deep functionality also works with lists of estimators (and not just simple estimators), which is needed for mixtures.

-> get_param(key): returns a list of the keyed parameter ('string') across the distributions. If the distribution is vectorized, it only operates on the distribution subset indexed by 'cached_index' (used for some modes of evaluation, see below).
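For illustration only, a minimal sketch of how such accessors could behave (the class and parameter names here are placeholders, not the actual skpro v2 code):

class VectorizedParamsSketch:
    def __init__(self, **params):
        # each parameter is kept as a list, one entry per vectorized distribution
        self._params = {key: list(value) for key, value in params.items()}
        self._cached_index = slice(None)   # default: all distributions (BATCH)

    def get_params(self, index=None, deep=True):
        # indexed variant of the scikit-learn style get_params
        if index is None:
            return dict(self._params)
        return {key: value[index] for key, value in self._params.items()}

    def get_param(self, key):
        # keyed parameter across the distribution subset selected by 'cached_index'
        return self._params[key][self._cached_index]

dists = VectorizedParamsSketch(loc=[0.0, 1.0, 2.0], scale=[1.0, 1.0, 2.0])
print(dists.get_params(index=1))   # {'loc': 1.0, 'scale': 1.0}
print(dists.get_param('loc'))      # [0.0, 1.0, 2.0]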

2. Additional components added to characterize the distribution type: -> distType enum: (Discrete, Continuous, Mixed) -> Support class: an object that returns a boolean for a given scalar - by default a NullSupport that always returns TRUE is instantiated (see distributions->component->support). Both are passed by composition at the base class level.
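A minimal sketch of these components (names follow the description above; the IntervalSupport example is an assumption):

from enum import Enum

class distType(Enum):
    DISCRETE = 1
    CONTINUOUS = 2
    MIXED = 3

class NullSupport:
    # default support: accepts any scalar
    def __call__(self, x):
        return True

class IntervalSupport:
    # hypothetical restricted support, e.g. [0, +inf) for a positive distribution
    def __init__(self, lower=float('-inf'), upper=float('inf')):
        self.lower, self.upper = lower, upper

    def __call__(self, x):
        return self.lower <= x <= self.upper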

3. Evaluation function interface (PDF/CDF): The distribution evaluation methods rely on an implicit interface design, to separate the processing/checking common to all distributions from the more specific implementation details.

The basic interface methods (pdf/pmf/cdf) are the ones visible to and used by the user. Each of these first performs a series of checks, i.e.: 1. assess the validity of the distribution type for the function considered (e.g. pdf should only accept continuous or mixed distributions), 2. ensure that the argument is within the support range, and 3. manage the mode of evaluation ([BATCH, ELEMENT_WISE] - described later).

They then call abstracted implied methods (pdf_impl/pmf_impl/cdf_impl) that must contain the implementation details. These are declared abstract in the DPQRMixin and must be overridden in every concrete subclass. If no concrete override is implemented, the DPQRMixin implementation raises an error by default.
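A condensed sketch of this interface/implementation split (toy classes, not the actual skpro code):

import abc

class DPQRMixinSketch:
    # implied methods: must be overridden in every concrete distribution class
    @abc.abstractmethod
    def pdf_impl(self, X):
        raise NotImplementedError("pdf_impl is not implemented for this distribution")

class DistributionSketch(DPQRMixinSketch):
    def __init__(self, support=lambda x: True, continuous=True):
        self.support = support
        self.continuous = continuous

    def pdf(self, X):
        # public interface method: run the common checks, then delegate
        if not self.continuous:
            raise ValueError("pdf is only defined for continuous or mixed distributions")
        if not all(self.support(x) for x in X):
            raise ValueError("some arguments fall outside the distribution support")
        return self.pdf_impl(X)

class UniformSketch(DistributionSketch):
    # toy concrete class: uniform density on [0, 1]
    def __init__(self):
        super().__init__(support=lambda x: 0.0 <= x <= 1.0, continuous=True)

    def pdf_impl(self, X):
        return [1.0 for _ in X]

print(UniformSketch().pdf([0.2, 0.8]))   # [1.0, 1.0]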

[Diagram: skpro_distribution]

4. Mode mechanism: The existing mode-of-evaluation mechanism (i.e. [BATCH] or [ELEMENT_WISE]) has been adapted. Every concrete evaluation method is assumed to be implemented in BATCH mode. The parameter arguments must be accessed using the private method 'get_param' declared in the distribution base class, which returns the vectorized parameters sliced by a private 'cached_index' member. By default 'cached_index' is set (and reset) to 'slice(None)', which corresponds to a BATCH evaluation where all distributions are evaluated.

The ELEMENT_WISE mode (if activated) decorates the implied methods with 'elementWiseDecorator'. It evaluates each distribution only on its corresponding sample subset (according to the element-wise rule) by modifying the cached index before calling the parameter accessor. The results are then aggregated back into a result list.

The 'Mode' of a distribution object can be accessed or changed by the user using the following methods:

-> getMode(): outputs the currently active Mode -> setMode(): sets the current mode. Accepts a 'Mode' enum argument [Mode.ELEMENT_WISE, Mode.BATCH]
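A compact sketch of the mode mechanism described above (illustrative names; the placeholder computation in pdf_impl is not a real density):

from enum import Enum

class Mode(Enum):
    BATCH = 0
    ELEMENT_WISE = 1

class ModedDistributionSketch:
    def __init__(self, loc):
        self.loc = list(loc)                # vectorized parameter
        self._mode = Mode.BATCH
        self._cached_index = slice(None)    # BATCH: all distributions selected

    def getMode(self):
        return self._mode

    def setMode(self, mode):
        self._mode = mode

    def _get_param(self, key):
        # parameter accessor: slices the vectorized parameter by the cached index
        return getattr(self, key)[self._cached_index]

    def pdf_impl(self, X):
        # written assuming BATCH mode: every distribution evaluated on every sample
        locs = self._get_param("loc")
        return [[x - l for l in locs] for x in X]

    def pdf(self, X):
        if self._mode is Mode.BATCH:
            return self.pdf_impl(X)
        # ELEMENT_WISE: the i-th distribution is evaluated on the i-th sample only,
        # by temporarily modifying the cached index before the accessor is called
        out = []
        for i, x in enumerate(X):
            self._cached_index = slice(i, i + 1)
            out.append(self.pdf_impl([x])[0][0])
        self._cached_index = slice(None)    # reset to the BATCH default
        return out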

5. Concrete class implementation: a list of concrete classes has been implemented.

  • Normal
  • MultidimensionalNormal
  • Laplace
  • Customized (i.e. passing a functor as argument for the pdf and cdf)
  • Mixture (a sketch of the Customized and Mixture ideas follows this list)
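For the last two, a rough standalone sketch of the idea (using scipy.stats densities as stand-ins; the actual skpro class signatures may differ):

from scipy.stats import laplace, norm

class CustomizedSketch:
    # 'Customized': pdf/cdf functors are passed in directly
    def __init__(self, pdf_functor, cdf_functor):
        self.pdf_impl = pdf_functor
        self.cdf_impl = cdf_functor

class MixtureSketch:
    # 'Mixture': weighted combination of component distributions
    def __init__(self, components, weights):
        self.components = components
        self.weights = weights

    def pdf_impl(self, x):
        return sum(w * c.pdf(x) for c, w in zip(self.components, self.weights))

cust = CustomizedSketch(norm(0, 1).pdf, norm(0, 1).cdf)
print(cust.pdf_impl(0.0))

mix = MixtureSketch([norm(0, 1), laplace(0, 1)], [0.5, 0.5])
print(mix.pdf_impl(0.0))   # 0.5 * normal pdf(0) + 0.5 * laplace pdf(0)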

PS: Some implementations can be seen in the tests (tests->distribution) or in the notebooks (notebooks->skpro_distribution_description and notebooks->skpro_distribution_interface)

jesellier avatar Apr 07 '20 20:04 jesellier

Losses

The refactoring introduces an OOD implementation for the losses as well.

1. Probabilistic losses as functor objects: the logLoss, a ClippedLogLoss and the integratedLogLoss have been added.

  • they inherit from a common base "ProbabilisticLossFunction" (that contains some basic common methods for input checking).
  • and implement a call method that takes a skpro.distribution and an array of targets, and returns the losses (sketched below).
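A minimal sketch of the functor idea (assumed names and interface; the real classes additionally perform input checking):

import numpy as np

class ProbabilisticLossFunction:
    # common base: shared input-checking helpers would live here
    def __call__(self, distribution, targets):
        raise NotImplementedError()

class LogLossSketch(ProbabilisticLossFunction):
    def __call__(self, distribution, targets):
        # negative log-likelihood of each target under the predicted distribution
        return -np.log(distribution.pdf(targets))

class ClippedLogLossSketch(ProbabilisticLossFunction):
    def __init__(self, minimum=1e-10):
        self.minimum = minimum

    def __call__(self, distribution, targets):
        # clip the density away from zero to avoid infinite losses
        return -np.log(np.clip(distribution.pdf(targets), self.minimum, None))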

[Diagram: Losses]

2. ProbabilisticScorer: Similar to a scikit-learn scorer, it acts as a wrapper class that directly evaluates an error given a loss functor and a probabilistic estimator.

  • it holds the associated loss by composition (i.e. as a member)
  • implements a 'call' method that takes a probabilistic estimator, an array of test samples (X) and an array of targets (y), and returns a scalar total loss (with some specified calculation mode). The distribution mode is internally set to [ELEMENT_WISE] for the calculation and then reset to its original value afterwards.
  • it is compatible with the scikit-learn scoring interface (i.e. can be used in a probabilistic setting as the scoring argument for the scikit-learn GridSearchCV() or cross_val_score() functions); a rough sketch follows this list.
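A rough sketch of the scorer (sign conventions and the actual skpro signatures aside; the class name is a placeholder):

import numpy as np

class ProbabilisticScorerSketch:
    def __init__(self, loss_function):
        self.loss_function = loss_function    # loss functor held by composition

    def __call__(self, estimator, X, y):
        # scikit-learn scoring signature: (estimator, X, y) -> scalar
        dist = estimator.predict_proba(X)     # predictive distribution
        return float(np.mean(self.loss_function(dist, y)))

Used e.g. as cross_val_score(model, X, y, scoring=ProbabilisticScorerSketch(LogLossSketch())), with the caveat that scikit-learn expects greater-is-better scores, so in practice the loss would be negated.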

PS: A default score method (directly instantiating and returning the scorer) can be associated with a ProbabilisticEstimator object by inheriting from the mixin DefaultScoreMixin (in skpro->estimators->score - set to logLoss).

3. ModelManager for scoring: A ModelManager has been added to compute the scores or cross_val_score of a vector of models in a probabilistic setting (see skpro->cross_validation; a rough sketch of the idea follows this list):

  • the folding of the CV is shared across the models (i.e. optimized)
  • the outputs are returned in a pandas DataFrame
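Independently of the actual ModelManager API, shared-fold cross-validation over several models could look roughly like this (pandas/scikit-learn only; the function name is hypothetical):

import pandas as pd
from sklearn.model_selection import KFold

def cross_val_scores_shared_folds(models, X, y, scorer, n_splits=5):
    # the folds are computed once and shared across all models
    folds = list(KFold(n_splits=n_splits).split(X))
    results = {}
    for name, model in models.items():
        scores = []
        for train, test in folds:
            model.fit(X[train], y[train])
            scores.append(scorer(model, X[test], y[test]))
        results[name] = scores
    return pd.DataFrame(results)   # one column per model, one row per fold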

PS: some tests can be found in tests->test_validation

jesellier avatar Apr 07 '20 22:04 jesellier

Estimators

Probabilistic estimator design: All probabilistic estimators inherit from a common abstract base class "ProbabilisticEstimators" (which itself inherits from the scikit-learn BaseEstimator). The base class specifies the methods 'predict' (which returns a user-defined point estimate associated with the distribution of the probabilistic prediction) and 'predict_proba' (which returns the predictive distribution directly).
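In sketch form (the class name and the mean-based default are illustrative, not the actual skpro code):

import abc
from sklearn.base import BaseEstimator

class ProbabilisticEstimatorSketch(BaseEstimator, metaclass=abc.ABCMeta):

    @abc.abstractmethod
    def predict_proba(self, X):
        """Return the predictive distribution for the samples in X."""

    def predict(self, X):
        # point estimate derived from the predictive distribution
        # (here the mean; the proposal leaves the point type user-definable)
        return self.predict_proba(X).mean()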

The concrete '2-step' probabilistic estimator has been segregated into three inter-connected sub-estimators:

  • The ParametricEstimator (skpro->estimators->parametrics) serves as the main probabilistic estimator class.
  • The ResidualEstimator (skpro->estimators->residuals) is a classical estimator used to estimate the residual part separately.
  • The ClippedEstimator (skpro->estimators->clipped) is a wrapping estimator used for clipping the ResidualEstimator. It replaces predictions of the wrapped estimator that fall below a specified minimum threshold.

The main ParametricEstimator is initialized with a classical 'mean_estimator' and a 'ResidualEstimator' object as members (possibly wrapped by a ClippedEstimator). Since the residual estimator must use the fitted mean_estimator for its own fitting, the two must be linked: the mean_estimator is therefore also shared/passed to the residual estimator within the scope of the 'ParametricEstimator' init method. This is done through a 'call' overload method (acting as a setter) declared in the abstract base of all residual estimators. The parametric 'fit' method then works by fitting the mean estimator first and the residual estimator second (which thus makes use of the already fitted mean estimator).

The losses and the distribution type are finally defined in a modular way (a condensed sketch of the fit logic follows this list):

  • The residual estimator takes as argument a classical point estimator and a classical loss_function functor (squared loss by default)
  • The parametric estimator takes as argument a skpro.distribution object used to specify its distribution return type (by recalling its class constructor with the newly trained parameters).
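A condensed sketch of the fit logic described above (LinearRegression as a stand-in sub-estimator is an assumption of this sketch; the squared residual target mirrors the squared-loss default mentioned above):

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

class ResidualEstimatorSketch(BaseEstimator):
    def __init__(self, estimator=None):
        self.estimator = estimator if estimator is not None else LinearRegression()
        self.mean_estimator = None

    def __call__(self, mean_estimator):
        # 'call' overload acting as a setter: links the shared mean estimator
        self.mean_estimator = mean_estimator
        return self

    def fit(self, X, y):
        # fits on the squared residuals of the already fitted mean estimator
        residuals = (y - self.mean_estimator.predict(X)) ** 2
        self.estimator.fit(X, residuals)
        return self

    def predict(self, X):
        return np.sqrt(np.clip(self.estimator.predict(X), 0.0, None))

class ParametricEstimatorSketch(BaseEstimator):
    def __init__(self, mean_estimator, residual_estimator):
        self.mean_estimator = mean_estimator
        # the mean estimator is shared with the residual estimator at init time
        self.residual_estimator = residual_estimator(self.mean_estimator)

    def fit(self, X, y):
        # step 1: fit the mean estimator; step 2: fit the residual estimator
        self.mean_estimator.fit(X, y)
        self.residual_estimator.fit(X, y)
        return self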

jesellier avatar Apr 07 '20 23:04 jesellier

Others

The other parts of the code (Bagging, Baselines, Empirical densities) have been adapted to the changes:

For Bagging: A probabilistic estimator 'ProbabilisticBaggingRegressor' has been added in skpro->ensemble with its specific 'BaggedDistribution' predict return type.

The refactored empirical density wrapper estimator can be found in the skpro->baseline->density_baseline baseline.

jesellier avatar Apr 07 '20 23:04 jesellier

makes a lot of sense. I have no major comments or criticism at this point in time. (except typos: you mean oop, not ood, right?)

As a proposal/issue content, this is also very nicely worked out.

I would hence suggest to create a wiki, and put all the above in a new page, skpro v2 design documents or similar, so it doesn't get forgotten once this issue is closed.

fkiraly avatar Apr 11 '20 11:04 fkiraly

Hi @jesellier,

Nice proposal, thanks for writing this up. This seems like a natural evolution of skpro.

I like the idea of making the loss functions callable objects. Once the abstract interface is in place it might be worth revisiting issue #2?

I have three minor comments on the distribution diagram:

  1. In the DPQRMixin you proposed an implied method, which seems to be the method that contains the actual implementation, is that right? If these methods are implementation-only and should not be called by users, I'd suggest making them private, e.g. _pdf() etc. That would avoid the name clash with the public pdf(), so there would be no need for the _impl suffix.
  2. Correct me if I'm wrong but it looks like the DPQRMixin contains no concrete implementation but is just an abstract interface description. If so, why is it a Mixin and not just a direct declaration in the DistributionBaseClass? Or are you planning to provide different concrete implemented Mixins that inherit from the interface and that users can include based on their needs?
  3. We once discussed in #3 whether point makes sense or it should just be mean to avoid confusion. Might be worth renaming that during the refactoring.

Let me know if I can help with anything. I'm happy to contribute or review.

frthjf avatar Apr 11 '20 13:04 frthjf

Hello @frthjf

Thanks for your comment. To answer your points:

  1. It's a valid one, it should indeed be private. I can easily add it. As for the suffix 'impl', it is actually a common one that is used when separating interface from implementation (at least in C++, or where I worked). I just naturally picked it up.

  2. It can equally be put directly into the distribution base. I just created the Mixin to lighten the base class content, since as such it is just a group of similar pure abstract methods with no real information conveyed (instead of having everything in one file).

  3. Both should be OK. But I can see some advantage in using the 'point' naming convention. It gives the user (at no cost) the flexibility to choose what the point prediction should be for each distribution (which might be something other than the mean for some). An alternative would be to have a concrete point in the BasicStatsMixin that returns the abstract 'mean' (i.e. a default implementation). Then 'point' can be overridden in every distribution subclass if the point prediction is not the 'mean'. As such we get the best of both worlds (flexibility and not having to specify the point for every distribution class).

Actually, if you have time, I can send you a code review request before committing. That would be really helpful. We can also organize a discussion for the future roadmap.

jesellier avatar Apr 18 '20 16:04 jesellier

Thanks!

  1. That's interesting, either way it will be fine
  2. Makes sense.
  3. I think that's a good way to put it, although there is some cost to it, e.g. users being confused and no automatic compatibility with np.mean (see #3). However, it's still a valid choice, so I'm fine with using point(). Following this logic, should we also use deviation as abstract and std as concrete implementation?

Yes, I'd be happy to review and discuss if that's helpful.

frthjf avatar Apr 19 '20 17:04 frthjf

Re 3 - One could give distributions a location and scale method, which are distribution-specific, besides mathematically defined statistics such as mean, median, variance, etc. I would not call anything point since it implies a prediction, which is not a property of the distribution object itself.

It is interesting to think about what the default should be, in reduction (probabilistic to deterministic). I thought that should be the mean (since the predictive conditional mean minimizes MSE), but one could also choose it to be location for semantics, or median for consistency and for making sure it always exists.

fkiraly avatar Apr 19 '20 21:04 fkiraly

I think having location/scale as generic and semantically vague default and optional derived mean/variance that may or may not exist is a good idea.

frthjf avatar Apr 20 '20 20:04 frthjf

But not all distributions can be fully parametrized by two location/scale parameters. Not sure it should be pushed to the distribution base class level.

And actually I have been using a mixin class 'LocationScaleMixin' (see distributions->location_scale.py) for the location-scale family distributions (i.e. normal and laplace so far, which are both instantiated with loc and scale parameters). Right now it is mainly used for the '2-step estimator': 1. to ensure that the pre-specified returned distribution type is indeed a location-scale one (i.e. inheriting from this mixin); 2. to convert the std estimate (from the residuals) into a scale parameter (one-to-one for normal but different for laplace). It can be developed a bit more (e.g. with loc and scale getter methods declared at this mixin level).

For the point method, an alternative might be to simply remove it here. The point prediction type can be directly encoded in the ProbabilisticEstimator base class (i.e. the predict method). We can imagine passing a string that specifies the return type ('mean', 'median', 'loc') - then conditions apply depending on the distribution (since not all might have a loc method, for example). That's just a suggestion.

jesellier avatar Apr 21 '20 08:04 jesellier

I think removing would weaken the purpose of the distribution interface. After all, you want to type my_distribution. and have your IDE autocomplete my_distribution.mean() for you. How about we just define all of them as

def location(self):
    raise NotImplementedError()
def mean(self):
    raise NotImplementedError()
...

and let the user decide which one to implement. When interacting with the distribution we can just operate on it in pythonic EAFP style:

try:
    return dist.mean()
except NotImplementedError:
    return dist.location()

It's probably best to also introduce a NotExistingError() for the user to indicate that the mean actually does not exist.
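For concreteness, such exceptions could simply be small custom error classes (a sketch, not an existing skpro API):

class NotExistingError(Exception):
    """The requested statistic does not exist for this distribution (e.g. the mean of a Cauchy)."""

class NotExactError(Exception):
    """The statistic is implemented, but only as an approximation."""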

frthjf avatar Apr 21 '20 19:04 frthjf

But not all distributions can be fully parametrized by two location/scale parameters. Not sure it should be pushed to the distribution base class level.

Of course, especially mixtures and such. If we go with location/scale methods, I would still strongly argue to have the actual parameterization in the hyper-parameters, and location/scale not necessarily reflecting a complete parameterization.

Though maybe location/scale methods are not a good idea after all? Since not all distributions will have one that is well-defined (e.g., what even is that for a mixture).

After all, you want to type my_distribution. and have your IDE autocomplete my_distribution.mean() for you.

How would that auto-complete even work, @frthjf? Based on what would one compute the mean?

It's probably best to also introduce a NotExistingError() for the user to indicate that the mean actually does not exist.

Yes, good idea to distinguish this, or more precisely between:

  • does not exist (theoretically)
  • is implemented, and analytically or numerically exact
  • is implemented, but may not be exact, e.g., by auto-complete or numerical imputation

fkiraly avatar Apr 21 '20 19:04 fkiraly

How would that auto-complete even work, @frthjf? Based on what would one compute the mean?

If the method is defined by the base class, it would be autocompleted via __dict__ lookup (at least in jupyter notebooks). If you can't compute the mean because the distribution does not have one, the call just triggers the exception and you get an error, simple as that.

Yes, good idea to distinguish this, or more precisely between:

* does not exist (theoretically)

* is implemented, and analytically or numerically exact

* is implemented, but may not be exact, e.g., by auto-complete or numerical imputation

Good point! That's nice because the errors become automatically descriptive and can be handled properly. For example, it would be very easy to fall back on an approximation if an analytical solution does not exist, like

from skpro import exceptions as E

try:
    dist.mean()
except E.NotExisting:
    raise AttributeError('Tough luck, my friend')
except E.NotExact:
    print('warn: falling back on approximation')
    approximateMean(dist)

frthjf avatar Apr 21 '20 20:04 frthjf

@frthjf actually I had in mind to remove only the point method, not the other ones. This is what I commented above:

import abc

import numpy as np
from sklearn.base import BaseEstimator


class BasicStatsMixin():
    @abc.abstractmethod
    def mean(self):
        raise NotImplementedError()

    @abc.abstractmethod
    def std(self):
        raise NotImplementedError()
[...]

class DistributionBase(BaseEstimator, BasicStatsMixin, DPQRMixin):
[...]

class LocationScaleMixin():
    @abc.abstractmethod
    def scale(self):
        raise NotImplementedError()

    @abc.abstractmethod
    def loc(self):
        raise NotImplementedError()

    @classmethod
    def varianceToScale(cls, variance):
        raise NotImplementedError()


class Normal(DistributionBase, LocationScaleMixin):
    def __init__(self, loc=0.0, scale=1.0):
        # stored under private names so the loc()/scale() accessors are not shadowed
        self._loc = loc
        self._scale = scale
        [...]

    def mean(self):
        return self._loc

    def std(self):
        return self._scale

    def variance(self):
        return np.square(self._scale)

    def loc(self):
        return self._loc

    def scale(self):
        return self._scale

    @classmethod
    def varianceToScale(cls, variance):
        return np.sqrt(variance)
jesellier avatar Apr 21 '20 21:04 jesellier

Then, having such a class defined, we can restrict the 2-step estimator to it:

class ParametricEstimator(ProbabilisticEstimator):
    def __init__(self, mean_estimator, dispersion_estimator,
                 distribution=Normal(), ...):

        if not isinstance(distribution, LocationScaleMixin):
            raise ValueError("Unknown strategy type: expected a location_scale distribution")
        self.distribution = distribution

jesellier avatar Apr 21 '20 21:04 jesellier

Ok, makes sense! I'm still not fully convinced of the Mixin approach. I prefer a ducktyping approach over explicit declaration, but I think that's just a matter of taste, so happy to go with that.

frthjf avatar Apr 23 '20 16:04 frthjf

I'm actually more in favour of explicit typing here, since functionality will be type or property dependent - I do think mixins add a lot here without becoming too unpythonic. This is very similar in flavour to the sklearn RegressorMixin etc, I think.

fkiraly avatar Apr 23 '20 17:04 fkiraly

Actually, following our last discussion @fkiraly, I am wondering whether it would not be better to delegate all the distributions to TensorFlow Probability. They already have a full framework in place that seems well done (in functionality and range of existing distributions), and we would then have immediate compatibility with neural-network based estimators (without any need for wrappers). It should not take too long to adapt the rest to it.

see : https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/jupyter_notebooks/A_Tour_of_TensorFlow_Probability.ipynb

jesellier avatar May 05 '20 19:05 jesellier

yes, it looks mature enough - I remember we had this discussion offline. Especially, it supports vectorization, which we need.

To recall, I think the problems are:

  • does not inherit from sklearn estimators, therefore composites may get messy
  • unclear how easy it is to extend them with methods that tf probability does not implement, such as integrated cdf
  • even if it were easy, we introduce a dependency on the entirety of tensorflow probability, since the distributions are not their own package (as they should be!)

That's not a complete "no", but there is a trade-off, where this kind of discussion is best had in a design phase (rather than after having implemented things...)

fkiraly avatar May 05 '20 20:05 fkiraly

I've been working with TFP for a while and I agree it has a good feature set (after all, PyMC switched to it as a backend as well). One big benefit would be that we would get hardware acceleration for free, and it would also save a lot of reinventing the wheel. So if the question is whether we should delegate the low-level functionality to Tensorflow, I think that's a good idea. When it comes to the high-level API, however, Tensorflow can indeed get messy very easily. The upside is that there is lots of room for skpro to provide value beyond just being a wrapper if we manage to abstract away some of TF's messiness. I'd be interested to work on this kind of problem, although I suspect it's not going to be easy.

frthjf avatar May 05 '20 21:05 frthjf

Following our discussion by mail, I have coded a skeleton with TensorFlow Probability that seems to work.

E.g. for the base distribution class: -> inherits from the sklearn BaseEstimator and the TensorFlow Distribution -> holds a TensorFlow distribution by composition (member: _dist)


from enum import Enum

import tensorflow as tf
import tensorflow_probability as tfp
from sklearn.base import BaseEstimator

tfd = tfp.distributions


class supportType(Enum):
    UNDEFINED = 0
    DISCRETE = 1
    CONTINUOUS = 2
    MIXED = 3

class outputType(Enum):
    ND_ARRAY = 0
    TF_TENSOR = 1


class DistributionBase(BaseEstimator, tfd.Distribution):

    def __init__(self, dist, stype=supportType.UNDEFINED):

        if not isinstance(dist, tfd.Distribution):
            raise ValueError('Invalid composite distribution %s for distribution %s. '
                             'Must be of type `tensorflow_probability.distributions.Distribution`.' %
                             (dist, self))
        self._dist = dist
        self._stype = stype
        self._otype = outputType.ND_ARRAY

    def dist(self):
        return self._dist

Then we can have get_params and set_params methods that act on the _dist member:

    def get_params(self, index=None, deep=True, output_tensor=False):
        out = dict()

        # duplicate the scikit-learn implementation
        for key, value in self._dist.parameters.items():
            value = self._convert_to_array(value)

            if index is not None \
               and not isinstance(value, str) \
               and hasattr(value, "__len__") \
               and hasattr(value, '__getitem__'):
                out[key] = value[index]
            else:
                out[key] = value

        return out

    def set_params(self, **params):

        if not params:
            # Simple optimization to gain speed (inspect is slow)
            return self

        valid_params = self.get_params(deep=True)

        for key, value in params.items():
            if key not in valid_params:
                raise ValueError('Invalid parameter %s for estimator %s. '
                                 'Check the list of available parameters '
                                 'with `estimator.get_params().keys()`.' %
                                 (key, self))

        dic = dict(self._dist.parameters, **params)
        self._dist = type(self._dist)(**dic)

        return self

And I override the tf.Distribution methods: -> just a few below as examples (there are many more available) -> I added an automatic output conversion to array (vs. the tf.tensor type) that is controlled by the self._otype member

    def _convert_to_array(self, arg):
        if isinstance(arg, tf.Tensor):
            if self._otype is outputType.ND_ARRAY:
                arg = arg.numpy()
        return arg

    # accessors
    @property
    def name(self):
        return self._dist.name

    def sample(self, sample_shape=(), seed=None, name='sample', **kwargs):
        return self._dist.sample(sample_shape=sample_shape, seed=seed, name=name, **kwargs)

    def prob(self, value, name='prob', **kwargs):
        out = self._dist.prob(value=value, name=name, **kwargs)
        return self._convert_to_array(out)

    def cdf(self, value, name='cdf', **kwargs):
        out = self._dist.cdf(value=value, name=name, **kwargs)
        return self._convert_to_array(out)

    def entropy(self, name='entropy', **kwargs):
        out = self._dist.entropy(name=name, **kwargs)
        return self._convert_to_array(out)

    def mean(self, name='mean', **kwargs):
        out = self._dist.mean(name=name, **kwargs)
        return self._convert_to_array(out)
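For reference, this is roughly the raw TFP call that the wrapper converts to numpy output (a standalone snippet, assuming tensorflow_probability is installed and eager execution is on):

import numpy as np
import tensorflow_probability as tfp

tfd = tfp.distributions

dist = tfd.Normal(loc=0.0, scale=1.0)
out = dist.prob(np.array([0.0, 1.0]))   # returns a tf.Tensor
print(out.numpy())                      # converted to a numpy array, as the wrapper above does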

jesellier avatar May 14 '20 12:05 jesellier

We can then add mixins. See below with the normal distribution (adding two mixins that provide the L2-norm and the loc-scale methods):

# continuing from the skeleton above (DistributionBase, supportType, tfd)
import abc

import numpy as np
import tensorflow as tf


class DPQRMixin():

    @abc.abstractmethod
    def squared_norm(self):
        raise NotImplementedError()

class LocationScaleMixin():

    @property
    def loc(self):
        raise NotImplementedError()

    @property
    def scale(self):
        raise NotImplementedError()

    @classmethod
    def varianceToScale(cls, variance):
        raise NotImplementedError()


class Normal(DistributionBase, DPQRMixin, LocationScaleMixin):

    def __init__(self,
                 loc=0,
                 scale=1,
                 validate_args=False,
                 allow_nan_stats=True,
                 name='Normal'):

        dist = tfd.Normal(
            loc,
            scale,
            validate_args,
            allow_nan_stats,
            name
            )

        super(Normal, self).__init__(dist, stype=supportType.CONTINUOUS)

    @property
    def loc(self):
        return self._dist.loc

    @property
    def scale(self):
        return self._dist.scale

    def squared_norm(self):
        # squared L2 norm of the normal pdf: 1 / (2 * scale * sqrt(pi))
        scale = tf.convert_to_tensor(self._dist.scale)
        sqrt_pi = tf.constant(np.sqrt(np.pi), dtype=scale.dtype)
        out = 1.0 / (2.0 * scale * sqrt_pi)

        return out
jesellier avatar May 14 '20 12:05 jesellier

Overall, the code is much lighter for pretty much the same functionality. And we get all the additional TensorFlow functionality for free (including AAD).

jesellier avatar May 14 '20 12:05 jesellier

superseded by https://github.com/sktime/skpro/issues/10

fkiraly avatar May 11 '23 16:05 fkiraly