Remove metric volatility module

Matgrb opened this issue 3 years ago · 3 comments

Issue Description

Currently, the metric volatility module implements classes that allow assessing performance volatility when you split the data into train/test sets with different seeds. However, I don't see a major difference from just:

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10)
cross_val_score(clf, X, y, cv=cv)

or

cv = StratifiedShuffleSplit(n_splits=100)
cross_val_score(clf, X, y, cv=cv)

Also, it can be used to measure the stability of the results for different test data sizes, but that is exactly what sklearn's learning curve does.
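For reference, a minimal sketch of the sklearn learning_curve call being referred to (clf, X, and y are placeholders for an estimator and a dataset):

import numpy as np
from sklearn.model_selection import StratifiedKFold, learning_curve

# The spread of test_scores per training size is the volatility of the metric at that size.
train_sizes, train_scores, test_scores = learning_curve(
    clf, X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # increasing fractions of the train fold
    cv=StratifiedKFold(n_splits=5),
    scoring="roc_auc",
)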

Do you see any other use cases for it, and a reason to keep it? @timvink @sbjelogr

On top of that, I don't think the module is being used by users. I think it would be best to completely remove the module from probatus and decrease the maintenance workload.

Matgrb · Apr 15 '21 07:04

Hi @Matgrb, I still see value in keeping the volatility estimators, but I do believe that a code refactor is needed.

Indeed, all the fit methods of the different volatility estimators can be replaced by the code snippets you propose (with some adjustments). But besides wrapping the code around sklearn, the metric volatility also returns all the stats of the different distributions and their visualizations. By using the sklearn CV or Split classes, you could also compute the uncertainty of multiple metrics in one go, which would be a nice addition.
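For example (a rough sketch, with clf, X, and y as placeholders), cross_validate already accepts several scorers in one go:

from sklearn.model_selection import StratifiedShuffleSplit, cross_validate

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.25, random_state=0)
results = cross_validate(clf, X, y, cv=cv, scoring=["roc_auc", "average_precision"])

# results["test_roc_auc"] and results["test_average_precision"] are arrays of length
# n_splits; their mean and std summarize the volatility of each metric.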

Potential add-ons to volatility estimators

On top of this, if you use cross_validate instead of cross_val_score and return the estimators, you can do quite some nice bias-variance estimations of the different models. As an example, the first thing that comes to my mind is to compute the cosine similarities between the coefficients of all the LogisticRegression models fitted during the cross_validation step. This gives you a way to measure how much your models change with varying data: in the best-case scenario the cosine similarity would always be 1, and the lower the value, the more your models are "rotating in the feature space" (in other words, you have a higher variance effect due to the data size). Probably a similar evaluation could be done with tree ensembles (maybe by doing a similar test on feature_importance or shap values), but in that context the geometrical interpretation does not hold anymore :)
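A minimal sketch of that idea, using only standard sklearn calls (X and y are placeholders for a binary classification dataset):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import StratifiedKFold, cross_validate

res = cross_validate(
    LogisticRegression(max_iter=1000), X, y,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    return_estimator=True,
)

# One coefficient vector per fitted model, then pairwise cosine similarities.
coefs = np.vstack([est.coef_.ravel() for est in res["estimator"]])
sims = cosine_similarity(coefs)
off_diag = sims[~np.eye(len(coefs), dtype=bool)]
print(off_diag.mean())  # close to 1 means the models barely "rotate" across folds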

Learning curve in sklearn

Note that learning curves do something slightly different. If you look at the learning curve code, it keeps the test fold the same size and uses different (increasing) percentages of the training data. See this snippet from the sklearn GitHub:

for train, test in cv_iter:
    for n_train_samples in train_sizes_abs:
        train_test_proportions.append((train[:n_train_samples], test))

In other words, part of the uncertainty (as well as the shape of the curves) is due to the fraction of the train-fold-data used for training the classifiers.

sbjelogr · Apr 15 '21 09:04

The reason we have metric_uncertainty in probatus is because reporting uncertainty in model performance is often overlooked. It's an important part of "validate binary classification models and the data used to develop them".

That said, maintainability is definitely an issue and I do see why you would like to remove it. And while I really like the idea of cosine similarities, I am not in favour of adding more functionality, but rather seeing if we can simplify it further (see vision).

I don't want to remove it entirely, but I do think we can simplify it considerably. At its core, we're trying to assess how much luck we had with our model results, i.e. how much randomness was at play. I don't find the difference between SplitSeedVolatility and TrainTestVolatility obvious from the documentation. Basically you want to vary the model seed and the split seed. If we let the user define a clf, an appropriate cv, and optionally a list of scorers, then we could build something like this:

cv = <your cv>
clf = <your model>

vol = MetricVolatility(clf, cv)
vol.fit_compute(X, y, n_repeats = 10)

# where .fit() would be something like
import random

metrics = {'auc': []}
for _ in range(n_repeats):
    # set clf random_state; of course in practice this part would be more robust,
    # checking if the attribute exists, etc.
    clf.random_state = random.randint(0, 2**32 - 1)
    for train_idx, test_idx in cv.split(X, y):
        # fit, predict, compute performance metrics on the test fold
        metrics['auc'].append(auc)
Then vol.plot(metric='auc') would just show the distribution of AUCs.
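The plot itself could stay trivial, something like this sketch (assuming metrics is the dict collected in .fit() above):

import matplotlib.pyplot as plt
import numpy as np

aucs = metrics["auc"]
plt.hist(aucs, bins=20)
plt.axvline(np.mean(aucs), linestyle="--")  # mark the mean for reference
plt.xlabel("AUC")
plt.ylabel("count")
plt.title(f"AUC distribution over {len(aucs)} fits")
plt.show()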

For varying the training_size, that should probably not be part of the metric volatility, but part of a different assessment where you consider whether you have enough data for your problem. The learning curve seems like a good fit for that, but it was already decided not to wrap it in probatus (see https://github.com/ing-bank/probatus/issues/88)

Of course I would also be willing to pick up building this somewhere in the next couple of weeks.

timvink · Apr 15 '21 12:04

I see the point of keeping it, but indeed we need to keep it as simple as possible. If we do want to keep it, I think we should first plan this feature with minimum requirements in mind.

I see two options:

  • User passes a CV parameter (the number of folds), and we apply RepeatedCV, or apply the cv multiple times with a different seed, using this number of folds and n_iterations. Optionally, if the user passes a CV generator with a random seed already set, we give a warning that we will overwrite it in subsequent runs. The returned report contains information about the different runs and CV folds, and the plots can be similar to the current ones. We also need to think about a **kwargs parameter, i.e. what the user can pass to model.fit() or to the CV, to provide as much flexibility as possible.
  • User passes the size of the test split and we apply StratifiedShuffleSplit with n_iter, and again run cross-validation. In this case we probably also need to allow passing **kwargs in the init to this method. This option can also be incorporated into option one if we document it well and provide a tutorial for the user. It could also be something we add later, but it is good to think about it already.

I am in favour of starting with option one, with as much flexibility as we can give the user and as little code as possible.
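A rough sketch of what option one could boil down to (the names here are hypothetical, not a final API; only standard sklearn pieces are used):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

def metric_volatility(clf, X, y, n_folds=5, n_iterations=10, scoring="roc_auc"):
    """Run CV n_iterations times with a different seed and collect per-fold scores."""
    scores = []
    for seed in range(n_iterations):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        scores.append(cross_val_score(clone(clf), X, y, cv=cv, scoring=scoring))
    return np.array(scores)  # shape (n_iterations, n_folds)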

Features to drop completely:

  • bootstrapping and sampling of each subset
  • statistical tests between the results of the different folds

Parameters to add to init to be consistent with other modules:

  • scoring - accepting list?
  • verbose
  • n_iter
  • cv
  • n_jobs

All these parameters should be implemented as similarly as possible to the other modules. My main concern is that we will need much more code and docs to cover the parameters than actual code that executes the metric volatility.
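For illustration, the init could look something like this (a hypothetical sketch mirroring the parameter names above, not an agreed API):

class MetricVolatility:
    def __init__(self, clf, scoring="roc_auc", cv=5, n_iter=10, n_jobs=None, verbose=0):
        # scoring may be a single scorer name or a list of scorer names
        self.clf = clf
        self.scoring = scoring
        self.cv = cv
        self.n_iter = n_iter
        self.n_jobs = n_jobs
        self.verbose = verbose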

I think you can start designing the feature and see how much complexity it has. Let's reevaluate it based on the initial design somewhere next week. What do you think?

Matgrb · Apr 16 '21 07:04

This module is dropped in favor of maintainability.

ReinierKoops · Mar 17 '24 21:03