Remove metric volatility module
Issue Description
Currently, the metric volatility module implements classes that assess performance volatility when you split the data into train/test sets with different seeds. However, I don't see a major difference from just:
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20)
cross_val_score(clf, X, y, cv=cv)
or
cv = StratifiedShuffleSplit(n_splits=100)
cross_val_score(clf, X, y, cv=cv)
Also, the way it can be used is to measure the stability of the results for different sizes of test data, but that is exactly what the learning curve in sklearn does.
Do you see any other use cases for it, and a reason to keep it? @timvink @sbjelogr
On top of that, I don't think the module is being used by users. I think it would be best to completely remove the module from probatus and decrease the maintenance workload.
Hi @Matgrb, I still see the value of keeping the Volatility estimators, but I do believe that a code refactor is needed.
Indeed, all the fit methods of the different volatility estimators can be replaced by the code snippets you propose (with some adjustments).
But besides wrapping the code around sklearn, the metric volatility also returns all the stats of the different distributions and their visualizations.
By using the sklearn CV or Split classes, you can also compute the uncertainty of multiple metrics in one go, which would be a nice addition.
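For instance, a minimal sketch of computing several metrics in one go with plain sklearn could look like the following (the dataset, model and scorer names are just illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

X, y = make_classification(n_samples=1000, random_state=0)
clf = RandomForestClassifier(random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
results = cross_validate(clf, X, y, cv=cv, scoring=["roc_auc", "accuracy"])

# Each entry holds one score per split, so the spread across splits
# is exactly the metric volatility we are discussing.
print(results["test_roc_auc"].mean(), results["test_roc_auc"].std())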
Potential add-ons to volatility estimators
On top of this, if you use cross_validate instead of cross_val_score and return the estimators, you can do quite a few nice bias-variance estimations of the different models.
As an example, the first thing that comes to my mind is to compute the cosine similarities between the coefficients of all the LogisticRegression models fitted during the cross_validation step: this gives you a way to measure by how much your models are changing with varying data. In the best case scenario, the cosine similarity would always be 1, and the lower the value, the more your models are "rotating in the feature space" (in other words, you have a higher variance effect due to the data size).
Probably a similar evaluation could be done with Tree Ensembles (maybe by doing a similar test on feature_importance or shap values), but in this context the geometrical interpretation does not hold anymore :)
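A minimal sketch of that cosine-similarity idea, assuming a plain sklearn setup (the dataset and model settings are just illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",
    return_estimator=True,
)

# Stack the coefficient vectors of the fitted models and compare them pairwise.
coefs = np.vstack([est.coef_.ravel() for est in cv_results["estimator"]])
similarities = cosine_similarity(coefs)

# Values close to 1 mean the fitted models barely "rotate" between folds.
print(similarities)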
Learning curve in sklearn
Note that learning curves do something slightly different. If you look at the learning curve code, it keeps the test fold the same size and uses different (increasing) percentages of the training data. See this snippet of code from the sklearn GitHub:
for train, test in cv_iter:
    for n_train_samples in train_sizes_abs:
        train_test_proportions.append((train[:n_train_samples], test))
In other words, part of the uncertainty (as well as the shape of the curves) is due to the fraction of the train-fold-data used for training the classifiers.
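For comparison, a minimal sketch of calling sklearn's learning_curve directly (dataset and parameters are illustrative), just to show that it varies the training fraction rather than the split seed:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(random_state=0),
    X, y,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="roc_auc",
)

# test_scores has shape (n_train_sizes, n_cv_folds): the spread per row mixes
# split-to-split variation with the effect of the training fraction.
print(test_scores.mean(axis=1), test_scores.std(axis=1))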
The reason we have metric_uncertainty in probatus is because reporting uncertainty in model performance is often overlooked. It's an important part of "validate binary classification models and the data used to develop them".
That said, maintainability is definitely an issue and I do see why you would like to remove it. And while I really like the idea of cosine similarities, I am not in favour of adding more functionality, but rather seeing if we can simplify it further (see vision).
I don't want to remove it entirely, but I do think we can simplify it considerably. At its core, we're trying to assess how much luck we had with our model results, how much randomness was at play. I don't find the difference between SplitSeedVolatility and TrainTestVolatility obvious from the documentation. Basically, you want to vary the model seed and the split seed. If we let the user define a clf, an appropriate cv and optionally a list of scorers, then we could build something like this:
cv = <your cv>
clf = <your model>
vol = MetricVolatility(clf, cv)
vol.fit_compute(X, y, n_repeats=10)

# where .fit() would be something like
metrics = {'auc': []}
for _ in range(n_repeats):
    # Set clf random_state. Of course, in practice this part would be more robust,
    # checking if the attribute exists, etc.
    clf.random_state = random.randint(0, 2**32 - 1)
    for train_idx, test_idx in cv.split(X, y):
        # fit on the train fold, predict on the test fold, compute performance metrics
        metrics['auc'].append(auc)
Then vol.plot(metric='auc') would just show the distribution of AUCs.
Varying the training_size should probably not be part of the metric volatility, but part of a different assessment where you consider whether you have enough data for your problem. The learning curve seems like a good fit for that, but it was already decided not to wrap it in probatus (see https://github.com/ing-bank/probatus/issues/88).
Of course, I would also be willing to pick up building this somewhere in the next couple of weeks.
I see the point of keeping it, but indeed we need to keep it as simple as possible. If we do want to keep it, I think we should first plan this feature in a minimum-requirements way.
I see two options:
- User passes a CV parameter (number of folds), and we apply RepeatedCV, or apply the CV multiple times with different seeds, on this number of folds for n_iterations. Optionally, if the user passes a CV generator with a random seed already set, we give a warning that we will overwrite it in subsequent runs. The returned report contains information about the different runs and the CV, and the plots can be similar to what we have now. We also need to think about a **kwargs parameter that the user can pass to model.fit() or to the CV, to provide as much flexibility as possible.
- User passes the size of the test split and we apply StratifiedShuffleSplit with n_iter, and again run cross-validation. In this case we probably also need to allow passing **kwargs to this method in init. This option can also be incorporated into option one, if we document it well and provide a tutorial for the user on how to do it. It can also be something we add later, but it is already good to think about.
I am in favour of starting with option one, with as much flexibility as we can give to the user, and as little code as possible.
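For illustration, a rough sketch of how little code option one could take, assuming we simply re-run a stratified CV with a different seed per iteration (the helper name and signature below are hypothetical, not existing probatus API):

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

def metric_volatility(clf, X, y, n_splits=5, n_iterations=10, scoring="roc_auc"):
    """Re-run cross-validation with a different split seed per iteration
    and collect the per-fold scores (hypothetical helper)."""
    scores = []
    for seed in range(n_iterations):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        scores.append(cross_val_score(clf, X, y, cv=cv, scoring=scoring))
    # shape: (n_iterations, n_splits); the spread is the reported volatility
    return np.vstack(scores)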
Features to drop completely:
- bootstrapping and sampling of each subset
- statistical tests between the results of the different folds
Parameters to add to init to be consistent with other modules:
- scoring - should it accept a list?
- verbose
- n_iter
- cv
- n_jobs

All these parameters should be implemented in as similar a way as possible to the other modules. My main concern is that we will need much more code and docs to cover the parameters than actual code that executes the metric volatility.
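Purely as an illustration of the parameter surface being discussed, a hypothetical init sketch (names and defaults are assumptions, not the final design):

class MetricVolatility:
    """Hypothetical sketch of the simplified module's interface."""

    def __init__(self, clf, cv=5, scoring="roc_auc", n_iter=10, n_jobs=-1, verbose=0):
        # scoring could be a single scorer name or a list of names,
        # mirroring what sklearn's cross_validate accepts.
        self.clf = clf
        self.cv = cv
        self.scoring = scoring
        self.n_iter = n_iter
        self.n_jobs = n_jobs
        self.verbose = verbose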
I think you can start designing the feature and see how much complexity it has. Let's reevaluate it based on the initial design somewhere next week. What do you think?
This module is dropped in favor of maintainability.