
Evaluate metrics over time

Open gverbock opened this issue 3 years ago • 10 comments

Problem Description It would be great if Probatus could report a metric (including its volatility) over time, so that drops in model performance can be spotted easily. The time aggregation level (day, month, quarter) would be chosen by the user.

Desired Outcome The output would be a dataframe containing the following columns: dates, metric1, metric2. That output could then be used to plot the metrics over time. The possibility to evaluate on out-of-time data would also be required.

Solution Outline Maybe incorporate it in the metric_volatility class: pass a series with the dates to aggregate on, and use a groupby before computing the metrics.
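A rough sketch of the groupby idea (the helper name and column names are illustrative, not an existing Probatus API):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def metric_over_time(dates, y_true, y_proba, freq="M"):
    """Compute AUC per time bucket; returns a DataFrame indexed by period."""
    df = pd.DataFrame({"date": pd.to_datetime(dates),
                       "y_true": y_true, "y_proba": y_proba})
    grouped = df.groupby(df["date"].dt.to_period(freq))
    return grouped.apply(
        lambda g: roc_auc_score(g["y_true"], g["y_proba"])
    ).rename("auc").to_frame()

# toy usage: 300 daily observations with a weakly informative score
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", periods=300, freq="D")
y = np.tile([0, 1], 150)
proba = np.clip(y * 0.6 + rng.random(300) * 0.4, 0, 1)
print(metric_over_time(dates, y, proba))
```

The same pattern generalizes to multiple metrics by applying each scorer inside the groupby.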

gverbock avatar Feb 10 '21 17:02 gverbock

I'm not sure, but is this something that can be done with popmon?

operte avatar Feb 15 '21 15:02 operte

Have you thought about what a potential API would look like (pseudo code)?

timvink avatar Feb 15 '21 16:02 timvink

I think this could be done by extending BaseVolatilityEstimator and implementing something similar to TrainTestVolatility with one crucial difference:

When you split the data into train and test, you take the time column into account:

  • Stratify the split based on the time column; this gives train and test samples that span the entire time range. Repeating this split multiple times allows you to plot the out-of-sample volatility over time.
  • Split the data into multiple time-based folds. At each iteration, test on one fold and train on the remaining folds. To get time-based volatility, you can apply bootstrapping on the train and test folds. This basically tells you how different and volatile a given time-based fold is when predicted by a model trained on the other folds.
  • Split the data into multiple time-based folds, then apply the scheme shown in the image below. To get the volatility, you can again apply bootstrapping on the train and test folds. This tells you how volatile the model is with out-of-time (OOT) splits, and how much training data you need for a stable OOT result.

[image: diagram of the time-based fold scheme]

The first option seems the easiest to implement, reusing most of the existing code. The remaining two would be more difficult, so the first one would be a good starting point.
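A rough sketch of that first option, using scikit-learn's train_test_split with a coarse time bucket as the stratification key (this is only a sketch of the idea, not a proposed final API):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy data: one observation per day over 2021
rng = np.random.default_rng(42)
n = 360
X = pd.DataFrame({"x": rng.normal(size=n)})
y = pd.Series(rng.integers(0, 2, size=n))
dates = pd.Series(pd.date_range("2021-01-01", periods=n, freq="D"))

# stratify on the month, so both splits cover the full time span
time_bucket = dates.dt.to_period("M").astype(str)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=time_bucket, random_state=0
)

# every month appears in both train and test
print(sorted(dates.loc[X_train.index].dt.month.unique()))
print(sorted(dates.loc[X_test.index].dt.month.unique()))
```

Repeating this split with different random seeds gives the distribution of the out-of-sample metric per time bucket.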

In all cases, you would need a new plotting method that would be similar for all time-based metric volatility estimators. You could also make another base class, BaseTimeBasedVolatilityEstimator, which overwrites the plot method of BaseVolatilityEstimator.

Regarding the use of popmon: we could try to use it for plotting, but I think this is a minor part of the feature, and we could get a more efficient implementation if we do it ourselves.

Matgrb avatar Feb 16 '21 10:02 Matgrb

My thoughts were to start simple:

Having something like

    class PerformanceOverTimeEstimator:
        def __init__(self, model, X, y, scorer_list, dates, frequency):
            ...

        def bootstrap_process(self, n_boot=1000):
            # probabilities from the already-fitted model
            X_proba = self.model.predict_proba(self.X)
            results = []
            for _ in range(n_boot):
                X_boot, y_boot = time_stratified_sampling(self.X, self.y, self.dates, self.frequency)
                scores = compute_scores_over_time(X_boot, y_boot, self.dates, self.frequency, self.scorer_list)
                results.append(scores)
            return results

        def plot_results(self):
            ...

        def results_as_table(self):
            ...

I had in mind to pass a fitted model as an argument, so that hyperparameter optimization is done outside the class.

gverbock avatar Feb 17 '21 10:02 gverbock

Possible improvements:

  • X_proba could be computed with cross-validation using cross_val_predict, to ensure there is no leakage.
  • Let's try to stick to the probatus API: init (clf, metrics, ...), fit(X, y, ...), compute(metrics, ...), plot().
  • The clf provided by the user can be a model, or a model wrapped in GridSearchCV that performs hyperparameter optimization at each training, so you don't have to worry too much about the hyperparameter optimization.
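For the first point, a minimal example of getting leakage-free out-of-fold probabilities (toy data, any estimator with predict_proba would work):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Each sample is scored by a model that never saw it during training,
# so the per-period metrics are not inflated by leakage.
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
print(proba.shape)  # one out-of-fold probability per sample
```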

What would the time_stratified_sampling and compute_scores_over_time do? Also what would the frequency parameter do?

What would be the use case for this code? Could you provide an example of what this analysis tells you about the model/data?

Matgrb avatar Feb 17 '21 12:02 Matgrb

Good points Mateusz.

  • Frequency would set the level of aggregation over time: monthly, quarterly, ...
  • time_stratified_sampling would ensure the bootstrap sample is homogeneously distributed across time.
  • compute_scores_over_time would compute the score metrics for each unit of time (month, quarter).
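A sketch of what time_stratified_sampling could look like: bootstrap within each time bucket, so the resample keeps the original distribution over time (the function name comes from the pseudocode above; the implementation here is only an assumption of mine):

```python
import numpy as np
import pandas as pd

def time_stratified_sampling(df, date_col="date", freq="M", random_state=None):
    """Bootstrap rows within each time bucket, preserving bucket sizes."""
    rng = np.random.default_rng(random_state)
    buckets = df.groupby(df[date_col].dt.to_period(freq))
    return pd.concat([
        g.sample(n=len(g), replace=True,
                 random_state=int(rng.integers(2**32 - 1)))
        for _, g in buckets
    ])

# toy usage: 120 daily rows spanning four months
dates = pd.date_range("2021-01-01", periods=120, freq="D")
df = pd.DataFrame({"date": dates, "y": np.arange(120)})
boot = time_stratified_sampling(df, random_state=0)
print(boot["date"].dt.to_period("M").value_counts().sort_index())
```

Because sampling happens per bucket, each period contributes exactly as many rows to the bootstrap sample as it has in the original data.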

The benefit of the new code is that the user sees, for example, the AUC over time and can easily spot performance degradation in specific months (say, Covid or summer holidays).

This helps you assess the impact of unexpected changes (for example Covid, a crisis, bad publicity), and understanding the reason may also reveal weaknesses of the model. For example, if the model starts to deteriorate once mortgage production increases, you could try to mitigate this by adding features related to mortgages.

gverbock avatar Feb 17 '21 12:02 gverbock

So to summarize:

  1. Compute probabilities for the entire X using Cross-Validation
  2. Split data into time buckets
  3. For each window of data, randomly sample examples multiple times (bootstrapping), and measure the metric e.g. AUC multiple times.
  4. Produce a plot and report on the volatility of the metric in each time bucket
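In code, the four steps could look roughly like this (everything here is illustrative, not a proposed implementation):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=600, random_state=1)
dates = pd.Series(pd.date_range("2021-01-01", periods=600, freq="D"))

# 1. out-of-fold probabilities via cross-validation
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")[:, 1]

# 2.-3. bucket by month, bootstrap the metric inside each bucket
df = pd.DataFrame({"period": dates.dt.to_period("M"), "y": y, "p": proba})
records = []
for period, g in df.groupby("period"):
    aucs = []
    for _ in range(50):  # bootstrap iterations
        b = g.sample(n=len(g), replace=True,
                     random_state=int(rng.integers(1 << 31)))
        if b["y"].nunique() == 2:  # AUC needs both classes present
            aucs.append(roc_auc_score(b["y"], b["p"]))
    records.append({"period": period, "auc_mean": np.mean(aucs),
                    "auc_std": np.std(aucs)})

# 4. report (plotting left out)
report = pd.DataFrame(records)
print(report.head())
```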

Is that correct?

I like the approach for its simplicity. It tells you which periods of time to be cautious about, and can point to possible data drifts. It is similar to another issue #72, but it focuses on how the performance of target prediction changes over time.

The limitation I see is that when you compute the probabilities for X using CV, the model is trained on data from the entire time span: for a sample in the middle of the dataset, the model has seen samples both before and after it.

Let's also ask the others what they think. @timvink @anilkumarpanda @operte

Matgrb avatar Feb 17 '21 13:02 Matgrb

You understood it correctly.

I am not sure the limitation you raise about cross-validation would have a large impact.

gverbock avatar Feb 17 '21 13:02 gverbock

Indeed probably low impact 👍

However, I would reach out to a couple of users and see if they would find it useful for their projects.

Matgrb avatar Feb 17 '21 13:02 Matgrb

Is this still a feature that we want to work on @gverbock @anilkumarpanda ?

ReinierKoops avatar Mar 17 '24 21:03 ReinierKoops