
Current performance evaluation objects, recently added to TunedModel histories, are too big

Open ablaom opened this issue 1 year ago • 2 comments

There's evidence that the recent addition of full PerformanceEvaluation objects to TunedModel histories is blowing up memory requirements in real use cases.

I propose that we create two performance evaluation objects - a detailed one (as we have now) and a new CompactPerformanceEvaluation object. The evaluate method gets a new keyword argument compact=false and TunedModel gets a new hyperparameter compact_history=true. (This default would technically break MLJTuning, but I doubt it would affect more than one or two users - and the recent change is not actually documented anywhere yet.)
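To make the proposal concrete, here is a rough sketch of how the two new keywords might be used. Nothing below is implemented: compact and compact_history are the names proposed above, and the model, data, and tuning range are placeholders only.

```julia
using MLJ

X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
tree = Tree()

# Proposed: `compact=true` would return a slimmed-down CompactPerformanceEvaluation
# instead of the full PerformanceEvaluation object.
e = evaluate(tree, X, y; resampling=CV(nfolds=6), measure=log_loss, compact=true)

# Proposed: `compact_history=true` (the suggested default) would store only the
# compact evaluations in the TunedModel history, keeping it small.
tuned_tree = TunedModel(
    model=tree,
    range=range(tree, :max_depth, lower=1, upper=6),
    measure=log_loss,
    resampling=CV(nfolds=6),
    compact_history=true,
)
```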

This would also allow us to ultimately address https://github.com/alan-turing-institute/MLJ.jl/issues/575, which was shelved for fear of making evaluation objects too big.

Further thoughts, anyone?

cc @CameronBieganek, @OkonSamuel

Below are the fields of the current struct. I've ticked off suggested fields for the compact case. I suppose the only one that might be controversial is observations_per_fold. This was always included in TunedModel histories previously, so it seems less disruptive to include it.

Fields

These fields are part of the public API of the PerformanceEvaluation struct.

  • [x] model: model used to create the performance evaluation. In the case of a tuning model, this is the best model found.

  • [x] measure: vector of measures (metrics) used to evaluate performance

  • [x] measurement: vector of measurements - one for each element of measure - aggregating the performance measurements over all train/test pairs (folds). The aggregation method applied for a given measure m is StatisticalMeasuresBase.external_aggregation_mode(m) (commonly Mean() or Sum())

  • [x] operation (e.g., predict_mode): the operations applied for each measure to generate predictions to be evaluated. Possibilities are: $PREDICT_OPERATIONS_STRING.

  • [x] per_fold: a vector of vectors of individual test fold evaluations (one vector per measure). Useful for obtaining a rough estimate of the variance of the performance estimate.

  • [x] per_observation: a vector of vectors of vectors containing individual per-observation measurements: for an evaluation e, e.per_observation[m][f][i] is the measurement for the ith observation in the fth test fold, evaluated using the mth measure (see the sketch following this list). Useful for some forms of hyper-parameter optimization. Note that the aggregated measurement is repeated across all observations in a fold whenever the measure satisfies StatisticalMeasures.can_report_unaggregated(measure) == false. If e has been computed with the per_observation=false option, then e.per_observation is a vector of missings.

  • [ ] fitted_params_per_fold: a vector containing fitted_params(mach) for each machine mach trained during resampling - one machine per train/test pair. Use this to extract the learned parameters for each individual training event.

  • [ ] report_per_fold: a vector containing report(mach) for each machine mach trained during resampling - one machine per train/test pair.

  • [ ] train_test_rows: a vector of tuples, each of the form (train, test), where train and test are vectors of row (observation) indices for training and evaluation respectively.

  • [x] resampling: the resampling strategy used to generate the train/test pairs.

  • [x] repeats: the number of times the resampling strategy was repeated.
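For anyone wanting to poke at the current objects, here is a minimal sketch of how the fields above are accessed; the model, data, and measures are arbitrary examples, and the comments about what a compact object would drop just restate the ticks above.

```julia
using MLJ

X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
e = evaluate(Tree(), X, y; resampling=CV(nfolds=3), measure=[log_loss, accuracy])

e.measure                    # the two measures above
e.measurement                # one aggregated value per measure
e.per_fold[1]                # three per-fold log_loss values
e.per_observation[1][2][5]   # log_loss for observation 5 in test fold 2
e.fitted_params_per_fold     # bulky; would be dropped from the compact object
e.report_per_fold            # likewise dropped
e.train_test_rows            # likewise dropped
```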

ablaom · Apr 17 '24 23:04

Also relevant: https://github.com/alan-turing-institute/MLJ.jl/issues/1025

ablaom · Apr 17 '24 23:04

To do:

  • [ ] https://github.com/JuliaAI/MLJBase.jl/pull/973
  • [ ] https://github.com/JuliaAI/MLJTuning.jl/pull/215

ablaom · Apr 24 '24 02:04