
[ENH] forecasting benchmarking task experiment

Open fkiraly opened this issue 4 months ago • 4 comments

This PR adds a SktimeForecastingTask, which defines a full benchmarking run for a forecaster that is passed later in _evaluate.

This object could be used as a "task" in the sktime ForecastingBenchmark.

Draft for discussion and reviewing the design:

  • it is quite similar to, and partially duplicative of, SktimeForecastingExperiment, which is used in tuning. How should we deal with the similarity and intersection?
    • we could merge them into a single class, branching on whether a forecaster gets passed or not. Not sure where that leads, though.
  • is this a possible 1:1 drop-in (or almost) for the task object in sktime? A sketch of what that could look like follows below.
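To make the drop-in question concrete, here is a rough sketch contrasting the current sktime usage with what a task-object drop-in might look like. The ForecastingBenchmark calls follow sktime's documented API; the SktimeForecastingTask constructor arguments and the single-argument add_task call are assumptions for illustration, not an existing interface:

```python
from sktime.benchmarking.forecasting import ForecastingBenchmark
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster
from sktime.performance_metrics.forecasting import MeanSquaredPercentageError
from sktime.split import ExpandingWindowSplitter

benchmark = ForecastingBenchmark()
benchmark.add_estimator(NaiveForecaster(strategy="mean", sp=12))
cv = ExpandingWindowSplitter(initial_window=24, step_length=12, fh=12)

# current sktime style: the "task" is passed as loader, splitter, scorers
benchmark.add_task(load_airline, cv, [MeanSquaredPercentageError()])

# hypothetical drop-in style: one task object bundling the same pieces;
# the constructor arguments below are assumed, not the PR's actual signature
# task = SktimeForecastingTask(
#     dataset_loader=load_airline,
#     cv_splitter=cv,
#     scorers=[MeanSquaredPercentageError()],
# )
# benchmark.add_task(task)

results = benchmark.run("results.csv")
```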

fkiraly avatar Aug 24 '25 14:08 fkiraly

@arnavk23, can you kindly explain what you corrected and why?

fkiraly avatar Nov 22 '25 11:11 fkiraly

> @arnavk23, can you kindly explain what you corrected and why?

  1. Added validation for forecaster in params. The original version assumed params["forecaster"] always existed. I added an explicit check with a clear error message, because missing or incorrect parameters otherwise raise cryptic errors deep inside sktime's evaluate (see the first sketch after this list).

  2. Made scoring metric handling more robust. The previous code assumed that any scoring object implements get_tag("lower_is_better"). I wrapped this in a try/except and added correct defaults for both cases (scoring=None or custom metrics).

  3. Safely applied the higher_is_better tag. Previously, set_tags() was called without handling the case where it fails or is not supported.

  4. Improved parsing of the output from sktime.evaluate(). The previous implementation assumed that the result is always a DataFrame and that the scoring column name is always exactly "test_<scoring.name>". I added support for both DataFrame-like and dict-like outputs, a fallback to the first available test_* column if the expected name isn't present, and a warning when the fallback happens (see the second sketch after this list).

  5. Better error handling during evaluate. Previously, any exception inside evaluate() could crash the run or create inconsistent behavior. Now error_score="raise" preserves the expected raising behavior; otherwise the method returns (error_score, {"error": <message>}).

  6. Robust conversion of results to a scalar. The earlier implementation assumed you can always do float(results.mean()). I added use of np.nanmean, a fallback to np.asarray if needed, and structured error reporting if even that fails.
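A minimal sketch of the defensive setup described in points 1–3; the helper name _check_setup and the fallback default are illustrative, not the code in this PR:

```python
import warnings


def _check_setup(params, scoring):
    # point 1: fail early with a clear message if the forecaster is missing,
    # rather than letting sktime's evaluate raise a cryptic error later
    if "forecaster" not in params or params["forecaster"] is None:
        raise ValueError(
            "params must contain a 'forecaster' entry with an sktime forecaster"
        )

    # points 2-3: not every scoring object supports the tag interface,
    # so guard get_tag and fall back to a sensible default
    lower_is_better = True  # sktime error metrics are lower-is-better by default
    if scoring is not None:
        try:
            lower_is_better = scoring.get_tag("lower_is_better")
        except (AttributeError, ValueError):
            warnings.warn(
                "scoring does not expose the 'lower_is_better' tag; "
                "assuming lower is better"
            )
    return params["forecaster"], lower_is_better
```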
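And a sketch of the result handling from points 4–6, returning the same (score, info) shape mentioned in point 5; the helper name _extract_score is hypothetical:

```python
import warnings

import numpy as np
import pandas as pd


def _extract_score(results, scoring_name, error_score=np.nan):
    # point 4: accept both DataFrame-like and dict-like evaluate outputs
    if isinstance(results, dict):
        results = pd.DataFrame(results)

    # point 4: fall back to the first test_* column if the expected
    # "test_<scoring.name>" column is not present, warning on fallback
    expected = f"test_{scoring_name}"
    if expected in results.columns:
        col = results[expected]
    else:
        candidates = [c for c in results.columns if c.startswith("test_")]
        if not candidates:
            return error_score, {"error": "no test_* column in evaluate output"}
        warnings.warn(f"column {expected!r} not found; using {candidates[0]!r}")
        col = results[candidates[0]]

    # point 6: nan-aware aggregation, with an array fallback and
    # structured error reporting if even that fails
    try:
        return float(np.nanmean(col)), {}
    except TypeError:
        try:
            return float(np.nanmean(np.asarray(col, dtype=float))), {}
        except (TypeError, ValueError) as exc:
            return error_score, {"error": str(exc)}
```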

arnavk23 avatar Nov 22 '25 12:11 arnavk23

@arnavk23, is this AI generated?

fkiraly avatar Nov 28 '25 00:11 fkiraly

> @arnavk23, is this AI generated?

Yes, the remark is AI-generated.

arnavk23 avatar Nov 28 '25 00:11 arnavk23