
Cannot call need_retrain when best model is Ensemble

a88zach opened this issue 6 months ago • 4 comments

When the best model is an Ensemble and you reload the model from a results path, calling need_retrain on it results in the following error:

File "/Users/zachtindall/Documents/repos/ml-build/.venv/lib/python3.11/site-packages/supervised/automl.py", line 565, in need_retrain
    return self._need_retrain(X, y, sample_weight, decrease)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zachtindall/Documents/repos/ml-build/.venv/lib/python3.11/site-packages/supervised/base_automl.py", line 2479, in _need_retrain
    metric = self._best_model.get_metric()
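
A minimal sketch of the call pattern that triggers this, assuming a completed AutoML run whose best model is an Ensemble is saved under the results path; the path, x_test, and y_test are placeholders:

    from supervised.automl import AutoML

    # Placeholder path to a finished AutoML run whose best model is an Ensemble
    auto_ml = AutoML(results_path="AutoML_results")

    # Fails with the traceback above: _need_retrain dereferences
    # self._best_model, which presumably isn't populated until load() runs
    auto_ml.need_retrain(x_test, y_test)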

If I call load manually on the model after it has been initialized, then it works as intended.

Example

    auto_ml = AutoML(
        results_path=self._results_path,
    )

    # Workaround: explicitly load the saved models before calling need_retrain
    auto_ml.load(self._results_path)

    need_retrain = auto_ml.need_retrain(
        x_test, y_test, sample_weight=sample_weight, decrease=decrease
    )

I wouldn't expect load to need to be called manually, since other methods like predict can be called right after initializing a model from a results path.

a88zach avatar Jun 04 '25 16:06 a88zach

Thank you @a88zach for reporting the issue. Looks like a bug in need_retrain.

Were you able to create models for your use case? Are you happy with AutoML results?

pplonski avatar Jun 05 '25 07:06 pplonski

@pplonski very happy so far with the results. The resulting model has been much better than AutoGluon and H2O. There are a few missing features, but overall 🥇

a88zach avatar Jun 05 '25 15:06 a88zach

Thanks @a88zach, what features would you like to see in MLJAR?

pplonski avatar Jun 06 '25 07:06 pplonski

@pplonski Two that come to mind:

  1. Our data is grouped (rows share a container_cycle_id), which forces us to use a custom CV strategy to split it into train/validation sets. It would be nice to have a parameter so the group shuffle split can be done by the library itself.

Our current workaround

    from typing import List, Tuple

    import numpy as np
    from pandas import DataFrame
    from sklearn.model_selection import GroupShuffleSplit

    def _generate_cv_splits(
        self, df: DataFrame
    ) -> List[Tuple[np.ndarray, np.ndarray]]:
        # Keep every row with the same container_cycle_id in one fold so
        # grouped samples never leak between train and validation
        splitter = GroupShuffleSplit(
            n_splits=self._num_splits,
            test_size=self._test_size,
            random_state=self._random_state,
        )
        split = splitter.split(df, groups=df["container_cycle_id"])

        return [(train_idx, validation_idx) for train_idx, validation_idx in split]
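
For completeness, a minimal sketch of how these splits could be handed back to AutoML, assuming the custom validation strategy ({"validation_type": "custom"}) accepts precomputed (train, validation) index pairs via fit(..., cv=...); feature_cols and target_col are placeholder names:

    # Sketch only: feed the group-aware splits to AutoML via its custom
    # validation strategy; feature_cols/target_col are hypothetical
    auto_ml = AutoML(
        results_path=self._results_path,
        validation_strategy={"validation_type": "custom"},
    )
    cv = self._generate_cv_splits(df)
    auto_ml.fit(df[feature_cols], df[target_col], cv=cv)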
  2. Making the model production-ready. We create many small models and serve them with Google Cloud Run. The results path can get quite large depending on the allowed training time, but most of its contents are not needed to predict.

Our current workaround is posted here https://github.com/mljar/mljar-supervised/issues/405#issuecomment-2701828030
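
For illustration, a rough sketch of the kind of pruning that comment describes, assuming report artifacts (plots and markdown) are safe to delete while model binaries and JSON configs are kept; the exact keep-list is an assumption, so see the linked comment for the version we actually run:

    import os

    # Hypothetical helper: strip report-only files from the results path
    # before deploying. Which files predict() strictly needs is assumed
    # here, not verified against the library.
    REPORT_SUFFIXES = (".png", ".svg", ".md")

    def prune_results_path(results_path: str) -> None:
        for root, _dirs, files in os.walk(results_path):
            for name in files:
                if name.endswith(REPORT_SUFFIXES):
                    os.remove(os.path.join(root, name))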

a88zach avatar Jun 06 '25 14:06 a88zach