mljar-supervised
Cannot call `need_retrain` when the best model is an Ensemble
When the best model is an Ensemble and you reload the model from a results path, calling `need_retrain` on the model results in the following error:
```
File "/Users/zachtindall/Documents/repos/ml-build/.venv/lib/python3.11/site-packages/supervised/automl.py", line 565, in need_retrain
    return self._need_retrain(X, y, sample_weight, decrease)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/zachtindall/Documents/repos/ml-build/.venv/lib/python3.11/site-packages/supervised/base_automl.py", line 2479, in _need_retrain
    metric = self._best_model.get_metric()
```
If I call `load` manually on the model after it has been initialized, then it works as intended.
Example
```python
auto_ml = AutoML(
    results_path=self._results_path,
)
auto_ml.load(self._results_path)
need_retrain = auto_ml.need_retrain(
    x_test, y_test, sample_weight=sample_weight, decrease=decrease
)
```
I wouldn't expect `load` to need to be called manually, since other methods like `predict` can be called right after initializing a model from a results path.
Thank you @a88zach for reporting the issue. Looks like a bug in `need_retrain`.
Were you able to create models for your use case? Are you happy with AutoML results?
@pplonski very happy so far with the results. The resulting model has been much better so far than AutoGluon and H2O. There are a few missing features, but overall 🥇
Thanks @a88zach , what features would you like to see in MLJAR?
@pplonski 2 that come to mind:
- Our data is stratified, and this forces us to use a custom CV strategy to split the data into train/validation sets. It would be nice to have an option so the group shuffle split can be done by the library.
Our current workaround
```python
from typing import List, Tuple

import numpy as np
from sklearn.model_selection import GroupShuffleSplit


def _generate_cv_splits(
    self, df: "DataFrame"
) -> List[Tuple[np.ndarray, np.ndarray]]:
    splitter = GroupShuffleSplit(
        n_splits=self._num_splits,
        test_size=self._test_size,
        random_state=self._random_state,
    )
    split = splitter.split(df, groups=df["container_cycle_id"])
    cv: List[Tuple[np.ndarray, np.ndarray]] = []
    for train_idx, validation_idx in split:
        cv.append((train_idx, validation_idx))
    return cv
```
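For illustration, here is a minimal, self-contained sketch of the group-aware splitting behavior being requested. The DataFrame, the `group_id` column, and the split parameters are made up for this example; the point is that rows sharing a group id never land on both sides of a split:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 12 rows across 4 groups ("a".."d"), 3 rows per group.
df = pd.DataFrame({
    "feature": range(12),
    "group_id": [g for g in ("a", "b", "c", "d") for _ in range(3)],
})

# One group (test_size=0.25 of 4 groups) goes to validation per split.
splitter = GroupShuffleSplit(n_splits=3, test_size=0.25, random_state=42)
cv = list(splitter.split(df, groups=df["group_id"]))

for train_idx, val_idx in cv:
    train_groups = set(df["group_id"].iloc[train_idx])
    val_groups = set(df["group_id"].iloc[val_idx])
    # No group appears in both train and validation.
    assert train_groups.isdisjoint(val_groups)
```

The resulting `cv` list of `(train_idx, validation_idx)` pairs is exactly the shape produced by the workaround above.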
- Making the model production ready. We create many small models and serve them with Google Cloud Run. The size of the results path can get pretty large depending on the allowed training time, but most of its contents are not needed for prediction.
Our current workaround is posted here https://github.com/mljar/mljar-supervised/issues/405#issuecomment-2701828030
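The linked workaround isn't reproduced here, but the general idea of slimming a results directory can be sketched with the standard library alone: keep a whitelist of files needed at predict time and drop everything else. The file names below are placeholders, not the actual mljar-supervised results layout, which varies by version and best-model type:

```python
import shutil
from pathlib import Path

# Hypothetical whitelist -- the real set of files needed at predict
# time depends on the mljar-supervised version and the best model.
KEEP = {"params.json", "framework.json", "best_model.bin"}


def slim_results(src: str, dst: str) -> None:
    """Copy only whitelisted files from a results dir into a slim copy."""
    src_path, dst_path = Path(src), Path(dst)
    dst_path.mkdir(parents=True, exist_ok=True)
    for item in src_path.rglob("*"):
        if item.is_file() and item.name in KEEP:
            target = dst_path / item.relative_to(src_path)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(item, target)


# Example: build a fake results dir, slim it, inspect what survives.
demo = Path("demo_results")
(demo / "model_1").mkdir(parents=True, exist_ok=True)
(demo / "params.json").write_text("{}")
(demo / "training_log.txt").write_text("...")  # not needed to predict
(demo / "model_1" / "best_model.bin").write_text("weights")

slim_results("demo_results", "demo_results_slim")
kept = sorted(
    p.name for p in Path("demo_results_slim").rglob("*") if p.is_file()
)
print(kept)  # ['best_model.bin', 'params.json']
```

The slim copy preserves the directory structure, so a loader that expects the original layout still finds files at the same relative paths.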