h2o-3
GBM algorithm with H2O AutoML fails on small dataset
With H2O==3.44.0.1, the GBM algorithm within the H2O AutoML function fails for a small dataset. Following is the error message I see:
AutoML progress: | | 0% 18:23:48.910: _min_rows param, The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 111.0.
AutoML progress: |███████████████████████████████████████████████████████████████████████████████████ (failed)| 100%
18:58:48.594: GBM_grid_1_AutoML_1_20240315_182348 [GBM Grid Search] failed: java.util.NoSuchElementException: No more elements to explore in hyper-space!
Traceback (most recent call last):
File "gbm_trial.py", line 27, in
Closing connection _sid_b08a at exit H2O session _sid_b08a closed.
Following is the code I used:
import h2o
from h2o.automl import H2OAutoML

h2o.init()

fr = h2o.create_frame(rows=111, cols=29, real_fraction=1.0, categorical_fraction=0,
                      has_response=True, response_factors=2, seed=12345, missing_fraction=0.0)
aml = H2OAutoML(max_runtime_secs=10000, include_algos=["GBM"])
aml.train(x=fr.columns[:-1], y=fr.columns[-1], training_frame=fr)
h2o.shutdown()
@magrenimish Thank you for creating this issue and bringing this to our attention. AutoML should have failed with a nicer message, e.g., "No model was trained." GBM requires more data in order to be trained, as mentioned in the warning: "The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 111.0."
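For context, the check that fails here comes from GBM's min_rows parameter: the AutoML GBM grid tries values as large as 100, and the check requires at least 2 * min_rows (weighted) rows. Below is a minimal sketch, assuming the same 111-row synthetic frame, that trains a standalone GBM with a deliberately small min_rows so the check passes; the value 5 is an arbitrary illustration, not a recommendation from this thread.

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
# Same synthetic frame as in the original report.
fr = h2o.create_frame(rows=111, cols=29, real_fraction=1.0, categorical_fraction=0,
                      has_response=True, response_factors=2, seed=12345, missing_fraction=0.0)

# A standalone GBM with min_rows=5 needs only 2 * 5 = 10 (weighted) rows,
# so the 111-row frame is large enough for the split check.
gbm = H2OGradientBoostingEstimator(min_rows=5, seed=12345)
gbm.train(x=fr.columns[:-1], y=fr.columns[-1], training_frame=fr)
print(gbm.model_performance(train=True))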
@tomasfryda would it then be possible to skip or exclude the GBM algorithm in H2O AutoML without explicitly specifying it with the 'exclude_algos' parameter?
For example, with the following code:
fr = h2o.create_frame(rows=111, cols=29, real_fraction=1.0, categorical_fraction=0,
                      has_response=True, response_factors=2, seed=12345, missing_fraction=0.0)
aml = H2OAutoML(max_runtime_secs=10000)
aml.train(x=fr.columns[:-1], y=fr.columns[-1], training_frame=fr)
h2o.shutdown()
The function fails with GBM, but would it be possible to skip GBM in this case?
@magrenimish that's basically what should happen. AutoML doesn't want to know about the underlying constraints of individual models, so each model first runs its parameter/training-data validation logic, and if that fails, the model won't train. The validation logic is also responsible for emitting the warning that informs the user of what went wrong (e.g., "The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 111.0.").
It's hard to automatically exclude a whole class of models, since each model in AutoML has different parameters and the failures often depend on those parameters.
@tomasfryda thank you! So if I want the AutoML function to continue without the GBM algorithm, then I either have to explicitly exclude it with the 'exclude_algos' parameter or catch the specific error and skip the algorithm?
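For reference, the explicit-exclusion option would look something like the sketch below. It reuses the synthetic frame fr from the earlier snippets and the exclude_algos parameter already mentioned in this thread; the seed and runtime values are arbitrary.

from h2o.automl import H2OAutoML

# AutoML never builds any GBM models when GBM is listed in exclude_algos,
# so the min_rows warning is not emitted at all.
aml = H2OAutoML(max_runtime_secs=100, exclude_algos=["GBM"], seed=12345)
aml.train(x=fr.columns[:-1], y=fr.columns[-1], training_frame=fr)
print(aml.leaderboard.head())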
@magrenimish you can just ignore the warning.
When I run your code, I can still get the AutoML to train, and it looks like some GBMs have parameters that enable training with a low amount of data:
In [3]: fr = h2o.create_frame(rows=111, cols=29, real_fraction=1.0, categorical_fraction=0, has_response=True, response_factors=2, seed=12345, missing_fraction=0.0)
In [6]: from h2o.automl import H2OAutoML
In [7]: aml = H2OAutoML(max_runtime_secs=100)
In [8]: aml.train(x=fr.columns[:-1], y=fr.columns[-1], training_frame=fr)
AutoML progress: |▉ | 1%
16:41:20.27: _min_rows param, The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 111.0.
AutoML progress: |███████████████████████████████████████████████████████████████████████████████████ (done)| 100%
In [9]: aml.leaderboard
Out[9]:
model_id rmse mse mae rmsle mean_residual_deviance
------------------------------------------------------------------------ ------- ------- ------- ------- ------------------------
GBM_grid_1_AutoML_1_20240318_164118_model_49 56.1927 3157.62 49.7652 nan 3157.62
GBM_grid_1_AutoML_1_20240318_164118_model_10 56.364 3176.9 49.8405 nan 3176.9
GBM_grid_1_AutoML_1_20240318_164118_model_8 56.4429 3185.8 49.9569 nan 3185.8
GBM_grid_1_AutoML_1_20240318_164118_model_17 56.459 3187.62 50.0887 nan 3187.62
GBM_grid_1_AutoML_1_20240318_164118_model_52 56.472 3189.08 49.9436 nan 3189.08
GBM_grid_1_AutoML_1_20240318_164118_model_21 56.5403 3196.8 49.9926 nan 3196.8
GBM_grid_1_AutoML_1_20240318_164118_model_46 56.6061 3204.25 50.4635 nan 3204.25
StackedEnsemble_BestOfFamily_5_AutoML_1_20240318_164118 56.6463 3208.8 50.668 nan 3208.8
GBM_grid_1_AutoML_1_20240318_164118_model_32 56.8033 3226.62 50.2397 nan 3226.62
XGBoost_lr_search_selection_AutoML_1_20240318_164118_select_grid_model_6 56.8386 3230.63 50.8662 nan 3230.63
[166 rows x 6 columns]
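If you prefer to confirm which warnings were emitted rather than watching the console, the event log of the AutoML run can be inspected. A small sketch, assuming the aml object from the session above; event_log is a documented H2OAutoML property, but the column names used below ("timestamp", "message") are assumptions that may differ between H2O versions.

# Pull the AutoML event log into pandas and keep the min_rows entries.
events = aml.event_log.as_data_frame()
min_rows_events = events[events["message"].str.contains("min_rows", na=False)]
print(min_rows_events[["timestamp", "message"]])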