h2o-3
PUBDEV-3911: stackedensemble model summary
Code for JIRA PUBDEV-3911 (Add model summary for Stacked Ensembles in the Python API). The code is based on the existing Stacked Ensemble model summary in R from PUBDEV-5462 (Add model summary in Stacked Ensemble R binding).
Code Additions
All code is contained in the model_summary function located in the h2o-bindings/bin/gen_python.py file. Tests are located in the h2o-py/tests/testdir_algos/stackedensemble/pyunit_stackedensemble_modelsummary.py file.
Like the stacked ensemble model summary in R, this code prints base model information and metalearner information.
The base model information includes:
- The number of base models used by the ensemble (int)
- The count of occurrences of each base model algorithm type (displayed in an H2OFrame)
- If base_model_detail=True is set, model summaries for each base model are also output (displayed as H2OFrames). This is the only part that does not currently exist in the R stacked ensemble model summary.
Metalearner information includes:
- The metalearner algorithm (string)
- If the information exists (i.e., it is not None), the number of folds used by the metalearner (int) and the fold assignment
- If the information exists, the parameters of the metalearner
Since the output includes at least one H2OFrame, this function relies on print statements to display the information to the user.
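For orientation, the sketch below shows the rough shape of this logic. It is not the generated code from gen_python.py; the accessors used here (ensemble.base_models, h2o.get_model, model.algo, and the "metalearner" entry in the model output) are assumptions about how the information can be pulled on the client side.

import h2o

def model_summary_sketch(ensemble, base_model_detail=False):
    """Illustrative only: print base model and metalearner info for a Stacked Ensemble."""
    base_models = ensemble.base_models  # ids of the base learners (assumed form)
    print("Base Model Information:")
    print("Number of Base Models: %d" % len(base_models))

    # Count occurrences of each base model algorithm type, shown as an H2OFrame.
    counts = {}
    for model_id in base_models:
        algo = h2o.get_model(model_id).algo
        counts[algo] = counts.get(algo, 0) + 1
    print("Base Models (count by algorithm type):")
    print(h2o.H2OFrame({algo: [n] for algo, n in counts.items()}))

    if base_model_detail:
        # Re-use each base model's regular summary table.
        print("Base model details:")
        for model_id in base_models:
            base_model = h2o.get_model(model_id)
            print("%s:" % base_model.algo)
            print(base_model.summary())

    # Metalearner information (the "metalearner" field in the model output is assumed here).
    print("Metalearner Information:")
    meta = h2o.get_model(ensemble._model_json["output"]["metalearner"]["name"])
    print("Metalearner Algorithm: %s" % meta.algo)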
Usage
Using the stacked ensemble example from the stacked ensemble documentation:
# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(model_id="my_ensemble_binomial",
                                       base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)
# Show ensemble model summary
ensemble.model_summary() # or..
ensemble.model_summary(base_model_detail=True) # additionally outputs summaries of the base models
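For readers without the documentation open, a self-contained version of that example looks roughly like the following. It mirrors the binomial example in the Stacked Ensembles docs; the dataset URL, column names, and training parameters are placeholders from that example, not part of this PR, and can be swapped for your own data.

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

h2o.init()

# Binary-classification training frame (swap in your own data as needed).
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
y = "response"
x = train.columns
x.remove(y)
train[y] = train[y].asfactor()

# Base models must be cross-validated on the same folds and keep their CV predictions.
my_gbm = H2OGradientBoostingEstimator(nfolds=5, fold_assignment="Modulo",
                                      keep_cross_validation_predictions=True, seed=1)
my_gbm.train(x=x, y=y, training_frame=train)

my_rf = H2ORandomForestEstimator(nfolds=5, fold_assignment="Modulo",
                                 keep_cross_validation_predictions=True, seed=1)
my_rf.train(x=x, y=y, training_frame=train)

# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(model_id="my_ensemble_binomial",
                                       base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)

ensemble.model_summary(base_model_detail=True)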
Other information
Information on system and packages:
- macOS High Sierra Version 10.13.5
- Python 3.7.0
- Package versions:
- colorama==0.3.9
- future==0.16.0
- h2o==3.21.0.99999
- pandas==0.23.1
- requests==2.19.1
- scipy==1.1.0
- tabulate==0.8.2
The local build was executed with the following commands from the command line, inside a virtual environment:
git clone <forked H2O repo>
cd h2o-3
./gradlew build -x test
gradle installDist
cd h2o-py
python setup.py install
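As an optional sanity check (not part of the PR) that the locally built package is the one being imported:

# Quick check that the locally built h2o package is importable and at the expected version.
import h2o
print(h2o.__version__)  # expected: 3.21.0.99999 for this local build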
@jdubchak great contribution, thank you!!!
@michalkurka Thank you.
@jdubchak This is great! The output looks good, and I think the use of an extra argument, base_model_detail, is an excellent addition.
I think the only thing that we should change is that the method should be called .summary() instead of .model_summary(). Sorry if the JIRA was confusing; when I said "model_summary", I was referring to the R method, which is called that. In Python, it's called .summary(), and we already use .summary() for regular (non-ensemble) models. Example:
In [10]: my_gbm.summary()
Out[10]: Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
-- ----------------- -------------------------- --------------------- ----------- ----------- ------------ ------------ ------------ -------------
10 10 1618 3 3 3 8 8 8
Right now for Stacked Ensemble, the method is still there (it's present for any type of model), but we see this:
In [11]: ensemble.summary()
No model summary for this model
The summary() method pulls the output from ensemble._model_output["model_summary"], so you don't actually need to add a new method; however, we do need to populate this field. I can't find where this is created in Python, so I think that might mean this text is created/stored on the Java side. @michalkurka, do you know where to modify/set this information?
Another way to solve this in pure Python (in R we did this on the client side as well) is to create a hidden method for SEs to store the info (e.g. ensemble._ensemble_summary) and then modify the generic summary() method to pull that info if it's a Stacked Ensemble. I am also not sure how to add this extra argument, base_model_detail, if we re-use the generic summary() method. I'm sure it's doable, but I don't know how to advise adding it only for SEs. @michalkurka Please let us know what route you think is best.
Sorry, I didn't mean to hit "close". Re-opening...
I'm just adding the current output here for reference:
In [21]: ensemble.model_summary()
Base Model Information:
Number of Base Models: 2
Base Models (count by algorithm type):
gbm drf
----- -----
1 1
[1 row x 2 columns]
Metalearner Information:
Metalearner Algorithm: glm
In [22]: ensemble.model_summary(base_model_detail=True)
Base Model Information:
Number of Base Models: 2
Base Models (count by algorithm type):
gbm drf
----- -----
1 1
[1 row x 2 columns]
Base model details:
gbm:
number_of_trees min_leaves mean_depth max_leaves model_size_in_bytes min_depth mean_leaves number_of_internal_trees max_depth
----------------- ------------ ------------ ------------ --------------------- ----------- ------------- -------------------------- -----------
10 8 3 8 1618 3 8 10 3
[1 row x 9 columns]
drf:
number_of_trees min_leaves mean_depth max_leaves model_size_in_bytes min_depth mean_leaves number_of_internal_trees max_depth
----------------- ------------ ------------ ------------ --------------------- ----------- ------------- -------------------------- -----------
50 1402 20 1583 943605 20 1496.52 50 20
[1 row x 9 columns]
Metalearner Information:
Metalearner Algorithm: glm