
PUBDEV-3911: stackedensemble model summary

Open jdubchak opened this issue 6 years ago • 5 comments

Code for JIRA PUBDEV-3911, "Add model summary for Stacked Ensembles in Python API". The code is based on the existing stacked ensemble model summary in R (PUBDEV-5462, "Add model summary in Stacked Ensemble R binding").

Code Additions

All code is contained in the model_summary function in h2o-bindings/bin/gen_python.py. Tests are in h2o-py/tests/testdir_algos/stackedensemble/pyunit_stackedensemble_modelsummary.py.

Like the stacked ensemble model summary in R, this code prints base model information and metalearner information.

The base model information includes:

  • The number of base models used by the ensemble (int)
  • The count of occurrences of each base model algorithm type (displayed in an H2OFrame)
  • If base_model_detail=True is set, model summaries for each base model are outputted (displayed in H2OFrames). This is the only code that does not currently exist in the R stacked ensemble model summary.

Metalearner information includes:

  • The metalearner algorithm (string)
  • If the information exists (i.e. is not None), the number of folds used by the metalearner (int) and the fold assignment
  • If the information exists, the parameters of the metalearner

Since the outputted information includes at least one H2OFrame, this function relies on print statements to display information to the user.
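To make the structure of that output concrete, here is a minimal pure-Python sketch of how the summary text could be assembled. It is not the code from gen_python.py: the function name `ensemble_summary_lines` and all of its parameters are hypothetical, and it uses plain lists and strings in place of H2OFrames so that it runs without an H2O cluster.

```python
from collections import Counter


def ensemble_summary_lines(base_algos, metalearner_algo,
                           metalearner_nfolds=None,
                           metalearner_fold_assignment=None,
                           metalearner_params=None):
    """Build the summary as a list of text lines (hypothetical helper).

    base_algos is a list of algorithm names, one per base model,
    e.g. ["gbm", "drf"]. Optional metalearner fields are only
    reported when they are not None.
    """
    counts = Counter(base_algos)  # occurrences of each algorithm type
    lines = [
        "Base Model Information:",
        "",
        "Number of Base Models: {}".format(len(base_algos)),
        "Base Models (count by algorithm type):",
    ]
    for algo, n in counts.items():
        lines.append("  {}: {}".format(algo, n))

    lines += [
        "",
        "Metalearner Information:",
        "",
        "Metalearner Algorithm: {}".format(metalearner_algo),
    ]
    # Optional details are skipped entirely when the value is missing.
    if metalearner_nfolds is not None:
        lines.append("Metalearner Folds: {}".format(metalearner_nfolds))
    if metalearner_fold_assignment is not None:
        lines.append("Metalearner Fold Assignment: {}".format(
            metalearner_fold_assignment))
    if metalearner_params is not None:
        lines.append("Metalearner Parameters: {}".format(metalearner_params))
    return lines


if __name__ == "__main__":
    print("\n".join(ensemble_summary_lines(["gbm", "drf"], "glm")))
```

Returning the lines rather than printing them directly is only a choice made for this sketch, since it makes the output easy to check in a test.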

Usage

Using the stacked ensemble example from the stacked ensemble documentation:

# Train a stacked ensemble using the GBM and RF above
ensemble = H2OStackedEnsembleEstimator(model_id="my_ensemble_binomial",
                                       base_models=[my_gbm, my_rf])
ensemble.train(x=x, y=y, training_frame=train)

# Show ensemble model summary 
ensemble.model_summary() # or..
ensemble.model_summary(base_model_detail=True) # additionally outputs summaries of the base models 

Other information

Information on system and packages:

  • macOS High Sierra Version 10.13.5
  • Python 3.7.0
  • Package versions:
    • colorama==0.3.9
    • future==0.16.0
    • h2o==3.21.0.99999
    • pandas==0.23.1
    • requests==2.19.1
    • scipy==1.1.0
    • tabulate==0.8.2

Local build executed using the following commands on the command line in a virtual environment:

git clone <forked H2O repo>
cd h2o-3
./gradlew build -x test
gradle installDist
cd h2o-py
python setup.py install

jdubchak avatar Jul 23 '18 05:07 jdubchak

@jdubchak great contribution, thank you!!!

michalkurka avatar Jul 27 '18 18:07 michalkurka

@michalkurka Thank you.

jdubchak avatar Jul 27 '18 19:07 jdubchak

@jdubchak This is great! The output looks good and I think the use of an extra argument, base_model_detail, is an excellent addition.

I think the only thing that we should change is that the method should be called .summary() instead of .model_summary(). Sorry if the JIRA was confusing: when I said "model_summary", I was referring to the R method, which is called that. In Python, it's called .summary() and we already use .summary() for regular (non-ensemble) models. Example:

In [10]: my_gbm.summary()
Out[10]: Model Summary:
    number_of_trees    number_of_internal_trees    model_size_in_bytes    min_depth    max_depth    mean_depth    min_leaves    max_leaves    mean_leaves
--  -----------------  --------------------------  ---------------------  -----------  -----------  ------------  ------------  ------------  -------------
    10                 10                          1618                   3            3            3             8             8             8

Right now for Stacked Ensemble, the method is still there (it's present for any type of model), but we see this:

In [11]: ensemble.summary()
No model summary for this model

The summary() method pulls the output from ensemble._model_output["model_summary"], so you don't actually need to add a new method. However, we do need to populate this field. I can't find where this is created in Python, so I think that might mean that this text is created/stored on the Java side. @michalkurka do you know where to modify/set this information?

Another way to solve this in pure Python (in R we did this on the client side as well) is to create a hidden method for SEs to store the info (e.g. ensemble._ensemble_summary) and then modify the generic summary() method to pull that info if it's a Stacked Ensemble. I am also not sure how to add this extra argument, base_model_detail, if we re-use the generic summary() method. I'm sure it's doable, but I don't know how best to add it only for SEs. @michalkurka Please let us know what route you think is best.
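One possible shape for the pure-Python route is sketched below, using stand-in classes rather than the real h2o-py ones. `ModelBase` and `StackedEnsemble` here are hypothetical, as are their constructor arguments. Only the `_model_output["model_summary"]` lookup, the "No model summary for this model" message, and the `_ensemble_summary` and `base_model_detail` names come from this thread. The sketch overrides summary() in the ensemble class so that the extra argument exists only for SEs.

```python
class ModelBase:
    """Stand-in for a generic model class (hypothetical, not h2o-py)."""

    def __init__(self, model_output):
        self._model_output = model_output

    def summary(self):
        # Generic behaviour: show whatever is stored under "model_summary".
        table = self._model_output.get("model_summary")
        if table is None:
            print("No model summary for this model")
            return None
        print(table)
        return table


class StackedEnsemble(ModelBase):
    """Stand-in ensemble class that builds its summary on the client side."""

    def __init__(self, model_output, base_models, metalearner_algo):
        super().__init__(model_output)
        self._base_models = base_models          # dict: name -> ModelBase
        self._metalearner_algo = metalearner_algo

    def _ensemble_summary(self, base_model_detail=False):
        # Hidden helper that assembles the ensemble-specific text.
        lines = [
            "Number of Base Models: {}".format(len(self._base_models)),
            "Metalearner Algorithm: {}".format(self._metalearner_algo),
        ]
        if base_model_detail:
            for name, model in self._base_models.items():
                detail = model._model_output.get("model_summary")
                lines.append("{}: {}".format(name, detail))
        return "\n".join(lines)

    def summary(self, base_model_detail=False):
        # Override: only ensembles accept the extra keyword argument.
        text = self._ensemble_summary(base_model_detail)
        print(text)
        return text


if __name__ == "__main__":
    gbm = ModelBase({"model_summary": "gbm summary table"})
    ensemble = StackedEnsemble({}, {"gbm": gbm}, "glm")
    ensemble.summary(base_model_detail=True)
```

The alternative described above, a type check inside the generic summary(), would keep a single method but would leave base_model_detail visible on every model type; the override keeps it local to ensembles. The return values exist only to make the sketch testable.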

ledell avatar Jul 28 '18 04:07 ledell

Sorry, I didn't mean to hit "close". Re-opening...

ledell avatar Jul 28 '18 04:07 ledell

I'm just adding the current output here for reference:

In [21]: ensemble.model_summary()
Base Model Information:

Number of Base Models: 2
Base Models (count by algorithm type):
  gbm    drf
-----  -----
    1      1

[1 row x 2 columns]


Metalearner Information:

Metalearner Algorithm: glm

In [22]: ensemble.model_summary(base_model_detail=True)
Base Model Information:

Number of Base Models: 2
Base Models (count by algorithm type):
  gbm    drf
-----  -----
    1      1

[1 row x 2 columns]

Base model details:
gbm:
  number_of_trees    min_leaves    mean_depth    max_leaves    model_size_in_bytes    min_depth    mean_leaves    number_of_internal_trees    max_depth
-----------------  ------------  ------------  ------------  ---------------------  -----------  -------------  --------------------------  -----------
               10             8             3             8                   1618            3              8                          10            3

[1 row x 9 columns]

drf:
  number_of_trees    min_leaves    mean_depth    max_leaves    model_size_in_bytes    min_depth    mean_leaves    number_of_internal_trees    max_depth
-----------------  ------------  ------------  ------------  ---------------------  -----------  -------------  --------------------------  -----------
               50          1402            20          1583                 943605           20        1496.52                          50           20

[1 row x 9 columns]


Metalearner Information:

Metalearner Algorithm: glm

ledell avatar Jul 28 '18 04:07 ledell