h2o-3
Model attributes not being populated in Python grid search models
If you train a grid, the model parameters (attributes of the {{H2OGradientBoostingEstimator}} class) are not being set.
Example:
{code}
import h2o
h2o.init()

csv_url = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv"
prostate = h2o.import_file(csv_url)

prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
prostate['RACE'] = prostate['RACE'].asfactor()
prostate['DCAPS'] = prostate['DCAPS'].asfactor()
prostate['DPROS'] = prostate['DPROS'].asfactor()

x = list(range(2, 9))
y = 1
{code}
Train a non-grid model:
{code}
# import the H2O GBM estimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator

model = H2OGradientBoostingEstimator(distribution='bernoulli', ntrees=100,
                                     max_depth=4, learn_rate=0.1, nfolds=5,
                                     keep_cross_validation_predictions=True)
model.train(x=x, y=y, training_frame=prostate)
model.nfolds  # this is 5
{code}
However, if we train models via grid search, the model parameters are blank:
{code}
ntrees_opt = [5, 50, 100]
max_depth_opt = [2, 3, 5]
learn_rate_opt = [0.1, 0.2]
hyper_params = {'ntrees': ntrees_opt, 'max_depth': max_depth_opt, 'learn_rate': learn_rate_opt}

from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.gbm import H2OGradientBoostingEstimator

gbm_grid = H2OGridSearch(H2OGradientBoostingEstimator(nfolds=5, keep_cross_validation_predictions=True),
                         hyper_params=hyper_params)
gbm_grid.train(x=x, y=y, training_frame=prostate)

gbm_grid[0].nfolds  # this is currently blank, and should be 5
{code}
This happens for all of the model parameters: they are all blank on models trained via grid search.
Lauren DiPerna commented: currently, to extract an attribute from a grid search model (the first model, for example), you have to do:
{code}
print(sorted_grid[0].params['lambda']['actual'][0])
{code}
but you should be able to do the following (which you can do in R):
{code}
model_1 = h2o.get_model("Grid_GLM_py_4_sid_9dd9_model_python_1481244882436_2_model_4")
model_1.show()
model_1.alpha
{code}
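Until the attributes are populated, a small helper can collect the actual values out of the {{params}} dict that grid models do expose. This is only a sketch in plain Python: it assumes the per-parameter {{{'default': ..., 'actual': ...}}} shape shown above, and the helper name {{actual_params}} is made up here, not part of the h2o API.

```python
def actual_params(model_params):
    """Collect the 'actual' value of every parameter from a dict shaped
    like H2O's model.params: {name: {'default': ..., 'actual': ...}}."""
    return {name: entry['actual']
            for name, entry in model_params.items()
            if isinstance(entry, dict) and 'actual' in entry}

# usage with the dict shape reported above (e.g. sorted_grid[0].params):
params = {'lambda': {'default': [0.0], 'actual': [0.001]},
          'alpha':  {'default': [0.5], 'actual': [0.25]}}
print(actual_params(params)['alpha'][0])  # 0.25
```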
Here is another snippet that reproduces the issue:
{code}
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()

# import the Boston dataset:
# this dataset looks at features of the Boston suburbs and predicts median housing prices
# the original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Housing
boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")

# set the predictor names and the response column name
predictors = boston.columns[:-1]
# set the response column to "medv", the median value of owner-occupied homes in $1000's
response = "medv"

# convert the chas column to a factor (chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise))
boston['chas'] = boston['chas'].asfactor()

# split into train and validation sets
train, valid = boston.split_frame(ratios=[.8], seed=1234)

# try using the alpha parameter:
# initialize the estimator, then train the model
boston_glm = H2OGeneralizedLinearEstimator(alpha=.25, seed=1234)
boston_glm.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

# print the mse for the validation set
print(boston_glm.mse(valid=True))

# grid over alpha
# import Grid Search
from h2o.grid.grid_search import H2OGridSearch

# select the values for alpha to grid over
hyper_params = {'alpha': [0, .25, .5, .75, .1]}

# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space,
# use random grid search instead: {'strategy': "RandomDiscrete"}

# initialize the GLM estimator
boston_glm_2 = H2OGeneralizedLinearEstimator(nfolds=5, seed=1234)

# build grid search with the previously made GLM and hyperparameters
grid = H2OGridSearch(model=boston_glm_2, hyper_params=hyper_params,
                     search_criteria={'strategy': "Cartesian"})

# train using the grid
grid.train(x=predictors, y=response, training_frame=train, validation_frame=valid)
# train again, this time without a validation frame
grid.train(x=predictors, y=response, training_frame=train)

# sort the grid models by mse
sorted_grid = grid.get_grid(sort_by='mse', decreasing=False)
print(sorted_grid)

# this works
print(boston_glm.alpha)

# get the type
print(type(sorted_grid))

# can you get the model summary from a model extracted from the grid search?
sorted_grid[0]  # answer: yes

sorted_grid[0].params['lambda']['actual'][0]  # this is the only way to get the attributes
{code}
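The fix being asked for essentially amounts to copying each parameter's 'actual' value back onto the estimator after grid training, the way non-grid training already populates attributes. Below is a minimal plain-Python sketch of that backfill step; {{FakeModel}} and {{backfill_attributes}} are hypothetical names used for illustration, and the only assumption is the {{params}} dict shape shown in this report.

```python
class FakeModel:
    """Stand-in for a grid-trained estimator whose attributes were left blank."""
    def __init__(self, params):
        self.params = params  # {name: {'default': ..., 'actual': ...}}

def backfill_attributes(model):
    # copy each parameter's 'actual' value onto the model as a plain
    # attribute, mirroring what non-grid training already exposes
    for name, entry in model.params.items():
        setattr(model, name, entry['actual'])
    return model

m = backfill_attributes(FakeModel({'nfolds': {'default': 0, 'actual': 5},
                                   'alpha':  {'default': [0.5], 'actual': [0.25]}}))
print(m.nfolds)  # 5
print(m.alpha)   # [0.25]
```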
Lauren DiPerna commented: this is not an issue in R, using the following code:
{code}
library(h2o)
h2o.init()

# import the Boston dataset:
# this dataset looks at features of the Boston suburbs and predicts median housing prices
# the original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Housing
boston <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")

# set the predictor names and the response column name
predictors <- colnames(boston)[1:13]
# set the response column to "medv", the median value of owner-occupied homes in $1000's
response <- "medv"

# convert the chas column to a factor (chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise))
boston["chas"] <- as.factor(boston["chas"])

# split into train and validation sets
boston.splits <- h2o.splitFrame(data = boston, ratios = .8, seed = 1234)
train <- boston.splits[[1]]
valid <- boston.splits[[2]]

# try using the alpha parameter:
# train your model, specifying alpha
boston_glm <- h2o.glm(x = predictors, y = response,
                      training_frame = train, validation_frame = valid,
                      alpha = .25, seed = 1234)

# print the mse for the validation set
print(h2o.mse(boston_glm, valid = TRUE))

# grid over alpha
# select the values for alpha to grid over
hyper_params <- list(alpha = c(0, .25, .5, .75, .1))

# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space,
# use random grid search instead: list(strategy = "RandomDiscrete")

# build grid search with the previously made GLM and hyperparameters
grid <- h2o.grid(x = predictors, y = response,
                 training_frame = train, validation_frame = valid,
                 algorithm = "glm", grid_id = "boston_grid",
                 hyper_params = hyper_params,
                 search_criteria = list(strategy = "Cartesian"), seed = 1234)

# sort the grid models by mse
sortedGrid <- h2o.getGrid("boston_grid", sort_by = "mse", decreasing = FALSE)
sortedGrid
{code}
In R you can do:
{code}
sortedGrid@model_ids
model_1 <- h2o.getModel("boston_grid_model_4")
model_1@allparameters$prior
{code}
but if you do the same thing in Python, it doesn't work.
JIRA Issue Migration Info
Jira Issue: PUBDEV-2465
Assignee: New H2O Bugs
Reporter: Erin LeDell
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A