h2o-3

Model attributes not being populated in Python grid search models

Open · exalate-issue-sync[bot] opened this issue 1 year ago · 3 comments

If you train a grid, the model parameters (attributes of the {{H2OGradientBoostingEstimator}} class) are not set on the resulting models.

Example:

{code}
import h2o
h2o.init()

csv_url = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv"
prostate = h2o.import_file(csv_url)

prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
prostate['RACE'] = prostate['RACE'].asfactor()
prostate['DCAPS'] = prostate['DCAPS'].asfactor()
prostate['DPROS'] = prostate['DPROS'].asfactor()

x = list(range(2, 9))
y = 1
{code}

Train a non-grid model:

{code}
# import H2O GBM
from h2o.estimators.gbm import H2OGradientBoostingEstimator

model = H2OGradientBoostingEstimator(distribution='bernoulli', ntrees=100,
                                     max_depth=4, learn_rate=0.1, nfolds=5,
                                     keep_cross_validation_predictions=True)

model.train(x=x, y=y, training_frame=prostate)

model.nfolds  # this is 5
{code}

However, if we train models via grid search, the model params are blank:

{code}
ntrees_opt = [5, 50, 100]
max_depth_opt = [2, 3, 5]
learn_rate_opt = [0.1, 0.2]

hyper_params = {'ntrees': ntrees_opt, 'max_depth': max_depth_opt, 'learn_rate': learn_rate_opt}

from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.gbm import H2OGradientBoostingEstimator

gbm_grid = H2OGridSearch(H2OGradientBoostingEstimator(nfolds=5, keep_cross_validation_predictions=True),
                         hyper_params=hyper_params)

gbm_grid.train(x=x, y=y, training_frame=prostate)

gbm_grid[0].nfolds  # this is currently blank, and should be 5
{code}

This happens for all the model parameters: they are all blank on models trained via grid search.

exalate-issue-sync[bot], May 13 '23 18:05

Lauren DiPerna commented: currently, if you want to extract an attribute from a grid-search model (for example, the first model), you have to do:

{code}
print(sorted_grid[0].params['lambda']['actual'][0])
{code}

but you should be able to do the following (which you can do in R):

{code}
model_1 = h2o.get_model("Grid_GLM_py_4_sid_9dd9_model_python_1481244882436_2_model_4")

model_1.show()

model_1.alpha
{code}
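Until the attributes are populated, the nested-dict access above can be wrapped in a small helper. This is a hypothetical convenience function, not part of the h2o API; the `MockModel` class below is a stand-in with the same `params` shape as a grid-search model, so the sketch runs without an H2O cluster.

```python
def get_actual_param(model, name):
    """Return the actual (post-training) value of a named model parameter.

    Assumes model.params is a dict of {name: {'default': ..., 'actual': ...}},
    the shape exposed by grid-search models in the report above.
    """
    entry = model.params.get(name)
    if entry is None:
        raise KeyError("model has no parameter named %r" % name)
    return entry.get('actual')


# Minimal stand-in for a grid-search model (illustrative values only).
class MockModel:
    params = {'lambda': {'default': [0.0], 'actual': [0.001]},
              'alpha': {'default': [0.5], 'actual': [0.25]}}


print(get_actual_param(MockModel(), 'alpha'))  # [0.25]
```

With a real grid, `get_actual_param(sorted_grid[0], 'lambda')` would stand in for the verbose `sorted_grid[0].params['lambda']['actual']` access.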

Here is another snippet to run to see the issue:

{code}
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()

# import the boston dataset:
# this dataset looks at features of the boston suburbs and predicts median housing prices
# the original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Housing
boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")

# set the predictor names and the response column name
predictors = boston.columns[:-1]

# set the response column to "medv", the median value of owner-occupied homes in $1000's
response = "medv"

# convert the chas column to a factor (chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise))
boston['chas'] = boston['chas'].asfactor()

# split into train and validation sets
train, valid = boston.split_frame(ratios=[.8], seed=1234)

# try using the alpha parameter:
# initialize the estimator, then train the model
boston_glm = H2OGeneralizedLinearEstimator(alpha=.25, seed=1234)
boston_glm.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

# print the mse for the validation set
print(boston_glm.mse(valid=True))

# grid over alpha

# import grid search
from h2o.grid.grid_search import H2OGridSearch

# select the values for alpha to grid over
hyper_params = {'alpha': [0, .25, .5, .75, .1]}

# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space,
# use random grid search instead: {'strategy': "RandomDiscrete"}

# initialize the GLM estimator
boston_glm_2 = H2OGeneralizedLinearEstimator(nfolds=5, seed=1234)

# build grid search with the previously made GLM and hyperparameters
grid = H2OGridSearch(model=boston_glm_2, hyper_params=hyper_params,
                     search_criteria={'strategy': "Cartesian"})

# train using the grid
grid.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

# sort the grid models by mse
sorted_grid = grid.get_grid(sort_by='mse', decreasing=False)
print(sorted_grid)

# this works
print(boston_glm.alpha)

# get the type
print(type(sorted_grid))

# can you get the model summary from a model extracted from the grid search?
sorted_grid[0]  # answer: yes

sorted_grid[0].params['lambda']['actual'][0]  # this is the only way to get the attributes
{code}
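A related workaround is to flatten the whole `params` dict into a plain dict of actual values, approximating what the missing model attributes would expose. This is an illustrative sketch (the function and mock class names are not h2o API); `MockGridModel` mimics the nested `params` shape so the code runs without an H2O cluster.

```python
def actual_params(model):
    """Collect every parameter's 'actual' value into a flat dict.

    Assumes model.params maps each name to {'default': ..., 'actual': ...},
    the shape seen on grid-search models in the snippet above.
    """
    return {name: entry.get('actual')
            for name, entry in model.params.items()}


# Stand-in object with the same nested-dict shape as model.params
# (illustrative values only).
class MockGridModel:
    params = {'alpha': {'default': [0.5], 'actual': [0.25]},
              'nfolds': {'default': 0, 'actual': 5}}


print(actual_params(MockGridModel()))
```

With a real grid, `actual_params(sorted_grid[0])` would give one place to look up what `sorted_grid[0].alpha`, `sorted_grid[0].nfolds`, etc. should report once the attributes are populated.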

exalate-issue-sync[bot], May 13 '23 18:05

Lauren DiPerna commented: this is not an issue in R, using the following code:

{code}
library(h2o)
h2o.init()

# import the boston dataset:
# this dataset looks at features of the boston suburbs and predicts median housing prices
# the original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Housing
boston <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")

# set the predictor names and the response column name
predictors <- colnames(boston)[1:13]

# set the response column to "medv", the median value of owner-occupied homes in $1000's
response <- "medv"

# convert the chas column to a factor (chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise))
boston["chas"] <- as.factor(boston["chas"])

# split into train and validation sets
boston.splits <- h2o.splitFrame(data = boston, ratios = .8, seed = 1234)
train <- boston.splits[[1]]
valid <- boston.splits[[2]]

# try using the alpha parameter:
# train your model, where you specify alpha
boston_glm <- h2o.glm(x = predictors, y = response, training_frame = train,
                      validation_frame = valid, alpha = .25, seed = 1234)

# print the mse for the validation set
print(h2o.mse(boston_glm, valid = TRUE))

# grid over alpha

# select the values for alpha to grid over
hyper_params <- list(alpha = c(0, .25, .5, .75, .1))

# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space,
# use random grid search instead: list(strategy = "RandomDiscrete")

# build grid search with the previously made GLM and hyperparameters
grid <- h2o.grid(x = predictors, y = response, training_frame = train,
                 validation_frame = valid, algorithm = "glm", grid_id = "boston_grid",
                 hyper_params = hyper_params,
                 search_criteria = list(strategy = "Cartesian"), seed = 1234)

# sort the grid models by mse
sortedGrid <- h2o.getGrid("boston_grid", sort_by = "mse", decreasing = FALSE)
sortedGrid
{code}

you can do:

{code}
sortedGrid@model_ids
model_1 = h2o.getModel("boston_grid_model_4")
model_1@allparameters$prior
{code}

If you do the same thing in Python, it doesn't work.

exalate-issue-sync[bot], May 13 '23 18:05

JIRA Issue Migration Info

Jira Issue: PUBDEV-2465
Assignee: New H2O Bugs
Reporter: Erin LeDell
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A

DinukaH2O, May 15 '23 10:05