sparkling-water
How to export AutoML models in real time?
In H2O AutoML training, the models are trained sequentially. Is there any way I can export each model as soon as it completes, instead of waiting for all model training to finish? One idea I got from the H2O Slack team is to run a parallel script that exports the models from the H2O cluster. I have attached the script below.
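import re
import time

import h2o

h2o.connect()  # attach to the already running H2O cluster
exported_models = set()  # keys that have already been downloaded
project_name = "AutoML_2"
# Match model keys of the known algorithm families that mention the project name.
rx = re.compile(f"^(DeepLearning|DRF|GBM|GLM|StackedEnsemble|XGBoost).*{re.escape(project_name)}.*$")
while True:
    for row in h2o.ls().values:  # each row holds one key stored in the cluster
        key = row[0]
        # Skip keys that do not match and the cross-validation sub-models.
        if rx.match(key) is None or "_cv_" in key:
            continue
        if key in exported_models:
            continue
        print(key)
        # A grid key without "_model_" is the grid itself, not a model.
        if "_grid_" in key and "_model_" not in key:
            continue
        h2o.get_model(key).download_model("path_somewhere")
        exported_models.add(key)  # record only after a successful download
    time.sleep(10)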
In this script, what should the project name be? Also, I am not able to export any models using this script. Any help with this issue would be appreciated.
@mn-mikke do you expose functions like h2o.automl.get_automl(project_name) in the Py client for SW?
This would make things easier for @gurumoorthy208524.
Currently the AutoML API doesn't allow listing trained models directly, but there are ways to fetch them in a separate thread/process (e.g. in a separate script as suggested above):
# this is h2o-3 py client code, not SW py client code!!!
import h2o

h2o.connect()
aml = h2o.automl.get_automl("my_aml_project")
lb = h2o.automl.get_leaderboard(aml)
model_ids = [lb[i, 'model_id'] for i in range(lb.nrows)]
# models = [h2o.get_model(mid) for mid in model_ids]
for mid in model_ids:
    h2o.get_model(mid).download_model("dest_folder_path")
Note that it's still not "live". On top of the logic above, duplicates need to be detected, and it would be better if AutoML could notify subscribers when a new model has been built. With the current event log mechanism, AutoML could easily trigger an event each time a model is added to the leaderboard, allow clients to subscribe to those events, and have the event handler function run in a separate Py thread. If users are interested in such a feature, we could add it. Would you have to duplicate this for the SW Py client, or are you somehow able to reuse the h2o-3 one?
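Until something like that exists, the duplicate detection can be done on the client side by polling. Below is a minimal sketch, not an official API: `export_new_models`, `watch` and `leaderboard_ids` are hypothetical helper names, and only the h2o calls already shown in this thread (`get_automl`, `get_leaderboard`, `get_model`, `download_model`) are used. I have not checked how `get_automl` behaves before the project exists on the cluster, so the watcher may need error handling around that call.

```python
import threading
from typing import Callable, Iterable, List, Set


def export_new_models(
    list_model_ids: Callable[[], Iterable[str]],
    export: Callable[[str], None],
    seen: Set[str],
) -> List[str]:
    """Export every model ID not seen before and return the newly exported IDs."""
    new_ids = []
    for model_id in list_model_ids():
        if model_id in seen:
            continue  # duplicate: already exported in an earlier pass
        export(model_id)
        seen.add(model_id)  # record only after the export succeeded
        new_ids.append(model_id)
    return new_ids


def watch(list_model_ids, export, stop: threading.Event, interval: float = 10.0) -> Set[str]:
    """Poll until `stop` is set, then run one final pass to catch late models."""
    seen: Set[str] = set()
    while not stop.is_set():
        export_new_models(list_model_ids, export, seen)
        stop.wait(interval)  # sleeps, but wakes up early when `stop` is set
    export_new_models(list_model_ids, export, seen)
    return seen


def leaderboard_ids(project_name: str) -> List[str]:
    """Model IDs currently on the leaderboard (h2o-3 Py client, as in the snippet above)."""
    import h2o  # imported lazily so the helpers above work without a cluster

    aml = h2o.automl.get_automl(project_name)
    lb = h2o.automl.get_leaderboard(aml)
    return [lb[i, "model_id"] for i in range(lb.nrows)]


# Hypothetical wiring, run from a second script or thread after h2o.connect():
#   stop = threading.Event()
#   worker = threading.Thread(
#       target=watch,
#       args=(lambda: leaderboard_ids("my_aml_project"),
#             lambda mid: h2o.get_model(mid).download_model("dest_folder_path"),
#             stop),
#   )
#   worker.start()
#   ...  # AutoML trains elsewhere
#   stop.set(); worker.join()
```

The listing and exporting steps are passed in as plain callables so the dedup logic can be exercised without a cluster. This is still polling, so a model shows up in the destination folder at most one interval after it reaches the leaderboard.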
Another possibility would be to let the "main" AutoML training monitoring logic run in a subthread instead of the main one, but encouraging full access to the AutoML instance during training would probably create more issues.
The Sparkling Water API doesn't have an equivalent of h2o.automl.get_automl(project_name), but I don't see any reason why this method couldn't be combined with the SW API. H2OAutoML in the SW API and its derivatives H2OAutoMLClassifier and H2OAutoMLRegressor have a setter method, setProjectName, which you can use to define a custom project name.
So, @mn-mikke, there is currently no way to save the models from a separate thread? Correct me if I am wrong. Or do we have any alternative options?
@gurumoorthy208524 you should be able to combine the code snippet from @sebhrusen with the SW API. You just need to set your own projectName on H2OAutoMLClassifier or H2OAutoMLRegressor.
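A rough sketch of the training side, assuming a working Sparkling Water session. `setProjectName` is the setter mentioned above. The function name, the `label_col` argument and the `estimator_cls` hook are hypothetical, and the `labelCol` constructor keyword is written from memory, so check it against your SW version.

```python
def train_automl_with_project_name(train_df, project_name, label_col, estimator_cls=None):
    """Fit Sparkling Water AutoML under a project name known in advance."""
    if estimator_cls is None:
        # Needs Spark with Sparkling Water available; imported lazily on purpose.
        from pysparkling.ml import H2OAutoMLClassifier

        estimator_cls = H2OAutoMLClassifier
    automl = estimator_cls(labelCol=label_col)  # keyword assumed, see note above
    # Fix the project name so a separate script can look the run up by it.
    automl.setProjectName(project_name)
    return automl.fit(train_df)
```

A second script connected to the same H2O cluster can then pass the same name to h2o.automl.get_automl and export models while this call is still training.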