
How to export AutoML models in real time?

Open KunfuPanda24 opened this issue 2 years ago • 4 comments

In H2O AutoML training, the models are trained sequentially. Is there any way I can export each model as soon as it completes, instead of waiting for all model training to finish? One idea I got from the H2O Slack team is to run a parallel script that exports the models from the H2O cluster. I have attached the script below.

import re
import time

import h2o

h2o.connect()
exported_models = set()

project_name = "AutoML_2"  # project name of the AutoML run
rx = re.compile(f"^(DeepLearning|DRF|GBM|GLM|StackedEnsemble|XGBoost).*{re.escape(project_name)}.*$")
while True:
    for row in h2o.ls().values:
        key = row[0]
        # skip keys that are not models of this project, and cross-validation models
        if rx.match(key) is None or "_cv_" in key:
            continue
        # skip grid objects themselves; keep the models inside a grid
        if "_grid_" in key and "_model_" not in key:
            continue
        if key in exported_models:
            continue
        print(key)
        h2o.get_model(key).download_model("path_somewhere")
        # mark as exported only after the download succeeded
        exported_models.add(key)
    time.sleep(10)

In this script, what should the project name be? Also, I am not able to export any models using this script. Any help with this issue would be appreciated.

KunfuPanda24 avatar May 11 '22 08:05 KunfuPanda24

@mn-mikke do you expose functions like h2o.automl.get_automl(project_name) in the Py client for SW? This would make things easier for @gurumoorthy208524.

Currently the AutoML API doesn't allow listing trained models directly, but there are ways to fetch them in a separate thread/process (e.g. in a separate script as suggested above):

# this is h2o-3 py client code, not SW py client code!!!
import h2o

h2o.connect()
aml = h2o.automl.get_automl("my_aml_project")
lb = h2o.automl.get_leaderboard(aml)
model_ids = [lb[i, 'model_id'] for i in range(lb.nrows)]
# models = [h2o.get_model(mid) for mid in model_ids]
for mid in model_ids:
    h2o.get_model(mid).download_model("dest_folder_path")

Note that this is still not "live": on top of the logic above, duplicates need to be detected, and it would be better if AutoML could notify subscribers when a new model has been built. With the current event log mechanism, AutoML could easily trigger an event each time a model is added to the leaderboard, and clients could subscribe to those events with a handler function running in a separate Py thread. If users are interested in such a feature, we could add it. Would you have to duplicate this for the SW Py client, or are you somehow able to reuse the h2o-3 one?

Another possibility would be to let the "main" AutoML training monitoring logic run in a subthread instead of the main thread, but encouraging full access to the AutoML instance during training would probably create more issues.
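Until such an event mechanism exists, the polling approach (separate thread plus duplicate detection) could look roughly like the sketch below. It is only an illustration: `start_live_export`, `list_model_ids`, `export_model` and `interval_s` are hypothetical names, not part of h2o-3 or Sparkling Water. The two callables are passed in so the loop itself does not depend on a running cluster, and errors from the listing call are tolerated on the assumption that the AutoML project may not exist yet when polling starts.

```python
import threading


def start_live_export(list_model_ids, export_model, interval_s=10.0):
    """Poll for model IDs in a background thread and export each ID once.

    list_model_ids: callable returning an iterable of model IDs; it may raise
        (for example while the AutoML project has not been created yet).
    export_model: callable taking one model ID and saving that model.
    Returns (stop_event, thread, exported_ids).
    """
    exported = set()          # duplicate detection: IDs already exported
    stop = threading.Event()  # set() this to end the polling loop

    def poll_once():
        try:
            model_ids = list(list_model_ids())
        except Exception:
            return  # nothing to list yet; try again on the next round
        for mid in model_ids:
            if mid in exported:
                continue
            try:
                export_model(mid)
            except Exception:
                continue  # left unmarked so the export is retried later
            exported.add(mid)

    def run():
        while not stop.is_set():
            poll_once()
            stop.wait(interval_s)  # sleeps, but wakes up early on stop
        poll_once()  # final sweep for models finished just before stop

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return stop, thread, exported


# Possible wiring with the h2o-3 client calls from the snippet above
# (kept as comments because it needs a running cluster):
#
# import h2o
# h2o.connect()
#
# def list_ids():
#     aml = h2o.automl.get_automl("my_aml_project")
#     lb = h2o.automl.get_leaderboard(aml)
#     return [lb[i, "model_id"] for i in range(lb.nrows)]
#
# stop, thread, exported = start_live_export(
#     list_ids,
#     lambda mid: h2o.get_model(mid).download_model("dest_folder_path"),
# )
```

Calling `stop.set()` followed by `thread.join()` once training has finished ends the loop after one last sweep of the leaderboard.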

sebhrusen avatar May 11 '22 13:05 sebhrusen

The Sparkling Water API doesn't have an equivalent of h2o.automl.get_automl(project_name), but I don't see any reason why this method couldn't be combined with the SW API. H2OAutoML in the SW API and its derivatives H2OAutoMLClassifier and H2OAutoMLRegressor have a setter method, setProjectName, which you can use to define a custom project name.

mn-mikke avatar May 11 '22 14:05 mn-mikke

So @mn-mikke, there is currently no way to save the models from a separate thread? Correct me if I am wrong. Do we have any alternative options?

KunfuPanda24 avatar May 12 '22 13:05 KunfuPanda24

@gurumoorthy208524 you should be able to combine the code snippet from @sebhrusen with the SW API. You just need to set your own project name via setProjectName on H2OAutoMLClassifier or H2OAutoMLRegressor.
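A rough sketch of that combination, under stated assumptions: `fit_with_named_project` and `start_exporter` are hypothetical names introduced here for illustration, and the estimator is only assumed to expose `setProjectName()` and `fit()`, as H2OAutoMLClassifier and H2OAutoMLRegressor do. The point is that the same project name is given both to the SW estimator and to whatever background export logic looks the project up through the h2o-3 client.

```python
def fit_with_named_project(estimator, training_df, project_name, start_exporter):
    """Set a known project name, run an exporter alongside fit(), then stop it.

    estimator: assumed to be an H2OAutoMLClassifier or H2OAutoMLRegressor
        (any object with setProjectName() and fit() works here).
    start_exporter: hypothetical callable; it receives the project name,
        starts exporting in the background (thread or process), and returns
        a zero-argument function that stops the export.
    """
    # Use an explicit name so the exporter can look the project up by name.
    estimator.setProjectName(project_name)
    stop_exporter = start_exporter(project_name)
    try:
        # fit() blocks until AutoML has finished; the exporter runs meanwhile.
        return estimator.fit(training_df)
    finally:
        # Stop the exporter even if training fails.
        stop_exporter()
```

The exporter is started before `fit()` because training blocks the calling thread. Whether the h2o-3 client can reach the cluster used by Sparkling Water in a given deployment is something to verify separately.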

mn-mikke avatar May 26 '22 10:05 mn-mikke