
🚧 Adds MLflow materializer

Open bryangalindo opened this issue 1 year ago • 8 comments

🚧 WIP 🚧

Changes

How I tested this

Notes

Checklist

  • [ ] PR has an informative and human-readable title (this will be pulled into the release notes)
  • [ ] Changes are limited to a single goal (no scope creep)
  • [ ] Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • [ ] Any change in functionality is tested
  • [ ] New functions are documented (with a description, list of inputs, and expected output)
  • [ ] Placeholder code is flagged / future TODOs are captured in comments
  • [ ] Project documentation has been updated if adding/changing functionality.

bryangalindo avatar Sep 18 '23 21:09 bryangalindo

How to save a model into MLflow: https://mlflow.org/docs/latest/quickstart.html#store-a-model-in-mlflow
How to load a model from MLflow: https://mlflow.org/docs/latest/quickstart.html#load-a-model-from-a-specific-training-run-for-inference

bryangalindo avatar Sep 19 '23 04:09 bryangalindo

Model flavors can be found in the MLflow docs or enumerated below (though `crate`, the R flavor, seems to be missing?)

>>> import mlflow
>>> mlflow.__version__
'2.7.1'
>>> [attr for attr in dir(mlflow) if hasattr(getattr(mlflow, attr), 'log_model')]
[
    'catboost', 'diviner', 'fastai', 'gluon', 'h2o', 'johnsnowlabs', 'langchain', 
    'lightgbm', 'mleap', 'onnx', 'openai', 'paddle', 'pmdarima', 'prophet', 
    'pyfunc', 'pytorch', 'sentence_transformers', 'sklearn', 'spacy', 'spark', 
    'statsmodels', 'tensorflow', 'transformers', 'xgboost'
]

top three flavors (probably): sklearn, tensorflow, pytorch. no hard data, just vibes.
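The REPL one-liner above could be wrapped into a small helper, e.g. to validate a user-supplied flavor string before dispatching to `getattr(mlflow, flavor).log_model`. This is a sketch; `discover_flavors` is a hypothetical name, and the demo uses a stand-in module so it runs without mlflow installed:

```python
import types

def discover_flavors(mlflow_module) -> list:
    """Return attribute names on the given module that expose a
    ``log_model`` callable -- the same check as the REPL one-liner above."""
    return sorted(
        attr for attr in dir(mlflow_module)
        if hasattr(getattr(mlflow_module, attr), "log_model")
    )

# Demo against a stand-in module so the sketch runs without mlflow:
fake_mlflow = types.SimpleNamespace(
    sklearn=types.SimpleNamespace(log_model=lambda *a, **k: None),
    tracking="not a flavor",  # no log_model attribute, so filtered out
)
print(discover_flavors(fake_mlflow))  # ['sklearn']
```

With the real `mlflow` module passed in, this reproduces the flavor list above.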

bryangalindo avatar Sep 19 '23 05:09 bryangalindo

Example of the save/load flow for the sklearn model flavor, adapted from the MLflow quickstart.

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

import mlflow
from mlflow.models import infer_signature

run_id = None

db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)

with mlflow.start_run() as run:
    rf = RandomForestRegressor(n_estimators=100, max_depth=6, max_features=3)
    rf.fit(X_train, y_train)
    save_predictions = rf.predict(X_test)
    signature = infer_signature(X_test, save_predictions)
    mlflow.sklearn.log_model(rf, "model", signature=signature)
    run_id = run.info.run_id

model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
load_predictions = model.predict(X_test)

assert (save_predictions == load_predictions).all()

disclaimer: I have not tested this

bryangalindo avatar Sep 19 '23 05:09 bryangalindo

@bryangalindo we should come up with the Hamilton UX to help guide this. i.e. what's the API we want to expose for Hamilton?

skrawcz avatar Sep 19 '23 06:09 skrawcz

> @bryangalindo we should come up with the Hamilton UX to help guide this. i.e. what's the API we want to expose for Hamilton?

Ok let's chat during our sync. Thanks!

bryangalindo avatar Sep 19 '23 17:09 bryangalindo

High-level tasks:

Analysis:

  • [ ] (1 hour) Create "hello, world!" version of load_model/log_model to understand mlflow (debug, print stmts, etc).
  • [ ] (30 min) Observe directories/files created from log_model (see hamilton/plugins/mlruns/0/0b9e9b23c3ef443ba638d23e4318b58e)
  • [x] (3 hours) Get high-level understanding of Hamilton driver, see hamilton/driver.py.
  • [ ] (3 hours) Get a high-level understanding of model types (e.g., regressors)
  • [ ] (1 hour) Read through files in examples/materialization
  • [x] (15 min) Decide on what reader/writer type makes sense (e.g., MLflowRegressorReader/MLflowRegressorWriter)
  • [ ] (15 min) Decide on the applicable type (e.g., dataframe, classifiers, regressors)
  • [ ] (30 min) Decide what metadata to save from model (see hamilton/plugins/mlruns/0/0b9e9b23c3ef443ba638d23e4318b58e/artifacts/model)
  • [x] (15 min) Discover kwargs for log_model and load_model.

Reader/Writer Development:

  • [ ] (2 hours) Write reader
  • [ ] (2 hours) Write writer
  • [ ] (1 hour) Write get metadata function
  • [ ] (1 hour) Write unit tests for get metadata function
  • [ ] (1 hour) Write unit tests for reader/writer

Materializer Development:

  • [ ] (30 min) Write data loader module (see examples/materialization/data_loaders.py)
  • [ ] (1 hour) Write model_training module (see examples/materialization/model_training.py)
  • [ ] (1 hour) Write run.py module (see examples/materialization/run.py)
  • [ ] (1 hour) Write jupyter notebook example (see examples/materialization/notebook.ipynb)
  • [ ] (5 min) Write requirements.txt

bryangalindo avatar Sep 21 '23 22:09 bryangalindo

Hey @bryangalindo -- a thought on a feature that might be helpful. Here's an outline of what the API should look like -- the data saver/materialization implementation should support this.

from hamilton import driver
from hamilton.function_modifiers import source
from hamilton.io.materialization import to

dr = driver.Driver(...)
dr.materialize(
    to.mlflow(
        id="mlflow_save",
        dependencies=["my_cool_model"],
        model_input=source("training_data"),
        model_output=source("predictions"),
    )
)

Then the materializer would call infer_signature with the model_input and model_output -- these would be taken from nodes called training_data and predictions. The DAG would look like:

training_data -> mlflow_save
predictions -> mlflow_save

and possibly more connections. Does this make sense? This is all supported already, btw -- materializers can take in source/value-type parameters, and any parameter that isn't a source or value will just resolve to a literal value.
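To make the proposal concrete, here is a rough sketch of the saver that could back `to.mlflow(...)`. The class name, fields, and returned metadata are all assumptions, not a spec; the real implementation would subclass `DataSaver` from `hamilton.io.data_adapters`, but a local stand-in is used here so the sketch runs without Hamilton or MLflow installed:

```python
from dataclasses import dataclass
from typing import Any, Dict

# Stand-in for hamilton.io.data_adapters.DataSaver so this sketch is
# self-contained; the real materializer would subclass Hamilton's class.
class DataSaver:
    def save_data(self, data: Any) -> Dict[str, Any]:
        raise NotImplementedError

@dataclass
class MLflowModelSaver(DataSaver):
    """Hypothetical saver backing ``to.mlflow(...)``.

    ``model_input`` / ``model_output`` would arrive already resolved by
    Hamilton from ``source("training_data")`` / ``source("predictions")``.
    """
    id: str = "mlflow_save"
    flavor: str = "sklearn"        # selects mlflow.<flavor>.log_model
    artifact_path: str = "model"
    model_input: Any = None
    model_output: Any = None

    def save_data(self, model: Any) -> Dict[str, Any]:
        import mlflow  # deferred import so the sketch loads without mlflow
        from mlflow.models import infer_signature

        signature = infer_signature(self.model_input, self.model_output)
        flavor_module = getattr(mlflow, self.flavor)
        with mlflow.start_run() as run:
            flavor_module.log_model(model, self.artifact_path, signature=signature)
        # Metadata returned here would surface in materialization results.
        return {"run_id": run.info.run_id, "artifact_path": self.artifact_path}
```

The `save_data` return value is where the "what metadata to save" decision from the task list above would land.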

elijahbenizzy avatar Sep 28 '23 18:09 elijahbenizzy

Closed in favor of #945

zilto avatar Jun 12 '24 23:06 zilto